python - Optimizing an algorithm for finding the possible amino acid sequences with total mass m

Published: 2020-12-16 21:33:19 | Category: Python | Source: collected from the web

See the English answers > Finding Combinations to the provided Sum value (2 answers)
This is a homework assignment. I have solved the problem, but I would like to find a faster solution.

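The problem is as follows: I need to figure out how many possible amino acid (aa) sequences exist with a total mass m.
I have a table of amino acids (single-letter strings) and their corresponding masses (int), which I put in a dictionary.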
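The question never shows `massTable` or `peptideMass`, so the following is only a guess at their shape: a dictionary from one-letter codes to integer masses, and a helper that sums them. The values are the commonly tabulated integer residue masses for the 18 letters that appear in the question's code; the asker's actual table may differ.

```python
# Hypothetical reconstruction: the question does not show its table or helper.
# Assumed integer residue masses (in daltons) for the 18 letters used in the question.
massTable = {
    'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99, 'T': 101,
    'C': 103, 'I': 113, 'N': 114, 'D': 115, 'K': 128, 'E': 129,
    'M': 131, 'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186,
}

def peptideMass(peptide, table):
    """Return the total mass of a peptide given as a string of one-letter codes."""
    return sum(table[aa] for aa in peptide)

print(peptideMass('NR', massTable))  # 114 + 156
```

With a helper like this, every call rescans the whole string, which is one reason the brute-force search slows down as the peptides get longer.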

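My initial solution was to create all possible combinations of aa and compare each combination's total mass to the mass m. This works for small values of m, but once m reaches the hundreds the number of combinations becomes very large.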
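To get a feel for that growth, here is a rough sketch that counts how many sequences an unpruned brute-force search would have to look at. It assumes 18 residue types and that the longest possible sequence has `m // 57` residues (57 being the assumed integer mass of glycine, the lightest residue); neither number is stated in the question.

```python
def brute_force_candidates(alphabet_size, max_length):
    """Number of sequences of length 1..max_length over the alphabet."""
    return sum(alphabet_size ** k for k in range(1, max_length + 1))

LIGHTEST = 57  # assumed integer mass of glycine, the lightest residue

for m in (270, 500, 1000):
    max_length = m // LIGHTEST  # no sequence of mass m can be longer than this
    print(m, max_length, brute_force_candidates(18, max_length))
```

The count grows by a factor of 18 for every additional residue, so pruning partial sequences that are already too heavy helps but does not change the exponential trend.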

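I made some minor optimizations and got it to run reasonably fast for m < 500, which is good enough for this problem, but I would like to know how to make it work for higher masses. This is what I have so far: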

totalmass = m  # target mass; m, maxLength, massTable and peptideMass() are defined elsewhere

def pepList():
    tempList = ['']      # peptides built so far whose mass is still below the target
    temp2List = []       # survivors of the current round
    length = 0
    total = 0
    aminoList = 'GASPVTCINDKEMHFRYW'  # these are all the amino acids

    while length < maxLength:
        for i in tempList:
            for j in aminoList:
                pepMass = peptideMass(i+j, massTable)  # find the mass of this peptide
                if pepMass == totalmass:
                    total += 1
                elif pepMass < totalmass:
                    temp2List.append(i+j)

        tempList = temp2List  # the next round extends only the survivors
        temp2List = []
        length = length + 1

    print(total)

pepList()

I can get the solution for m = 300 in about one second, but m = 500 takes about 40 seconds.

I tried an alternative using itertools, but it wasn't any faster:

import itertools

# aminoList, maxLength, massTable, peptideMass() and totalmass are as in the first snippet
total = 0
pepList = []

for i in range(maxLength+1):
    # order matters for the total number of peptides, but not for
    # calculating the total mass
    for p in itertools.combinations_with_replacement(aminoList, i):
        amino = ''.join(p)
        if peptideMass(amino, massTable) == totalmass:
            pepList.append(amino)

print(len(pepList))

newpepList = []

for i in pepList:
    # I use permutations here to get the total number, because order matters
    for p in itertools.permutations(i, r=len(i)):
        if p not in newpepList:
            newpepList.append(p)
            total += 1

print(total)

Sample input:
m = 270
Output:
22

Solution

The order in which the amino acids occur does not change the mass: AAC weighs the same as ACA and CAA.

Therefore this can be reduced to a linear programming problem: find the values of the coefficients such that M = a*A + b*C + c*D + d*E + e*G + … + r*W
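As a small illustration of that weighted sum, the sketch below reduces a peptide to its per-residue counts and computes the mass from the counts alone. The two weights are copied from the `amino_weight` table in the answer's code; `mass_from_counts` is a name made up for this example.

```python
from collections import Counter

# Subset of the answer's amino_weight table, enough for this illustration.
amino_weight = {'A': 71.038, 'C': 103.009}

def mass_from_counts(counts):
    """M = sum of (count * weight) over the amino acids present."""
    return sum(n * amino_weight[aa] for aa, n in counts.items())

for peptide in ('AAC', 'ACA', 'CAA'):
    counts = Counter(peptide)  # two A and one C, whatever the order
    print(peptide, dict(counts), round(mass_from_counts(counts), 3))
```

All three orderings collapse to the same counts, so the search only has to visit each set of counts once.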

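Once you have a solution, you can generate all possible permutations of the given set of amino acids, or, if you only need the number of permutations, you can calculate it directly.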
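Calculating the number directly means using the multinomial coefficient: a set of counts k1, k2, ... with n = k1 + k2 + ... has n! / (k1! * k2! * ...) distinct orderings. A minimal sketch, with `count_orderings` as an illustrative name:

```python
from math import factorial

def count_orderings(counts):
    """Distinct orderings of a multiset: n! / (k1! * k2! * ...)."""
    total = factorial(sum(counts.values()))
    for k in counts.values():
        total //= factorial(k)  # exact: every partial quotient is an integer
    return total

print(count_orderings({'A': 2, 'C': 1}))  # AAC, ACA, CAA
```

Integer division keeps the result exact even for long peptides, where floating-point division would lose precision.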

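Edit:

As @Hooked pointed out, this is not linear programming, for two reasons: first, we require integer coefficients, and second, we are looking for all combinations rather than a single optimal solution.

I have worked out a recursive generator, as follows: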

from math import floor, ceil
import profile

amino_weight = {
    'A':  71.038, 'C': 103.009, 'D': 115.027, 'E': 129.043,
    'F': 147.068, 'G':  57.021, 'H': 137.059, 'I': 113.084,
    'K': 128.095, 'L': 113.084,   # you omitted leucine?
    'M': 131.040, 'N': 114.043, 'P':  97.053,
    'Q': 128.059,                 # you omitted glutamine?
    'R': 156.101, 'S':  87.032, 'T': 101.048, 'V':  99.068,
    'W': 186.079, 'Y': 163.063
}

def get_float(prompt):
    # keep asking until the user enters a valid number
    while True:
        try:
            return float(input(prompt))
        except ValueError:
            pass

# This is where the fun happens!
def get_mass_combos(aminos, pos, lo, hi, cutoff):
    # aminos is sorted by weight; pos counts down from the heaviest amino.
    # lo..hi is the mass range still to be filled; cutoff is hi minus the
    # lightest weight, so cutoff < 0 means nothing more can fit.
    this = aminos[pos]         # use a pointer into the string, to avoid copying 8 million partial strings around
    wt = amino_weight[this]
    kmax = int(floor(hi / wt))
    npos = pos - 1
    if pos > 0:                # more aminos to consider recursively
        for k in range(0, kmax + 1):
            mass    = k * wt
            nlo     = lo - mass
            nhi     = hi - mass
            ncutoff = cutoff - mass
            if nlo <= 0. and nhi >= 0.:
                # we found a winner!
                yield {this: k}
            elif ncutoff < 0.:
                # no further solution is possible
                break
            else:
                # recurse
                for cc in get_mass_combos(aminos, npos, nlo, nhi, ncutoff):
                    if k > 0: cc[this] = k
                    yield cc
    else:                      # last amino - it's this or nothing
        kmin = int(ceil(lo / wt))
        for k in range(kmin, kmax + 1):
            yield {this: k}

def to_string(combo):
    keys = sorted(combo)
    return ''.join(k*combo[k] for k in keys)

def total_mass(combo):
    return sum(amino_weight[a]*n for a, n in combo.items())

def fact(n):
    num = 1
    for i in range(2, n + 1):
        num *= i
    return num

def permutations(combo):
    # multinomial coefficient: (sum of counts)! / product of (count!)
    num = 0
    div = 1
    for v in combo.values():
        num += v
        div *= fact(v)
    return fact(num) // div    # integer division keeps the count exact

def find_combos(lo, hi):
    total = 0
    bases = []
    aminos = ''.join(sorted(amino_weight, key=lambda x: amino_weight[x]))
    for combo in get_mass_combos(aminos, len(aminos)-1, lo, hi, hi - amino_weight[aminos[0]]):
        base = to_string(combo)
        bases.append(base)
        mass = total_mass(combo)
        cc = permutations(combo)
        total += cc
        print("{} (mass {}, {} permutations)".format(base, mass, cc))
    print('Total: {} bases, {} permutations'.format(len(bases), total))

def main():
    lo = get_float('Bottom of target mass range? ')
    hi = get_float('Top of target mass range? ')

    prof = profile.Profile()
    prof.run('find_combos({},{})'.format(lo, hi))
    prof.print_stats()

if __name__ == "__main__":
    main()

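It also uses floating-point amino acid masses and searches for a range of masses. On my machine (i5-870), searching for masses between 748.0 and 752.0 returns 7,505 bases, for a total of 9,400,528 permutations, in 3.82 seconds.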
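When the masses are integers, as in the question, and only the number of sequences is needed, a dynamic-programming count avoids enumerating anything at all. This is a different technique from the generator above, sketched here under the assumption that the question's table holds the commonly tabulated integer residue masses for its 18 letters (the question does not show the table). The idea: every sequence of mass t ends in some residue of mass w, so the number of sequences of mass t is the sum of the counts for t - w over all residue masses w.

```python
# Assumed integer residue masses for the question's 18 letters
# (G A S P V T C I N D K E M H F R Y W); not shown in the question itself.
MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 114,
          115, 128, 129, 131, 137, 147, 156, 163, 186]

def count_sequences(m, masses=MASSES):
    """Count ordered amino acid sequences whose integer masses sum to m."""
    ways = [0] * (m + 1)
    ways[0] = 1  # the empty sequence
    for t in range(1, m + 1):
        # every sequence of mass t ends in some residue of mass w <= t
        ways[t] = sum(ways[t - w] for w in masses if w <= t)
    return ways[m]

print(count_sequences(270))  # 22, matching the sample output in the question
```

The running time is proportional to m times the number of residue masses, so masses in the thousands are handled quickly. It only works for exact integer targets, so it is not a replacement for the floating-point range search above.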

(Editor: 李大同)
