查看原文
其他

s07 - Finding a Shared Motif

Y叔叔 biobabble 2020-02-05

Problem

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”.

Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.

Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

Sample Dataset

>Rosalind_1 GATTACA >Rosalind_2 TAGACCA >Rosalind_3 ATACA
Sample Output
AC

Solution

这道题要求找出最长的共有子序列,解题思路从最长开始找,一旦找到就是答案。而对于每一个长度,先从第一条序列里生成一个kmer集合,然后在其它的序列中寻找。

#!/usr/bin/env python3

def readFASTA(file):    f = open(file)    lines = f.readlines()    des = list()    seqs = list()    first_line = True    for i in range(len(lines)):        line = lines[i].rstrip()
       if line[0] == '>' and first_line == True:            des.append(line[1:])            first_line = False            seq = ''        elif line[0] == '>' and first_line == False:            des.append(line[1:])            seqs.append(seq)            seq = ''        else:            seq += line    seqs.append(seq)    
   return des, seqs
   
def lcs(strs):    maxK = min([len(x) for x in strs])    kmer = set()    seq = strs[0]    strs = strs[1:]    
   for k in reversed(range(maxK)):
       for i in range(maxK+1-k):            kmer.add(seq[i:i+k])        
       for kk in kmer:            found = True            for x in strs:
               if x.find(kk) == -1:                    found = False                    break            if found == True:
               return(kk)        kmer = set()

if __name__ == "__main__":  description, sequence = readFASTA("DATA/rosalind_lcsm.txt")  print(lcs(sequence))

电梯

    您可能也对以下帖子感兴趣

    文章有问题?点此查看未经处理的缓存