s07 - Finding a Shared Motif | 自由微信

s07 - Finding a Shared Motif

Original Y叔叔 biobabble 2020-02-05

Problem

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”.

Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.

Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

Sample Dataset

>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA

Sample Output

AC

Solution

这道题要求找出最长的共有子序列，解题思路从最长开始找，一旦找到就是答案。而对于每一个长度，先从第一条序列里生成一个kmer集合，然后在其它的序列中寻找。

#!/usr/bin/env python3

def readFASTA(file):
    f = open(file)
    lines = f.readlines()
    des = list()
    seqs = list()
    first_line = True
    for i in range(len(lines)):
        line = lines[i].rstrip()
        if line[0] == '>' and first_line == True:
            des.append(line[1:])
            first_line = False
            seq = ''
        elif line[0] == '>' and first_line == False:
            des.append(line[1:])
            seqs.append(seq)
            seq = ''
        else:
            seq += line
    seqs.append(seq)    
    return des, seqs
    
def lcs(strs):
    maxK = min([len(x) for x in strs])
    kmer = set()
    seq = strs[0]
    strs = strs[1:]    
    for k in reversed(range(maxK)):
        for i in range(maxK+1-k):
            kmer.add(seq[i:i+k])        
        for kk in kmer:
            found = True
            for x in strs:
                if x.find(kk) == -1:
                    found = False
                    break
            if found == True:
                return(kk)
        kmer = set()

if __name__ == "__main__":
  description, sequence = readFASTA("DATA/rosalind_lcsm.txt")
  print(lcs(sequence))

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

“家属和记者取得联系”：记者的退场意味深长

圈内疯传某谣言

不要放过这些人渣

“被指居者”之死：嫌犯身体遭长时间束缚，警方称指居使用械具是惯例