查看原文
其他

s05 - Finding a Motif in DNA

Y叔 biobabble 2020-02-03

Problem

Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of ‘U’ in “AUGCUUCAGAAAGGUCUUACG” are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = “AUGCUUCAGAAAGGUCUUACG”, then s[2:5] = “UGCU”.

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).

Given: Two DNA strings s and t (each of length at most 1 kbp).

Return: All locations of t as a substring of s.

Sample Dataset

GATATATGCATATACTT ATAT

Sample Output

2 4 10

Solution

这个问题,俗称大海捞针,这里不管是C语言版本还是python,我都用了sliding window,这个效率不高,但目前这问题的复杂度还不至于要上高效的字符串搜索算法。

C语言版本

#include<stdio.h>
#include<string.h>
#include<stdlib.h>

#define SZ 1000

int * get_substring_index(char *t, char *s);
int is_string_equal(char *t, char *s, int string_len);

int main() {  FILE *INFILE;  INFILE = fopen("DATA/rosalind_subs.txt", "r");
 char s[SZ], t[SZ];
 fscanf(INFILE, "%s\n%s", s, t);  
 int *pos;  pos = get_substring_index(t, s);  
 int i=0;  
 while(pos[i]) {  
   printf("%d\t", pos[i]);    i++;  }  
 printf("\n");  
 return 0; }

int
* get_substring_index(char *t, char *s) {  
 // t is a substring of s  int slen = strlen(s);  
 int tlen = strlen(t);  
 if (slen < tlen)
   return NULL;
 int *position = NULL;  position = (int *) malloc(slen * sizeof(int));
 int i;  int j = 0;  
 for (i=0; i<=slen-tlen; i++) {  
   if ( is_string_equal(t, s+i, tlen) ) {      position[j++] = i + 1;    }  }  position[j] = '\0';  
 return position; }

int
is_string_equal(char *t, char *s, int string_len) {
 if (strlen(s) < string_len || strlen(t) < string_len) {
     return 0;  }
 int res = strncmp(t, s, string_len);
   if (res != 0)    
      return 0;  
   return 1; }

python版本

#!/usr/bin/env python3 f=open("DATA/rosalind_subs.txt", 'r') seq =  f.readlines() s = seq[0].strip() t = seq[1].strip() p = [] for i in range(len(s) - len(t)):    if s[i:i+len(t)] == t:        p.append(i+1) for x in p:    print(x, end=" ") print()

    您可能也对以下帖子感兴趣

    文章有问题?点此查看未经处理的缓存