s05 - Finding a Motif in DNA | 自由微信

s05 - Finding a Motif in DNA

Original Y叔 biobabble 2020-02-03

Problem

Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of ‘U’ in “AUGCUUCAGAAAGGUCUUACG” are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = “AUGCUUCAGAAAGGUCUUACG”, then s[2:5] = “UGCU”.

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).

Given: Two DNA strings s and t (each of length at most 1 kbp).

Return: All locations of t as a substring of s.

Sample Dataset

GATATATGCATATACTT
ATAT

Sample Output

2 4 10

Solution

这个问题，俗称大海捞针，这里不管是C语言版本还是python，我都用了sliding window，这个效率不高，但目前这问题的复杂度还不至于要上高效的字符串搜索算法。

C语言版本

#include<stdio.h>
#include<string.h>
#include<stdlib.h>

#define SZ 1000

int * get_substring_index(char *t, char *s);
int is_string_equal(char *t, char *s, int string_len);

int main() {
  FILE *INFILE;
  INFILE = fopen("DATA/rosalind_subs.txt", "r");
  char s[SZ], t[SZ];
  fscanf(INFILE, "%s\n%s", s, t);  
  int *pos;
  pos = get_substring_index(t, s);  
  int i=0;  
  while(pos[i]) {  
    printf("%d\t", pos[i]);
    i++;
  }  
  printf("\n");  
  return 0;
}

int * get_substring_index(char *t, char *s) {  
  // t is a substring of s
  int slen = strlen(s);  
  int tlen = strlen(t);  
  if (slen < tlen)
    return NULL;
  int *position = NULL;
  position = (int *) malloc(slen * sizeof(int));
  int i;  int j = 0;  
  for (i=0; i<=slen-tlen; i++) {  
    if ( is_string_equal(t, s+i, tlen) ) {
      position[j++] = i + 1;
    }
  }
  position[j] = '\0';  
  return position;
}

int is_string_equal(char *t, char *s, int string_len) {
  if (strlen(s) < string_len || strlen(t) < string_len) {
      return 0;
  }
  int res = strncmp(t, s, string_len);
    if (res != 0)    
       return 0;  
    return 1;
}

python版本

#!/usr/bin/env python3

f=open("DATA/rosalind_subs.txt", 'r')

seq =  f.readlines()

s = seq[0].strip()
t = seq[1].strip()

p = []

for i in range(len(s) - len(t)):
    if s[i:i+len(t)] == t:
        p.append(i+1)

for x in p:
    print(x, end=" ")

print()

这样的洞庭湖决堤，实在让人同情不起来

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

抗洪靠嘴，堵漏靠沙？印度官员真是绝了！

有的人走了，却永远活着

圈内疯传某谣言