查看原文
其他

搭建本地blast

2017-08-31 Y叔 biobabble

刚来HKU的时候,专家说要搞本地blast,于是我搭了一个,并做了记录。看了《来香港读博其实是被逼的》这篇文章,很多人问专家是谁,这次的记录主机名称就是专家缩写。

I was asked to set up a local blast for the lab. Blast can be installed directly using apt in debian and it turns out to be easy.

root@jz:/ssd/genomes# apt-get install ncbi-blast+ Reading package lists... Done Building dependency tree Reading state information... Done The following NEW packages will be installed:  ncbi-blast+ 0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded. Need to get 11.2 MB of archives. After this operation, 32.8 MB of additional disk space will be used. Get:1 http://ftp.hk.debian.org/debian/ wheezy/main ncbi-blast+ amd64 2.2.26-3 [11.2 MB] Fetched 11.2 MB in 1min 16s (146 kB/s) Selecting previously unselected package ncbi-blast+. (Reading database ... 252681 files and directories currently installed.) Unpacking ncbi-blast+ (from .../ncbi-blast+_2.2.26-3_amd64.deb) ... Processing triggers for man-db ... Setting up ncbi-blast+ (2.2.26-3) ...

Before the program can be used for sequence alignment, we should prepare the db files:

root@jz:/ssd/genomes/blast/db# makeblastdb -in ../../hg19.fa -out hg19 -dbtype nucl Building a new DB, current time: 11/21/2013 16:03:05 New DB name:   hg19 New DB title:  ../../hg19.fa Sequence type: Nucleotide Keep Linkouts: T Keep MBits: T Maximum file size: 1073741824B Adding sequences from FASTA; added 25 sequences in 27.7084 seconds.

That’s it. Now blast is supported in the lab server. I tested blastn with the AHANAK sequence.

ygc@jz:~$ date; blastn -num_threads 8 -query sequence.fasta -db /ssd/genomes/blast/db/hg19 >> testblast.txt ; date Thu Nov 21 16:19:19 HKT 2013 Thu Nov 21 17:19:45 HKT 2013 ygc@jz:~$ wc -c sequence.fasta 149510 sequence.fasta

Blast search has three distinctive stages: word matching with database scan, ungapped alignment, gapped alignment with traceback. Only word matching stage is multi-threaded. So it tooks 1 hour to align this long gene.
The most significant hit of the blast result indicated that the sequence is located at chr11:62331329-62184015, which is consistent with the information presented in the AHANAK sequence link.
   ygc@jz:~$ cat testblast.txt
   BLASTN 2.2.26+
   Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
   Miller (2000), “A greedy algorithm for aligning DNA sequences”, J
   Comput Biol 2000; 7(1-2):203-14.
   Database: ../../hg19.fa
              25 sequences; 3,095,693,983 total letters
   Query= gi|224589802:c62331329-62184015 Homo sapiens chromosome 11,
   GRCh37.p13 Primary Assembly
   Length=147315
                                                                         Score        E
   Sequences producing significant alignments:                          (Bits)     Value
     chr11                                                               2.720e+05  0.0
     chr14                                                                941       0.0
   …

> chr11 Length=135006516 Score = 2.720e+05 bits (147315),  Expect = 0.0 Identities = 147315/147315 (100%), Gaps = 0/147315 (0%) Strand=Plus/Minus Query  1         CATTTGGAAAGGAACCTGGGACCTAGGGTTAAAGGACAGATCCTCTGCTTTGCTACAGTG  60                 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct  62331329  CATTTGGAAAGGAACCTGGGACCTAGGGTTAAAGGACAGATCCTCTGCTTTGCTACAGTG  62331270 Query  61        TAATCTTGGACAGGTCATTCTACTTGGTCTCATCTGTAAAGAGAGGGGAGGGTCAGGGTA  120                 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct  62331269  TAATCTTGGACAGGTCATTCTACTTGGTCTCATCTGTAAAGAGAGGGGAGGGTCAGGGTA  62331210 ...  Database: ../../hg19.fa    Posted date:  Nov 21, 2013  4:03 PM  Number of letters in database: 3,095,693,983  Number of sequences in database:  25 Matrix: blastn matrix 1 -2 Gap Penalties: Existence: 0, Extension: 2.5

In addition to the species-specific sequence file, we should download the non-redundent gene and protein sequences from
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz and
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz respectively.
nt contains all nucleotide sequences of GenBank, RefSeq Nucleotides, EMBL, DDBJ and PDB, while nr contains all non-redundant peptide sequences from GenBank CDS translations, RefSeq Proteins, PDB, SwissProt, PIR and PRF. After generating the db files, we can blast a sequence to multiple species.

makeblastdb -in ../../nr/nr -out nr -dbtype prot makeblastdb -in ../../nr/nt -out nt -dbtype nucl


您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存