搭建本地blast

原创 2017-08-31 Y叔 biobabble

刚来HKU的时候，专家说要搞本地blast，于是我搭了一个，并做了记录。看了《来香港读博其实是被逼的》这篇文章，很多人问专家是谁，这次的记录主机名称就是专家缩写。

I was asked to set up a local blast for the lab. Blast can be installed directly using apt in debian and it turns out to be easy.

root@jz:/ssd/genomes# apt-get install ncbi-blast+
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  ncbi-blast+
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
Need to get 11.2 MB of archives.
After this operation, 32.8 MB of additional disk space will be used.
Get:1 http://ftp.hk.debian.org/debian/ wheezy/main ncbi-blast+ amd64 2.2.26-3 [11.2 MB]
Fetched 11.2 MB in 1min 16s (146 kB/s)
Selecting previously unselected package ncbi-blast+.
(Reading database ... 252681 files and directories currently installed.)
Unpacking ncbi-blast+ (from .../ncbi-blast+_2.2.26-3_amd64.deb) ...
Processing triggers for man-db ...
Setting up ncbi-blast+ (2.2.26-3) ...

Before the program can be used for sequence alignment, we should prepare the db files:

root@jz:/ssd/genomes/blast/db# makeblastdb -in ../../hg19.fa -out hg19 -dbtype nucl
Building a new DB, current time: 11/21/2013 16:03:05
New DB name:   hg19
New DB title:  ../../hg19.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 25 sequences in 27.7084 seconds.

That’s it. Now blast is supported in the lab server. I tested blastn with the AHANAK sequence.

ygc@jz:~$ date; blastn -num_threads 8 -query sequence.fasta -db /ssd/genomes/blast/db/hg19 >> testblast.txt ; date
Thu Nov 21 16:19:19 HKT 2013
Thu Nov 21 17:19:45 HKT 2013
ygc@jz:~$ wc -c sequence.fasta
149510 sequence.fasta

Blast search has three distinctive stages: word matching with database scan, ungapped alignment, gapped alignment with traceback. Only word matching stage is multi-threaded. So it tooks 1 hour to align this long gene.
The most significant hit of the blast result indicated that the sequence is located at chr11:62331329-62184015, which is consistent with the information presented in the AHANAK sequence link.
ygc@jz:~$ cat testblast.txt
BLASTN 2.2.26+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), “A greedy algorithm for aligning DNA sequences”, J
Comput Biol 2000; 7(1-2):203-14.
Database: ../../hg19.fa
25 sequences; 3,095,693,983 total letters
Query= gi|224589802:c62331329-62184015 Homo sapiens chromosome 11,
GRCh37.p13 Primary Assembly
Length=147315
Score E
Sequences producing significant alignments: (Bits) Value
chr11 2.720e+05 0.0
chr14 941 0.0
…

> chr11
Length=135006516
 Score = 2.720e+05 bits (147315),  Expect = 0.0
 Identities = 147315/147315 (100%), Gaps = 0/147315 (0%)
 Strand=Plus/Minus
Query  1         CATTTGGAAAGGAACCTGGGACCTAGGGTTAAAGGACAGATCCTCTGCTTTGCTACAGTG  60
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  62331329  CATTTGGAAAGGAACCTGGGACCTAGGGTTAAAGGACAGATCCTCTGCTTTGCTACAGTG  62331270
Query  61        TAATCTTGGACAGGTCATTCTACTTGGTCTCATCTGTAAAGAGAGGGGAGGGTCAGGGTA  120
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  62331269  TAATCTTGGACAGGTCATTCTACTTGGTCTCATCTGTAAAGAGAGGGGAGGGTCAGGGTA  62331210
...
  Database: ../../hg19.fa
    Posted date:  Nov 21, 2013  4:03 PM
  Number of letters in database: 3,095,693,983
  Number of sequences in database:  25
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5

In addition to the species-specific sequence file, we should download the non-redundent gene and protein sequences from
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz and
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz respectively.
nt contains all nucleotide sequences of GenBank, RefSeq Nucleotides, EMBL, DDBJ and PDB, while nr contains all non-redundant peptide sequences from GenBank CDS translations, RefSeq Proteins, PDB, SwissProt, PIR and PRF. After generating the db files, we can blast a sequence to multiple species.

makeblastdb -in ../../nr/nr -out nr -dbtype prot
makeblastdb -in ../../nr/nt -out nt -dbtype nucl

“家属和记者取得联系”：记者的退场意味深长

要么空仓！要么盯紧这个！

劲爆！为了姜萍两位女CEO互揭老底！

最高院最新人事任免 | 全国审判业务专家茅仲华履新最高法副院长

建议写作爱好者，去网易写短篇故事，官方扶持可直接签约有道！