diff --git a/cookbook/03-blast.md b/cookbook/03-blast.md new file mode 100644 index 0000000..647a833 --- /dev/null +++ b/cookbook/03-blast.md @@ -0,0 +1,223 @@ ++++ +using Dates +date = Date("2026-03-15") +title = "BLAST" +rss_descr = "Using NCBIBlast.jl to run BLAST searches" ++++ + +# Introduction to BLAST +A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences. +It can quickly compare unknown sequences to databases of established reference sequences for purposes such as species identity or assignment gene function. + +More information about how to use BLAST can be found in its [manual](https://www.ncbi.nlm.nih.gov/books/NBK569856/). + +BLAST searches can be run from the command line interface (CLI) or through the BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi). +A user can simply copy in a protein or nucleotide sequence and search against NCBI to find the best match! +While searching from the website is fast and straightforward, +running searches from the website only performs searches against the NCBI databases. +The CLI allows users to query both NCBI databases and custom databases. + +`NCBIBlast.jl` is a thin wrapper around the BLAST command line tool, +allowing users to run the tool within Julia. +The following BLAST tools are supported by `NCBIBlast`: +- `blastn` +- `blastp` +- `tblastn` +- `blastx` +- `makeblastdb` + +Note: [BioTools BLAST](https://biojulia.dev/BioTools.jl/stable/blast/) is a deprecated Julia package for running BLAST searches and is different from `NCBIBLAST`. + + + + +# How `NCBIBlast.jl` works + + +The keywords used in the tool are sent to the shell for running BLAST. + +As stated on the GitHub [docs](https://github.com/BioJulia/NCBIBlast.jl), the Julia call + +``` +blastn(; query = "a_file.txt", db="mydb", out="results.txt") +``` + +is sent to the shell as + +``` +$ blastn -query a_file.txt -db mydb -out results.txt +``` + +# BLAST databases +Just like running a BLAST search from the CLI, `NCBIBlast.jl` requires a BLAST database to search against. +The user can build a local, custom database, +or search against a specific NCBI database. +A custom BLAST database is constructed from FASTA files that serve as reference sequences. +A database can be built using the following command in `NCBIBlast.jl`: +``` +makeblastdb(; in="test/example_files/dna2.fasta", dbtype="nucl") +``` + +More directions on building a BLAST database locally can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK569841/). + + +## Example: Building a local BLAST database and running the BLAST search + +For our first example, we will replicate the example on the `NCBIBlast.jl` GitHub repository. + +First, we will build a local database using a FASTA file found in the NCBIBlast github repository ([link here](https://github.com/BioJulia/NCBIBlast.jl/blob/main/test/example_files/dna2.fasta)). +This file has been downloaded into `assets` as `dna2.fasta`. + +``` +makeblastdb(; in="assets/dna2.fasta", dbtype="nucl") + +Building a new DB, current time: 03/16/2026 21:04:36 +New DB name: /LOCAL/PATH/BioTutorials/cookbook/assets/dna2.fasta +New DB title: assets/dna2.fasta +Sequence type: Nucleotide +Keep MBits: T +Maximum file size: 3000000000B +Adding sequences from FASTA; added 2 sequences in 0.0012269 seconds. + + +Process(`/Users/USER/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0)) +``` +A new database was built in `assets`. + +Now, we can define our query sequence. +We can save the query string in memory (using `IOBuffer`) rather than reading in a FASTA file. + +``` +buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG") +``` + +Now, we can run the BLAST search. +The BLAST output format "6" means that the output will be tab-delimited text with no column names. + +The BLAST output will be written into I/O. +``` +io = IOBuffer(); +blastn(buf; stdout=io, db="assets/dna2.fasta", outfmt="6"); +seek(io, 0); +``` +The command `seek(io,0)` moves the cursor to the start of the captured object (index 0) +so it can be read into a dataframe via the [`DataFrames.jl`](https://dataframes.juliadata.org/stable/) package. + +``` +using CSV, DataFrames +CSV.read(io, DataFrame; header=false) + +1×12 DataFrame + Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12 + │ String7 String7 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Float64 +─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────── + 1 │ Query_1 Test1 100.0 38 0 0 1 38 82 119 5.64e-18 71.3 +``` + +### Interpreting BLAST Output +This output tells us that the query sequence +(`Query_1` is the default name for the sequence because we didn't specify a name) +matches `Test1` in the reference database. +There is 100% identity between the query and a region on `Test1` that is 38 nucleotides long. +There are 0 mismatches or gap openings. +The match starts at index 1 on the query sequence, and ends at index 82. +This region matches a region in the `Test1` sequence spanning from index 82 to 119. +The E-value is `5.64e-18`, meaning that it is extremely unlikely that this match occurred simply due to chance. + +Here is a description of the E-value from the NCBI [website](https://blast.ncbi.nlm.nih.gov/doc/blast-help/FAQ.html): +> The Expect value (E) is a parameter that describes the number of hits one can “expect” to see +> by chance when searching a database of a particular size. +> It decreases exponentially as the Score (S) of the match increases. +> The lower the E-value the more “significant” the match is. +> However keep in mind that virtually identical short alignments have relatively high E values. +> This is because the calculation of the E value takes into account the length of the query sequence. +> These high E values make sense because shorter sequences +> have a higher probability of occurring in the database purely by chance. + + + +The bitscore is 71.3. + +Here is a definition of bitscore from the NCBI BLAST [glossary](https://www.ncbi.nlm.nih.gov/books/NBK62051/): +> The bit score, S', +> is derived from the raw alignment score, S, +> taking the statistical properties of the scoring system into account. +> Because bit scores are normalized with respect to the scoring system, +> they can be used to compare alignment scores from different searches. + + +## Example: BLASTing the _mecA1_ gene against all of NCBI +Now that we've tried BLAST'ing against a local, custom database, +let's try BLAST'ing a piece of the _mecA_ gene against NCBI. +To create the query file `mecA_BLAST.fasta`, +I randomly selected 140 nucleotides from `mecA.fasta`. + +We should see that the query FASTA is a direct hit to the _mecA_ gene +(one of the NCBI hits should definitely be the NCBI sample `NG_047945.1`, +which is the sample the gene fragment was extracted from). + +For this BLAST search, I will search against the `core_nt` database, +which is a faster, smaller, and more focused subset of the traditional `nt` (nucleotide) database. +This newer database is the default as of August 2024. +It seeks to reduce redundancy and storage requirements when downloading. +More information about it can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/). + +General information about the different kinds of BLAST databases is also available [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html). + +``` +io = IOBuffer(); +blastn("assets/mecA_BLAST.fasta"; db="core_nt", stdout=io, remote=true, outfmt="6 query subject expect") +seek(io, 0); +CSV.read(io, DataFrame; header=false) +``` + +Here is the output of the BLAST. + +``` +julia> CSV.read(io, DataFrame; header=false) +500×12 DataFrame + Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12 + │ String15 String15 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Int64 +─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── + 1 │ mecA_BLAST CP026646.1 100.0 140 0 0 1 140 46719 46580 7.12e-65 259 + 2 │ mecA_BLAST CP154497.1 100.0 140 0 0 1 140 45702 45563 7.12e-65 259 + 3 │ mecA_BLAST CP049499.1 100.0 140 0 0 1 140 61014 60875 7.12e-65 259 + 4 │ mecA_BLAST CP030403.1 100.0 140 0 0 1 140 48633 48494 7.12e-65 259 + 5 │ mecA_BLAST CP132494.1 100.0 140 0 0 1 140 1867742 1867603 7.12e-65 259 + 6 │ mecA_BLAST MH798864.1 100.0 140 0 0 1 140 281 420 7.12e-65 259 + 7 │ mecA_BLAST CP162442.1 100.0 140 0 0 1 140 503005 503144 7.12e-65 259 + 8 │ mecA_BLAST OR082611.1 100.0 140 0 0 1 140 6415 6276 7.12e-65 259 + 9 │ mecA_BLAST CP053183.1 100.0 140 0 0 1 140 41607 41468 7.12e-65 259 + 10 │ mecA_BLAST CP085310.1 100.0 140 0 0 1 140 1610215 1610076 7.12e-65 259 + 11 │ mecA_BLAST CP162465.1 100.0 140 0 0 1 140 1140196 1140057 7.12e-65 259 + 12 │ mecA_BLAST CP133660.1 100.0 140 0 0 1 140 2821019 2821158 7.12e-65 259 + 13 │ mecA_BLAST CP049476.1 100.0 140 0 0 1 140 40314 40175 7.12e-65 259 + ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ + 489 │ mecA_BLAST CP093527.1 100.0 140 0 0 1 140 42814 42675 7.12e-65 259 + 490 │ mecA_BLAST CP035541.1 100.0 140 0 0 1 140 72431 72292 7.12e-65 259 + 491 │ mecA_BLAST CP145216.2 100.0 140 0 0 1 140 45830 45691 7.12e-65 259 + 492 │ mecA_BLAST CP193734.1 100.0 140 0 0 1 140 64595 64456 7.12e-65 259 + 493 │ mecA_BLAST CP083210.2 100.0 140 0 0 1 140 43331 43192 7.12e-65 259 + 494 │ mecA_BLAST MW052033.1 100.0 140 0 0 1 140 245 384 7.12e-65 259 + 495 │ mecA_BLAST CP168087.1 100.0 140 0 0 1 140 40806 40667 7.12e-65 259 + 496 │ mecA_BLAST CP150769.1 100.0 140 0 0 1 140 2583517 2583378 7.12e-65 259 + 497 │ mecA_BLAST CP030596.1 100.0 140 0 0 1 140 40848 40709 7.12e-65 259 + 498 │ mecA_BLAST MZ398128.1 100.0 140 0 0 1 140 16977 17116 7.12e-65 259 + 499 │ mecA_BLAST CP083199.2 100.0 140 0 0 1 140 40078 39939 7.12e-65 259 + 500 │ mecA_BLAST CP162663.1 100.0 140 0 0 1 140 938360 938499 7.12e-65 259 + 475 rows omitted +``` +There are 500 hits with the same E-value and Bitscore. +This likely means that this sequence is an exact match to these 500 sequences in NCBI. +Because of this, the first row in the results is not necessarily a better match than the 500th, +even though it appears first. + +To verify the first hit, we can look up the GenBankID of the first row: `CP026646.1`. +The NCBI [page](https://www.ncbi.nlm.nih.gov/nuccore/CP026646.1/) for this sample confirms that this sample was phenotyped as _S. aureus_. +Our query matches indices 46719 to 46580 on this reference genome. +When we use the Graphics feature to visualize gene annotations on the reference genome, +we see that there is a clear match to _mecA_ in the region that matches the query. + +![BLAST Graphics](assets/mecA_BLAST.png) + +Overall, this confirms that our BLAST worked as corrected! \ No newline at end of file diff --git a/cookbook/assets/mecA_BLAST.fasta b/cookbook/assets/mecA_BLAST.fasta new file mode 100644 index 0000000..1a52e16 --- /dev/null +++ b/cookbook/assets/mecA_BLAST.fasta @@ -0,0 +1,3 @@ +>mecA_BLAST file that is used for BLAST tutorial +GAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGT +TAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAA diff --git a/cookbook/assets/mecA_BLAST.png b/cookbook/assets/mecA_BLAST.png new file mode 100644 index 0000000..a8a7a7c Binary files /dev/null and b/cookbook/assets/mecA_BLAST.png differ