BioJulia · danielle-pinto · Mar 21, 2026 · Mar 21, 2026 · Mar 22, 2026
diff --git a/cookbook/03-blast.md b/cookbook/03-blast.md
@@ -0,0 +1,223 @@
++++
+using Dates
+date = Date("2026-03-15")
+title = "BLAST"
+rss_descr = "Using NCBIBlast.jl to run BLAST searches"
++++
+
+# Introduction to BLAST
+A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences.   
+It can quickly compare unknown sequences to databases of established reference sequences for purposes such as species identity or assignment gene function.   
+
+More information about how to use BLAST can be found in its [manual](https://www.ncbi.nlm.nih.gov/books/NBK569856/).  
+
+BLAST searches can be run from the command line interface (CLI) or through the BLAST web page [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi).  
+A user can simply copy in a protein or nucleotide sequence and search against NCBI to find the best match!     
+While searching from the website is fast and straightforward,   
+running searches from the website only performs searches against the NCBI databases.    
+The CLI allows users to query both NCBI databases and custom databases.    
+
+`NCBIBlast.jl` is a thin wrapper around the BLAST command line tool, 
+allowing users to run the tool within Julia.    
+The following BLAST tools are supported by `NCBIBlast`:
+- `blastn`
+- `blastp`
+- `tblastn`
+- `blastx`
+- `makeblastdb`
+
+Note: [BioTools BLAST](https://biojulia.dev/BioTools.jl/stable/blast/) is a deprecated Julia package for running BLAST searches and is different from `NCBIBLAST`.  
+
+
+
+
+# How `NCBIBlast.jl` works
+
+
+The keywords used in the tool are sent to the shell for running BLAST.  
+
+As stated on the GitHub [docs](https://github.com/BioJulia/NCBIBlast.jl), the Julia call
+
+```
+blastn(; query = "a_file.txt", db="mydb", out="results.txt")
+```
+
+is sent to the shell as
+
+```
+$ blastn -query a_file.txt -db mydb -out results.txt
+```
+
+# BLAST databases
+Just like running a BLAST search from the CLI, `NCBIBlast.jl` requires a BLAST database to search against.  
+The user can build a local, custom database,  
+or search against a specific NCBI database.       
+A custom BLAST database is constructed from FASTA files that serve as reference sequences.   
+A database can be built using the following command in `NCBIBlast.jl`:  
+```
+makeblastdb(; in="test/example_files/dna2.fasta", dbtype="nucl")
+```
+
+More directions on building a BLAST database locally can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK569841/).  
+
+
+## Example: Building a local BLAST database and running the BLAST search
+
+For our first example, we will replicate the example on the `NCBIBlast.jl` GitHub repository.    
+
+First, we will build a local database using a FASTA file found in the NCBIBlast github repository ([link here](https://github.com/BioJulia/NCBIBlast.jl/blob/main/test/example_files/dna2.fasta)).  
+This file has been downloaded into `assets` as `dna2.fasta`.    
+
+```
+makeblastdb(; in="assets/dna2.fasta", dbtype="nucl")
+
+Building a new DB, current time: 03/16/2026 21:04:36
+New DB name:   /LOCAL/PATH/BioTutorials/cookbook/assets/dna2.fasta
+New DB title:  assets/dna2.fasta
+Sequence type: Nucleotide
+Keep MBits: T
+Maximum file size: 3000000000B
+Adding sequences from FASTA; added 2 sequences in 0.0012269 seconds.
+
+
+Process(`/Users/USER/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0))
+```
+A new database was built in `assets`.
+
+Now, we can define our query sequence.  
+We can save the query string in memory (using `IOBuffer`) rather than reading in a FASTA file. 
+
+```
+buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG")
+```
+
+Now, we can run the BLAST search.  
+The BLAST output format "6" means that the output will be tab-delimited text with no column names.  
+
+The BLAST output will be written into I/O.  
+```
+io = IOBuffer();
+blastn(buf; stdout=io, db="assets/dna2.fasta", outfmt="6");
+seek(io, 0);
+```
+The command `seek(io,0)` moves the cursor to the start of the captured object (index 0)   
+so it can be read into a dataframe via the [`DataFrames.jl`](https://dataframes.juliadata.org/stable/) package.  
+
+```
+using CSV, DataFrames
+CSV.read(io, DataFrame; header=false)
+
+1×12 DataFrame
+ Row │ Column1  Column2  Column3  Column4  Column5  Column6  Column7  Column8  Column9  Column10  Column11  Column12 
+     │ String7  String7  Float64  Int64    Int64    Int64    Int64    Int64    Int64    Int64     Float64   Float64  
+─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
+   1 │ Query_1  Test1      100.0       38        0        0        1       38       82       119  5.64e-18      71.3
+```
+
+### Interpreting BLAST Output
+This output tells us that the query sequence      
+(`Query_1` is the default name for the sequence because we didn't specify a name)  
+matches `Test1` in the reference database.   
+There is 100% identity between the query and a region on `Test1` that is 38 nucleotides long.   
+There are 0 mismatches or gap openings.  
+The match starts at index 1 on the query sequence, and ends at index 82.
+This region matches a region in the `Test1` sequence spanning from index 82 to 119.  
+The E-value is `5.64e-18`, meaning that it is extremely unlikely that this match occurred simply due to chance.  
+
+Here is a description of the E-value from the NCBI [website](https://blast.ncbi.nlm.nih.gov/doc/blast-help/FAQ.html):
+> The Expect value (E) is a parameter that describes the number of hits one can “expect” to see  
+> by chance when searching a database of a particular size.  
+> It decreases exponentially as the Score (S) of the match increases.   
+> The lower the E-value the more “significant” the match is.   
+> However keep in mind that virtually identical short alignments have relatively high E values.   
+> This is because the calculation of the E value takes into account the length of the query sequence.   
+> These high E values make sense because shorter sequences  
+> have a higher probability of occurring in the database purely by chance.   
+
+
+
+The bitscore is 71.3.  
+
+Here is a definition of bitscore from the NCBI BLAST [glossary](https://www.ncbi.nlm.nih.gov/books/NBK62051/):
+> The bit score, S', 
+> is derived from the raw alignment score, S, 
+> taking the statistical properties of the scoring system into account.   
+> Because bit scores are normalized with respect to the scoring system,  
+> they can be used to compare alignment scores from different searches.  
+
+
+## Example: BLASTing the _mecA1_ gene against all of NCBI
+Now that we've tried BLAST'ing against a local, custom database,   
+let's try BLAST'ing a piece of the _mecA_ gene against NCBI.  
+To create the query file `mecA_BLAST.fasta`,   
+I randomly selected 140 nucleotides from `mecA.fasta`.  
+
+We should see that the query FASTA is a direct hit to the _mecA_ gene  
+(one of the NCBI hits should definitely be the NCBI sample `NG_047945.1`,   
+which is the sample the gene fragment was extracted from).  
+
+For this BLAST search, I will search against the `core_nt` database,   
+which is a faster, smaller, and more focused subset of the traditional `nt` (nucleotide) database.  
+This newer database is the default as of August 2024.    
+It seeks to reduce redundancy and storage requirements when downloading.      
+More information about it can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/).    
+
+General information about the different kinds of BLAST databases is also available [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/databases.html).  
+
+```
+io = IOBuffer();
+blastn("assets/mecA_BLAST.fasta"; db="core_nt", stdout=io, remote=true, outfmt="6 query subject expect")
+seek(io, 0);
+CSV.read(io, DataFrame; header=false)
+```
+
+Here is the output of the BLAST.  
+
+```
+julia> CSV.read(io, DataFrame; header=false)
+500×12 DataFrame
+ Row │ Column1     Column2     Column3  Column4  Column5  Column6  Column7  Column8  Column9  Column10  Column11  Column12 
+     │ String15    String15    Float64  Int64    Int64    Int64    Int64    Int64    Int64    Int64     Float64   Int64    
+─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+   1 │ mecA_BLAST  CP026646.1    100.0      140        0        0        1      140    46719     46580  7.12e-65       259
+   2 │ mecA_BLAST  CP154497.1    100.0      140        0        0        1      140    45702     45563  7.12e-65       259
+   3 │ mecA_BLAST  CP049499.1    100.0      140        0        0        1      140    61014     60875  7.12e-65       259
+   4 │ mecA_BLAST  CP030403.1    100.0      140        0        0        1      140    48633     48494  7.12e-65       259
+   5 │ mecA_BLAST  CP132494.1    100.0      140        0        0        1      140  1867742   1867603  7.12e-65       259
+   6 │ mecA_BLAST  MH798864.1    100.0      140        0        0        1      140      281       420  7.12e-65       259
+   7 │ mecA_BLAST  CP162442.1    100.0      140        0        0        1      140   503005    503144  7.12e-65       259
+   8 │ mecA_BLAST  OR082611.1    100.0      140        0        0        1      140     6415      6276  7.12e-65       259
+   9 │ mecA_BLAST  CP053183.1    100.0      140        0        0        1      140    41607     41468  7.12e-65       259
+  10 │ mecA_BLAST  CP085310.1    100.0      140        0        0        1      140  1610215   1610076  7.12e-65       259
+  11 │ mecA_BLAST  CP162465.1    100.0      140        0        0        1      140  1140196   1140057  7.12e-65       259
+  12 │ mecA_BLAST  CP133660.1    100.0      140        0        0        1      140  2821019   2821158  7.12e-65       259
+  13 │ mecA_BLAST  CP049476.1    100.0      140        0        0        1      140    40314     40175  7.12e-65       259
+  ⋮  │     ⋮           ⋮          ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮         ⋮         ⋮
+ 489 │ mecA_BLAST  CP093527.1    100.0      140        0        0        1      140    42814     42675  7.12e-65       259
+ 490 │ mecA_BLAST  CP035541.1    100.0      140        0        0        1      140    72431     72292  7.12e-65       259
+ 491 │ mecA_BLAST  CP145216.2    100.0      140        0        0        1      140    45830     45691  7.12e-65       259
+ 492 │ mecA_BLAST  CP193734.1    100.0      140        0        0        1      140    64595     64456  7.12e-65       259
+ 493 │ mecA_BLAST  CP083210.2    100.0      140        0        0        1      140    43331     43192  7.12e-65       259
+ 494 │ mecA_BLAST  MW052033.1    100.0      140        0        0        1      140      245       384  7.12e-65       259
+ 495 │ mecA_BLAST  CP168087.1    100.0      140        0        0        1      140    40806     40667  7.12e-65       259
+ 496 │ mecA_BLAST  CP150769.1    100.0      140        0        0        1      140  2583517   2583378  7.12e-65       259
+ 497 │ mecA_BLAST  CP030596.1    100.0      140        0        0        1      140    40848     40709  7.12e-65       259
+ 498 │ mecA_BLAST  MZ398128.1    100.0      140        0        0        1      140    16977     17116  7.12e-65       259
+ 499 │ mecA_BLAST  CP083199.2    100.0      140        0        0        1      140    40078     39939  7.12e-65       259
+ 500 │ mecA_BLAST  CP162663.1    100.0      140        0        0        1      140   938360    938499  7.12e-65       259
+                                                                                                           475 rows omitted
+```
+There are 500 hits with the same E-value and Bitscore.  
+This likely means that this sequence is an exact match to these 500 sequences in NCBI.  
+Because of this, the first row in the results is not necessarily a better match than the 500th,   
+even though it appears first.  
+
+To verify the first hit, we can look up the GenBankID of the first row: `CP026646.1`.    
+The NCBI [page](https://www.ncbi.nlm.nih.gov/nuccore/CP026646.1/) for this sample confirms that this sample was phenotyped as _S. aureus_.  
+Our query matches indices 46719 to 46580 on this reference genome.  
+When we use the Graphics feature to visualize gene annotations on the reference genome,  
+we see that there is a clear match to _mecA_ in the region that matches the query.  
+
+![BLAST Graphics](assets/mecA_BLAST.png) 
+
+Overall, this confirms that our BLAST worked as corrected!  
diff --git a/cookbook/assets/mecA_BLAST.fasta b/cookbook/assets/mecA_BLAST.fasta
@@ -0,0 +1,3 @@
+>mecA_BLAST file that is used for BLAST tutorial
+GAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGT
+TAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAA
diff --git a/cookbook/assets/mecA_BLAST.png b/cookbook/assets/mecA_BLAST.png