From 21933d551b5a0bd98a1a518a63ccce48737e1fc4 Mon Sep 17 00:00:00 2001
From: David Laehnemann
Date: Tue, 3 Sep 2024 12:48:21 +0200
Subject: [PATCH 1/2] docs: update and amend simulation stage output file explanations

---
 README.md | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 065b446..cdee556 100755
--- a/README.md
+++ b/README.md
@@ -646,23 +646,35 @@ __Example runs:__
 
 ### 2. Simulation stage
 
-1. `simulated_reads.fasta`
-    FASTA file of simulated reads. Each reads has "unaligned", "aligned", or "perfect" in the header determining their error rate. "unaligned" means that the reads have an error rate over 90% and cannot be aligned. "aligned" reads have the same error rate as training reads. "perfect" reads have no errors.
+#### read files
+
+Two FASTA files of simulated reads, or FASTQ files if the `--fastq` option is set:
+
+1. `simulated_aligned_reads.fast(a|q)`
+2. `simulated_unaligned_reads.fast(a|q)` (this file does not get generated, if you request `--perfect` reads without errors)
+
+In these files, each read has `unaligned`, `aligned`, or `perfect` in the header recording their error rate:
+* `unaligned` means that the reads have an error rate over 90% and cannot be aligned.
+* `aligned` reads have the same error rate as training reads.
+* `perfect` reads have no errors.
 
-    To explain the information in the header, we have two examples:
-    * `>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0`
-        All information before the first `_` are chromosome information. `468529` is the start position and `unaligned` suggesting it should be unaligned to the reference. The first `0` is the sequence index. `F` represents a forward strand. `0_3236_0` means that sequence length extracted from the reference is 3236 bases.
-    * `>ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2`
-        This is an aligned read coming from chromosome XI at position 115406. `16565` is the sequence index. `R` represents a reverse complement strand. `92_12710_2` means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.
+To explain the information in the header, we have two examples:
+* `>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0`
+    All information before the first `_` are chromosome information. `468529` is the start position and `unaligned` suggesting it should be unaligned to the reference. The first `0` is the sequence index. `F` represents a forward strand. `0_3236_0` means that sequence length extracted from the reference is 3236 bases.
+* `>ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2`
+    This is an aligned read coming from chromosome XI at position 115406. `16565` is the sequence index. `R` represents a reverse complement strand. `92_12710_2` means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.
 
-    The information in the header can help users to locate the read easily.
+The information in the header can help users to locate the read easily.
 
 __Specific to transcriptome simulation__: for reads that include retained introns, the header contains the information starting from `Retained_intron`, each genomic interval is separated by `;`.
 
 __Specific to chimeric reads simulation__: for chimeric reads, different source chromosome and locations are separated by `;`, and there's a `chimeric` in the header to indicate.
+
+#### error profile file
 
-2. `simulated_error_profile`
-    Contains all the information of errors introduced into each reads, including error type, position, original bases and current bases.
+This file contains all the information of errors introduced into each reads, including error type, position, original bases and current bases:
+
+3. `simulated_aligned_error_profile`
 
 ## Acknowledgements
 

From 5c34c7d6bc42f58dbbe88c504a76d6e5f3657bd4 Mon Sep 17 00:00:00 2001
From: David Laehnemann
Date: Tue, 3 Sep 2024 18:14:24 +0200
Subject: [PATCH 2/2] docs: explain that metagenome simulations create one set of files per simulated sample

---
 README.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index cdee556..c511109 100755
--- a/README.md
+++ b/README.md
@@ -648,16 +648,18 @@ __Example runs:__
 
 #### read files
 
-Two FASTA files of simulated reads, or FASTQ files if the `--fastq` option is set:
+Two FASTA files of simulated reads are usually produced, or FASTQ files if the `--fastq` option is set:
 
 1. `simulated_aligned_reads.fast(a|q)`
 2. `simulated_unaligned_reads.fast(a|q)` (this file does not get generated, if you request `--perfect` reads without errors)
-
+
+For `metagenome` mode simulations, these two files are produced for each simulated sample, with samples systematically named: `simulated_sample0_aligned_reads.fast(a|q), simulated_sample1_aligned_reads.fast(a|q), ...`
+
 In these files, each read has `unaligned`, `aligned`, or `perfect` in the header recording their error rate:
 * `unaligned` means that the reads have an error rate over 90% and cannot be aligned.
 * `aligned` reads have the same error rate as training reads.
-* `perfect` reads have no errors.
-
+* `perfect` reads have no errors.
+
 To explain the information in the header, we have two examples:
 * `>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0`
     All information before the first `_` are chromosome information. `468529` is the start position and `unaligned` suggesting it should be unaligned to the reference. The first `0` is the sequence index. `F` represents a forward strand. `0_3236_0` means that sequence length extracted from the reference is 3236 bases.
@@ -676,6 +678,8 @@ This file contains all the information of errors introduced into each reads, inc
 
 3. `simulated_aligned_error_profile`
 
+For `metagenome` mode simulations, this file is produced for each simulated sample, with samples systematically named: `simulated_sample0_error_profile, simulated_sample1_error_profile, ...`
+
 ## Acknowledgements
 
 Sincere thanks to our labmates and all contributors and users of this tool.
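
The read header layout described in the README text of this patch can be illustrated with a small parser. This is a minimal sketch, not part of the patch and not code from the tool itself: the names `ReadHeader` and `parse_read_header` are hypothetical, and the sketch assumes the plain header form shown in the two examples (seven `_`-separated fields after the reference information). Splitting from the right is an assumption made here so that the reference part stays intact; headers for chimeric reads or retained introns follow a different layout and are not handled.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    """Fields of a simulated read header (hypothetical container, for illustration)."""
    reference: str  # chromosome information before the positional fields
    start: int      # start position on the reference
    kind: str       # "aligned", "unaligned", or "perfect"
    index: int      # sequence index
    strand: str     # "F" (forward) or "R" (reverse complement)
    head: int       # length of the head region
    middle: int     # length of the middle region
    tail: int       # length of the tail region


def parse_read_header(header: str) -> ReadHeader:
    """Parse a plain (non-chimeric, no retained intron) simulated read header."""
    # Drop the FASTA ">" marker, then take the last seven "_"-separated fields.
    fields = header.strip().lstrip(">").rsplit("_", 7)
    if len(fields) != 8:
        raise ValueError(f"unexpected header layout: {header!r}")
    reference, start, kind, index, strand, head, middle, tail = fields
    if strand not in ("F", "R"):
        raise ValueError(f"unexpected strand {strand!r} in header: {header!r}")
    return ReadHeader(reference, int(start), kind, int(index), strand,
                      int(head), int(middle), int(tail))


example = parse_read_header(
    ">ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2"
)
print(example.kind, example.start, example.strand, example.middle)
```

For the second README example this yields an `aligned` read starting at position 115406 on the reverse complement strand, with a 92-base head, a 12710-base middle region, and a 2-base tail.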