From 21933d551b5a0bd98a1a518a63ccce48737e1fc4 Mon Sep 17 00:00:00 2001
From: David Laehnemann <david.laehnemann@hhu.de>
Date: Tue, 3 Sep 2024 12:48:21 +0200
Subject: [PATCH 1/2] docs: update and amend simulation stage output file
 explanations

---
 README.md | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 065b446..cdee556 100755
--- a/README.md
+++ b/README.md
@@ -646,23 +646,35 @@ __Example runs:__
 
 ### 2. Simulation stage  
 
-1. `simulated_reads.fasta`
-  FASTA file of simulated reads. Each reads has "unaligned", "aligned", or "perfect" in the header determining their error rate. "unaligned" means that the reads have an error rate over 90% and cannot be aligned. "aligned" reads have the same error rate as training reads. "perfect" reads have no errors.  
+#### read files
+
+Two FASTA files of simulated reads, or FASTQ files if the `--fastq` option is set:
+
+1. `simulated_aligned_reads.fast(a|q)`
+2. `simulated_unaligned_reads.fast(a|q)` (this file does not get generated, if you request `--perfect` reads without errors)
+  
+In these files, each read has `unaligned`, `aligned`, or `perfect` in the header recording their error rate:
+* `unaligned` means that the reads have an error rate over 90% and cannot be aligned.
+* `aligned` reads have the same error rate as training reads.
+* `perfect` reads have no errors.  
   
-  To explain the information in the header, we have two examples:  
-  * `>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0`  
-    All information before the first `_` are chromosome information. `468529` is the start position and `unaligned` suggesting it should be unaligned to the reference. The first `0` is the sequence index. `F` represents a forward strand. `0_3236_0` means that sequence length extracted from the reference is 3236 bases.  
-  * `>ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2`  
-    This is an aligned read coming from chromosome XI at position 115406. `16565` is the sequence index. `R` represents a reverse complement strand. `92_12710_2` means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.  
+To explain the information in the header, we have two examples:  
+* `>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0`  
+  All information before the first `_` are chromosome information. `468529` is the start position and `unaligned` suggesting it should be unaligned to the reference. The first `0` is the sequence index. `F` represents a forward strand. `0_3236_0` means that sequence length extracted from the reference is 3236 bases.  
+* `>ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2`  
+  This is an aligned read coming from chromosome XI at position 115406. `16565` is the sequence index. `R` represents a reverse complement strand. `92_12710_2` means that this read has a 92-base head region (which cannot be aligned), followed by a 12710-base middle region, and then a 2-base tail region.  
   
-  The information in the header can help users to locate the read easily.  
+The information in the header can help users to locate the read easily.  
   
 __Specific to transcriptome simulation__: for reads that include retained introns, the header contains the information starting from `Retained_intron`, each genomic interval is separated by `;`.
 
 __Specific to chimeric reads simulation__: for chimeric reads, different source chromosome and locations are separated by `;`, and there's a `chimeric` in the header to indicate.
+
+#### error profile file
   
-2. `simulated_error_profile`
-  Contains all the information of errors introduced into each reads, including error type, position, original bases and current bases.  
+This file contains all the information about the errors introduced into each read, including error type, position, original bases and current bases:
+
+3. `simulated_aligned_error_profile`
 
 
 ## Acknowledgements

From 5c34c7d6bc42f58dbbe88c504a76d6e5f3657bd4 Mon Sep 17 00:00:00 2001
From: David Laehnemann <david.laehnemann@hhu.de>
Date: Tue, 3 Sep 2024 18:14:24 +0200
Subject: [PATCH 2/2] docs: explain that metagenome simulations create one set
 of files per simulated sample

---
 README.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index cdee556..c511109 100755
--- a/README.md
+++ b/README.md
@@ -648,16 +648,18 @@ __Example runs:__
 
 #### read files
 
-Two FASTA files of simulated reads, or FASTQ files if the `--fastq` option is set:
+Two FASTA files of simulated reads are usually produced, or FASTQ files if the `--fastq` option is set:
 
 1. `simulated_aligned_reads.fast(a|q)`
 2. `simulated_unaligned_reads.fast(a|q)` (this file does not get generated, if you request `--perfect` reads without errors)
-  
+
+For `metagenome` mode simulations, these two files are produced for each simulated sample, with file names numbered by sample: `simulated_sample0_aligned_reads.fast(a|q)`, `simulated_sample1_aligned_reads.fast(a|q)`, ...
+
 In these files, each read has `unaligned`, `aligned`, or `perfect` in the header recording their error rate:
 * `unaligned` means that the reads have an error rate over 90% and cannot be aligned.
 * `aligned` reads have the same error rate as training reads.
-* `perfect` reads have no errors.  
-  
+* `perfect` reads have no errors.
+
 To explain the information in the header, we have two examples:  
 * `>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0`  
   All information before the first `_` are chromosome information. `468529` is the start position and `unaligned` suggesting it should be unaligned to the reference. The first `0` is the sequence index. `F` represents a forward strand. `0_3236_0` means that sequence length extracted from the reference is 3236 bases.  
@@ -676,6 +678,8 @@ This file contains all the information of errors introduced into each reads, inc
 
 3. `simulated_aligned_error_profile`
 
+For `metagenome` mode simulations, this file is produced for each simulated sample, with file names numbered by sample: `simulated_sample0_error_profile`, `simulated_sample1_error_profile`, ...
+
 
 ## Acknowledgements
 Sincere thanks to our labmates and all contributors and users of this tool.