Skip to content

Feature/refactor_DIMS_MakeInit#94

Open
mraves2 wants to merge 5 commits intodevelopfrom
feature/refactor_DIMS_MakeInit
Open

Feature/refactor_DIMS_MakeInit#94
mraves2 wants to merge 5 commits intodevelopfrom
feature/refactor_DIMS_MakeInit

Conversation

@mraves2
Copy link
Contributor

@mraves2 mraves2 commented Feb 13, 2026

MakeInit hernoemd naar ParseSamplesheet
Code opgeschoond, replication pattern wordt nu gemaakt adhv de informatie in de samplesheet. Variabele nr_replicates is niet meer nodig. Unit test toegevoegd.

#' @param sample_sheet: matrix of file names and sample names
#'
#' @return ints_sorted: list of sample names with corresponding file names (technical replicates)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these roxygen style comments be places above the function definition? https://roxygen2.r-lib.org/articles/roxygen2.html#basic-process

@@ -0,0 +1,29 @@
# function for parse_samplesheet
generate_repl_pattern <- function(sample_sheet) {
#' Generate replication pattern list based on information in sample_sheet

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the function name should be generate_repl_pattern_list?

#'
#' @return ints_sorted: list of sample names with corresponding file names (technical replicates)

# get the right columns from the samplesheet

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In-line comments are more usefull if they explain a choice, rather than stating what the line(s) below it does.

the right column

does not make it clearer, but if you explain the rationale then it does become clearer :)

file_name_col <- grep("File_Name|File Name", colnames(sample_sheet))
sample_name_col <- grep("Sample_Name|Sample Name", colnames(sample_sheet))
# get the unique sample names from the samplesheet
sample_names <- sort(unique(trimws(as.vector(unlist(sample_sheet[sample_name_col])))))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you consider using piping to make this more readable?

Suggested change
sample_names <- sort(unique(trimws(as.vector(unlist(sample_sheet[sample_name_col])))))
sample_names <- sample_sheet[sample_name_col] |>
unlist() |>
as.vector() |>
trimws() |>
unique() |>
sort()

Comment on lines +19 to +24
repl_pattern <- c()
for (sample_group in sample_names) {
file_indices <- which(sample_sheet[, sample_name_col] == sample_group)
file_names <- sample_sheet[file_indices, file_name_col]
repl_pattern <- c(repl_pattern, list(file_names))
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of misschien?

Suggested change
repl_pattern <- c()
for (sample_group in sample_names) {
file_indices <- which(sample_sheet[, sample_name_col] == sample_group)
file_names <- sample_sheet[file_indices, file_name_col]
repl_pattern <- c(repl_pattern, list(file_names))
}
repl_pattern <- split(
sample_sheet[[file_name_col]],
sample_sheet[[sample_name_col]]
)[sample_names]

repl_pattern <- generate_repl_pattern(sample_sheet)

# write the replication pattern to text file for troubleshooting purposes
sink("replication_pattern.txt")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add something to the output name to more clearly indicate that this is a logging output, and not actual 'data'. e.g., replication_pattern_log.txt


script:
"""
Rscript ${baseDir}/CustomModules/DIMS/ParseSamplesheet.R $samplesheet $params.preprocessing_scripts_dir

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using $params inside a module/process is not nextflow best practice.

The best option is to add it as an input. So input becomes:

input:
    path(samplesheet)
    path(preprocessing_scripts_dir)


script:
"""
Rscript ${baseDir}/CustomModules/DIMS/ParseSamplesheet.R $samplesheet $params.preprocessing_scripts_dir

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the Rscript referred from the baseDir? For stability sake, it it would make more sense to use a resource folder and put it in there. For example in the PRS repo: https://github.com/UMCUGenetics/DxNextflowPRS/blob/develop/modules/local/SNPlist/main.nf (this is python but it works in the same way)

Rscript ${baseDir}/CustomModules/DIMS/ParseSamplesheet.R 

tag "DIMS ParseSamplesheet"
label 'ParseSamplesheet'
container = 'docker://umcugenbioinf/dims:1.3'
shell = ['/bin/bash', '-euo', 'pipefail']

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
shell = ['/bin/bash', '-euo', 'pipefail']

Remove those here, shell directives are/should be declared in the main nextflow.config

Comment on lines +11 to +12
path('init.RData')
path('replication_pattern.txt')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputs are missing emit statements

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants