Rename the original cleanData to cleanMetaData and add roxygen skeleton. Introduce a writeCompressedParquet helper and export cleaned metadata, AMR phenotype, genome data and original metadata to compressed Parquet files, then create a separate DuckDB (parquet-backed) with views for metadata, amr_phenotype, genome_data and original_metadata. Reintroduce a new cleanData function focused on feature matrices (genes/proteins/domains/etc.) that writes feature tables to Parquet and creates corresponding views; remove duplicated metadata parquet exports from the feature-matrix flow. Minor whitespace and path-handling adjustments to normalize paths and ensure output directories exist.
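A minimal sketch of the export flow described above, assuming `arrow` and `duckdb` as the backing libraries. `writeCompressedParquet`'s actual signature is not shown in this PR, so the version below is illustrative only:

```r
# Hypothetical shape of the writeCompressedParquet helper: normalize the
# path, ensure the output directory exists, then write compressed Parquet.
writeCompressedParquet <- function(df, path, compression = "zstd") {
  path <- normalizePath(path, mustWork = FALSE)
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  arrow::write_parquet(df, path, compression = compression)
}

# Parquet-backed DuckDB: one view per exported table, reading the Parquet
# files in place rather than copying them into the database.
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "data.duckdb")
for (tbl in c("metadata", "amr_phenotype", "genome_data", "original_metadata")) {
  DBI::dbExecute(con, sprintf(
    "CREATE OR REPLACE VIEW %s AS SELECT * FROM read_parquet('%s.parquet')",
    tbl, tbl
  ))
}
DBI::dbDisconnect(con, shutdown = TRUE)
```

Backing the views with Parquet keeps the DuckDB file tiny and lets the same files be re-read by other tools (arrow, pandas, Spark) without export.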
Fixed trailing zero bug, fixed FTP timeout bug (?), fixed empty files hanging downloads, fixed imbalanced genome data sets (e.g., missing .fna but present .faa and .gff)
AbhirupaGhosh
commented
Feb 27, 2026
```r
message(" amr_phenotype: ", n_amr_ids)
message(" genomes with 0 AMR rows: ", length(ids_zero_amr))
if (length(ids_zero_amr)) {
  message("   e.g.: ", paste(utils::head(ids_zero_amr, 10), collapse = ", "))
```
Contributor
Author
It printed:

```
Initial summary: targets=24421 | AMR genomes=24422 | genome_data genomes=24423
Final summary:
  targets      : 24421
  bac_data     : 24421
  genome_data  : 24423
  amr_phenotype: 24422
  genomes with 0 AMR rows: 2
    e.g.: , genome.genome_id
```

So I am guessing it is treating the column name as a genome id!
Contributor
We can strip this out. It was for debugging originally, which is why it's messy. Alternatively, if it is useful to keep a summary, I can work on some improvements for it!
Contributor
Author
I didn't find any logical impact from this misprint; we can come back to it later. I have opened an issue to track it.
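For reference, the symptom (an empty string and the literal header name `genome.genome_id` counted as genomes) suggests the header row leaked into the id vector. A hypothetical guard, with illustrative names since the actual reading code is not visible in this thread:

```r
# Hypothetical fix: drop empty strings and the literal column name from the
# id vector before computing the zero-AMR set. All names here are assumed.
amr_ids <- unique(amr_phenotype$genome_id)
amr_ids <- amr_ids[nzchar(amr_ids) & amr_ids != "genome.genome_id"]
ids_zero_amr <- setdiff(target_ids, amr_ids)
```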
AbhirupaGhosh
commented
Feb 27, 2026
Comment on lines +111 to +118
```r
message("FTPS pass 1 (30s timeout)")
future::plan(future::multisession, workers = max(1, workers_first))
res1 <- future.apply::future_lapply(
  genome_ids,
  function(gid) {
    ok <- .ftpes_download_one(gid, out_dir,
      connect_timeout = 10L, max_time = 30L,
      speed_time = 30L, speed_limit = 2048L
```
Contributor
Author
Suggested change

```diff
-message("FTPS pass 1 (30s timeout)")
+message("FTPS pass 1 (45s timeout)")
 future::plan(future::multisession, workers = max(1, workers_first))
 res1 <- future.apply::future_lapply(
   genome_ids,
   function(gid) {
     ok <- .ftpes_download_one(gid, out_dir,
-      connect_timeout = 10L, max_time = 30L,
+      connect_timeout = 10L, max_time = 45L,
       speed_time = 30L, speed_limit = 2048L
```
AbhirupaGhosh
commented
Feb 27, 2026
Comment on lines +144 to +156
```r
message("FTPS pass 2 (60s timeout) for failed genomes")
future::plan(future::multisession, workers = max(1, workers_second))
res2 <- future.apply::future_lapply(
  fail_ids,
  function(gid) {
    ok <- .ftpes_download_one(gid, out_dir,
      connect_timeout = 10L, max_time = 60L,
      speed_time = 30L, speed_limit = 2048L
    )
    list(gid = gid, ok = ok)
  },
  future.seed = TRUE
)
```
Contributor
Author
Suggested change

```diff
-message("FTPS pass 2 (60s timeout) for failed genomes")
+message("FTPS pass 2 (120s timeout) for failed genomes")
 future::plan(future::multisession, workers = max(1, workers_second))
 res2 <- future.apply::future_lapply(
   fail_ids,
   function(gid) {
     ok <- .ftpes_download_one(gid, out_dir,
-      connect_timeout = 10L, max_time = 60L,
+      connect_timeout = 10L, max_time = 120L,
       speed_time = 30L, speed_limit = 2048L
     )
     list(gid = gid, ok = ok)
   },
   future.seed = TRUE
 )
```