Skip to content

feat!(ingest, workflows): add GCA downloads to mirror and enable use as segment grouping overrides in ingest#6112

Merged
anna-parker merged 21 commits intomainfrom
calculate_gca_groups_2
Mar 19, 2026
Merged

feat!(ingest, workflows): add GCA downloads to mirror and enable use as segment grouping overrides in ingest#6112
anna-parker merged 21 commits intomainfrom
calculate_gca_groups_2

Conversation

@anna-parker
Copy link
Copy Markdown
Contributor

@anna-parker anna-parker commented Mar 11, 2026

resolves #5905

Alternative to #5917 - have action download hydrated assemblies from NCBI virus, unzip and re-hydrate assemblies, then rezip assemblies with tar.zstd.

In #5917 the action downloaded the hydrated assemblies, unzipped and then rezipped with tar.zstd

BREAKING CHANGE

Users of ingest with the grouping_override key should rename it to grouping_override_url.

Numbers

The dehydrated files are 19M, unzipped 260M and 2.3G unzipped and dehydrated.
(I can even zip the dehydrated + zipped files once more with tar.zstd and get them down to 9.7M).
The unzipped and dehydrated file when rezipped with tar.zstd is 79M.

Download, re-hydration and zipping took 40min in the CLI action for influenza A, download from the mirror locally took seconds, unzipping also took under a minute. (Downloading from the mirror and rehydrating in ingest took 50min)

Testing using CCHF on the preview

I am now using these results for CCHF on the preview:

{
    "GCA_000854165": [
        "AY389361.2",
        "AF467768.2",
        "U88410.1"
    ],
    "GCF_000854165": [
        "NC_005301.3",
        "NC_005300.2",
        "NC_005302.1"
    ]
}

are now all together: https://calculate-gca-groups-2.loculus.org/seq/LOC_000N04F.1 and https://calculate-gca-groups-2.loculus.org/seq/LOC_000PX88.1

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

@claude claude Bot added ingest Ingest pipeline deployment Code changes targetting the deployment infrastructure labels Mar 11, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 11, 2026

This PR may be related to: #5905 (Ingest: add gca accession downloads to the ingest pipeline)

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Mar 11, 2026
@anna-parker anna-parker changed the title Calculate gca groups 2 feat(ingest, workflows): add GCA downloads to mirror and enable use as segment grouping overrides in ingest Mar 11, 2026
@anna-parker anna-parker marked this pull request as ready for review March 11, 2026 20:51
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 11, 2026

Claude finished @anna-parker's task in 2m 52s —— View job


Code Review

  • Gather context and read changed files
  • Review .github/workflows/datasets-assembly-mirror.yml
  • Review ingest/scripts/calculate_assembly_groups.py
  • Review ingest/Snakefile changes
  • Review kubernetes/loculus/values.schema.json and values.yaml
  • Post inline comments and final review summary

Good overall implementation. The approach of downloading, rehydrating, and mirroring assembly data as tar.zst archives is clean, and the Snakefile flow for conditional group override calculation is logically structured. A few issues to address:

Breaking change: The rename of grouping_overridegrouping_override_url in the schema and Snakefile is a breaking change for any existing configs using the old key. Worth a note in the PR or a migration check.

Missing schema entry: grouping_ignore_list_url is referenced in the Snakefile (config.get("grouping_ignore_list_url", "")) but has no entry in values.schema.json. This leaves it undocumented and unvalidated for Helm users.

No tests: The new calculate_assembly_groups.py script has no automated tests. The PR checklist acknowledges this — it would be good to add at least a basic unit test for the happy path and the duplicate-detection logic.

See inline comments for specific issues.

Comment thread .github/workflows/datasets-assembly-mirror.yml Outdated
Comment thread .github/workflows/datasets-assembly-mirror.yml Outdated
Comment thread .github/workflows/datasets-assembly-mirror.yml Outdated
Comment thread .github/workflows/datasets-assembly-mirror.yml
Comment thread ingest/scripts/calculate_assembly_groups.py
Comment thread kubernetes/loculus/values.yaml
Comment thread ingest/scripts/calculate_assembly_groups.py Outdated
Comment thread ingest/scripts/calculate_assembly_groups.py
Comment thread ingest/Snakefile
Comment thread ingest/Snakefile Outdated
Comment thread kubernetes/loculus/values.schema.json
@theosanderson
Copy link
Copy Markdown
Member

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f71257fb4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ingest/Snakefile
@anna-parker anna-parker changed the title feat(ingest, workflows): add GCA downloads to mirror and enable use as segment grouping overrides in ingest feat!(ingest, workflows): add GCA downloads to mirror and enable use as segment grouping overrides in ingest Mar 13, 2026
@anna-parker anna-parker added ingest Ingest pipeline and removed ingest Ingest pipeline preview Triggers a deployment to argocd labels Mar 19, 2026
Comment thread ingest/scripts/calculate_assembly_groups.py
Comment thread ingest/Snakefile
Copy link
Copy Markdown
Member

@theosanderson theosanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not code I know well but looks sensible to me!

@anna-parker anna-parker merged commit 1a5aca3 into main Mar 19, 2026
71 checks passed
@anna-parker anna-parker deleted the calculate_gca_groups_2 branch March 19, 2026 17:15
anna-parker added a commit to pathoplexus/pathoplexus that referenced this pull request Mar 24, 2026
I forgot that we are actually using this param on PPX during the last update, see loculus-project/loculus#6112 for details
anna-parker added a commit to pathoplexus/pathoplexus that referenced this pull request Mar 25, 2026
I forgot that we are actually using this param on PPX during the last update, see loculus-project/loculus#6112 for details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deployment Code changes targetting the deployment infrastructure ingest Ingest pipeline

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ingest: add gca accession downloads to the ingest pipeline

2 participants