feat!(ingest, workflows): add GCA downloads to mirror and enable use as segment grouping overrides in ingest #6112
This PR may be related to: #5905 (Ingest: add gca accession downloads to the ingest pipeline)
Claude finished @anna-parker's task in 2m 52s
Code Review
Good overall implementation. The approach of downloading, rehydrating, and mirroring assembly data as
Breaking change: The rename of
Missing schema entry:
No tests: The new
See inline comments for specific issues.
Co-authored-by: Anna (Anya) Parker <50943381+anna-parker@users.noreply.github.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0f71257fb4
theosanderson left a comment
Not code I know well but looks sensible to me!
I forgot that we are actually using this param on PPX during the last update, see loculus-project/loculus#6112 for details
resolves #5905
Alternative to #5917: have the action download dehydrated assemblies from NCBI Virus, unzip and re-hydrate them, then rezip the assemblies with tar.zstd. In #5917 the action downloaded the hydrated assemblies, unzipped and then rezipped with tar.zstd.
BREAKING CHANGE
Users of ingest with the grouping_override key should rename it to grouping_override_url.
Numbers
The dehydrated files are 19M zipped, 260M unzipped, and 2.3G unzipped and rehydrated.
(I can even zip the dehydrated, zipped files once more with tar.zstd and get them down to 9.7M.)
The unzipped and rehydrated data, when rezipped with tar.zstd, is 79M.
Download, re-hydration and zipping took 40 min in the CLI action for influenza A. Downloading from the mirror locally took seconds, and unzipping took under a minute. (Downloading from the mirror and rehydrating in ingest took 50 min.)
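To make the size trade-off above easier to compare, here is a minimal sketch that turns the quoted figures into ratios. It assumes 1G = 1024M and reads the 2.3G figure as the unzipped, rehydrated data. The dictionary keys and function names are hypothetical and are not part of the ingest code.

```python
# Sizes quoted in the Numbers section, in M (assuming 1G = 1024M).
SIZES_MB = {
    "dehydrated_zip": 19.0,             # dehydrated download as delivered
    "dehydrated_unzipped": 260.0,       # the same data after unzipping
    "rehydrated_unzipped": 2.3 * 1024,  # assumption: 2.3G is the rehydrated data
    "dehydrated_zip_tar_zstd": 9.7,     # dehydrated zip packed again with tar.zstd
    "rehydrated_tar_zstd": 79.0,        # rehydrated data packed with tar.zstd
}


def compression_ratio(larger_mb: float, smaller_mb: float) -> float:
    """Return how many times larger the first size is than the second."""
    if smaller_mb <= 0:
        raise ValueError("sizes must be positive")
    return larger_mb / smaller_mb


def summarize(sizes: dict[str, float]) -> dict[str, float]:
    """Compute the two ratios that matter for choosing what to mirror."""
    return {
        # How well tar.zstd compresses the rehydrated assemblies.
        "rehydrated_vs_tar_zstd": compression_ratio(
            sizes["rehydrated_unzipped"], sizes["rehydrated_tar_zstd"]
        ),
        # How much larger the mirrored artifact is than the dehydrated zip.
        "mirror_artifact_vs_dehydrated_zip": compression_ratio(
            sizes["rehydrated_tar_zstd"], sizes["dehydrated_zip"]
        ),
    }


if __name__ == "__main__":
    for name, value in summarize(SIZES_MB).items():
        print(f"{name}: {value:.1f}x")
```

Under these assumptions, tar.zstd shrinks the rehydrated data by roughly 30x, and the mirrored artifact is about 4x larger than the dehydrated zip. That extra size is the storage cost of not rehydrating inside ingest.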
Testing using CCHF on the preview
I am now using these results for CCHF on the preview:
are now all together: https://calculate-gca-groups-2.loculus.org/seq/LOC_000N04F.1 and https://calculate-gca-groups-2.loculus.org/seq/LOC_000PX88.1
PR Checklist
🚀 Preview: Add preview label to enable