Conversation
|
This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog. Note that this might contain changes that are on main, but not yet released. Changelog: 0.11.2 (2026-04-15)Features
|
There was a problem hiding this comment.
Pull request overview
Adds a new MutationProfile filter expression to SILO (per #1179) and introduces an optimized compilation path for large N-Of expressions over sequence positions to avoid repeated vertical-index lookups. This enables efficient “distance to profile” queries that expand into many per-position symbol conditions.
Changes:
- Introduces
NucleotideMutationProfile/AminoAcidMutationProfileexpression that rewrites into anN-Of/Notform. - Adds a single-pass vertical-index DP helper (
VerticalSequenceIndex::buildNOfDpTable) and a newNOfcompile fast-path forSymbolInSetchildren on the same sequence. - Adds docs, integration tests, and performance benchmarks/utilities for profiling the new behavior.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/silo/query_engine/filter/expressions/mutation_profile.h | Defines the new MutationProfile expression template and JSON parsing hook. |
| src/silo/query_engine/filter/expressions/mutation_profile.cpp | Implements profile construction (querySequence / sequenceId / mutations) and rewrite to Not(N-Of(...)). |
| src/silo/query_engine/filter/expressions/expression.cpp | Registers new expression types: NucleotideMutationProfile and AminoAcidMutationProfile. |
| src/silo/query_engine/filter/expressions/symbol_in_set.h | Adds getters needed for the new NOf compilation optimization. |
| src/silo/query_engine/filter/expressions/nof.cpp | Adds optimized compile path that batches vertical-index access and inlines the threshold DP. |
| src/silo/storage/column/vertical_sequence_index.h | Declares PositionQuery and buildNOfDpTable DP helper. |
| src/silo/storage/column/vertical_sequence_index.cpp | Implements buildNOfDpTable with a forward scan over vertical_bitmaps. |
| src/silo/test/mutation_profile.test.cpp | Adds integration tests for NucleotideMutationProfile behavior and error cases. |
| documentation/query_documentation.md | Documents NucleotideMutationProfile and AminoAcidMutationProfile JSON formats and semantics. |
| performance/sequence_generator.h | Adds shared benchmark utilities for generating synthetic sequences/reads and initializing DBs. |
| performance/nof_sequence_filter.cpp | Adds a benchmark targeting the large-N-Of optimization via MutationProfile. |
| performance/many_short_read_filters.cpp | Refactors to reuse sequence_generator.h. |
| performance/CMakeLists.txt | Ensures benchmarks can include performance headers and adds the new benchmark target. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
3c324c7 to
d15c000
Compare
d15c000 to
9832a96
Compare
71b3438 to
bcf53b1
Compare
bcf53b1 to
3838964
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| template <typename T> | ||
| std::string joinWithLimit( | ||
| const std::vector<T>& items, | ||
| std::string_view delimiter = ", ", | ||
| size_t limit = 10 | ||
| ) { | ||
| std::string res; | ||
| const size_t items_to_print = std::min(items.size(), limit); | ||
|
|
||
| for (size_t i = 0; i < items_to_print; ++i) { | ||
| if (i > 0) { | ||
| res += delimiter; | ||
| } | ||
| // Assumes items[i] has a toString() method or works with fmt | ||
| res += items[i]->toString(); | ||
| } | ||
|
|
||
| if (items.size() > items_to_print) { | ||
| res += fmt::format("{}... ({} more)", delimiter, items.size() - items_to_print); | ||
| } | ||
|
|
||
| return res; | ||
| } |
There was a problem hiding this comment.
joinWithLimit is templated as if it supports arbitrary T, but it unconditionally calls items[i]->toString(). Either constrain the template to pointer-like types (e.g., via a requires clause) or update the comment/signature to reflect that it expects T to be a smart/raw pointer to a type with toString().
| const auto& primary_key_column = table.columns.string_columns.at(primary_key_name); | ||
| for (uint32_t row_id = 0; row_id < static_cast<uint32_t>(primary_key_column.numValues()); | ||
| ++row_id) { | ||
| if (primary_key_column.getValueString(row_id) == seq_id) { | ||
| found_row_id = row_id; | ||
| break; | ||
| } | ||
| } |
There was a problem hiding this comment.
buildProfileFromSequenceId() does an O(n) linear scan over the primary-key column when the PK type is STRING. On large tables this will make sequenceId-based MutationProfile queries expensive even though it’s just a lookup. Consider requiring/using an indexed primary key for this path or adding a lookup/index helper to retrieve the row id without a full scan.
| const auto& primary_key_column = table.columns.string_columns.at(primary_key_name); | |
| for (uint32_t row_id = 0; row_id < static_cast<uint32_t>(primary_key_column.numValues()); | |
| ++row_id) { | |
| if (primary_key_column.getValueString(row_id) == seq_id) { | |
| found_row_id = row_id; | |
| break; | |
| } | |
| } | |
| throw IllegalQueryException(fmt::format( | |
| "{} MutationProfile sequenceId lookup requires an indexed string primary key; " | |
| "primary key '{}' is configured as STRING", | |
| SymbolType::SYMBOL_NAME, | |
| primary_key_name | |
| )); |
| std::vector<std::string_view> current_gen = {reference}; | ||
| std::bernoulli_distribution survives(1.0 - death_rate); | ||
|
|
||
| for (size_t gen = 0; gen < generations; ++gen) { | ||
| std::vector<std::string_view> next_gen; | ||
| for (const auto& seq : current_gen) { | ||
| for (size_t child = 0; child < children_per_node; ++child) { | ||
| if (survives(rng)) { | ||
| all_sequences.push_back(mutateSequence(seq)); | ||
| next_gen.push_back(all_sequences.back()); | ||
| } | ||
| } | ||
| } | ||
| if (next_gen.empty()) { | ||
| next_gen.push_back(all_sequences.back()); |
There was a problem hiding this comment.
SequenceTreeGenerator::generateEvolvedSequences() stores std::string_view entries that refer to strings inside all_sequences, but all_sequences.push_back(...) can reallocate and invalidate all existing views. This is undefined behavior and can lead to corrupted sequences/crashes in the benchmarks. Use stable storage (e.g., std::deque<std::string>), reserve enough capacity up front, or store indices/owning std::string values in current_gen/next_gen instead of string_view.
| std::vector<std::string_view> current_gen = {reference}; | |
| std::bernoulli_distribution survives(1.0 - death_rate); | |
| for (size_t gen = 0; gen < generations; ++gen) { | |
| std::vector<std::string_view> next_gen; | |
| for (const auto& seq : current_gen) { | |
| for (size_t child = 0; child < children_per_node; ++child) { | |
| if (survives(rng)) { | |
| all_sequences.push_back(mutateSequence(seq)); | |
| next_gen.push_back(all_sequences.back()); | |
| } | |
| } | |
| } | |
| if (next_gen.empty()) { | |
| next_gen.push_back(all_sequences.back()); | |
| std::vector<size_t> current_gen = {0}; | |
| std::bernoulli_distribution survives(1.0 - death_rate); | |
| for (size_t gen = 0; gen < generations; ++gen) { | |
| std::vector<size_t> next_gen; | |
| for (size_t seq_index : current_gen) { | |
| for (size_t child = 0; child < children_per_node; ++child) { | |
| if (survives(rng)) { | |
| all_sequences.push_back(mutateSequence(all_sequences.at(seq_index))); | |
| next_gen.push_back(all_sequences.size() - 1); | |
| } | |
| } | |
| } | |
| if (next_gen.empty()) { | |
| next_gen.push_back(all_sequences.size() - 1); |
| "You need to provide the sequence name with the AminoAcid MutationProfile filter." | ||
| }; | ||
|
|
||
| // No sequenceName provided and no default AA sequence in the config → error |
There was a problem hiding this comment.
This comment describes the “no sequenceName provided” error case, but the scenario below is actually testing an out-of-bounds mutation position with sequenceName: "gene1". Update the comment to match what the test is asserting.
| // No sequenceName provided and no default AA sequence in the config → error | |
| // Mutation position is outside the bounds of the specified AA sequence → error |
| #include <boost/algorithm/string/join.hpp> | ||
| #include <nlohmann/json.hpp> | ||
|
|
||
| #include "silo/common/string_utils.h" | ||
| #include "silo/query_engine/filter/expressions/expression.h" |
There was a problem hiding this comment.
<boost/algorithm/string/join.hpp> appears unused after switching toString() to joinWithLimit. Consider removing the Boost include to reduce compile time and avoid unused-include drift.
| @@ -11,6 +11,7 @@ | |||
| #include <boost/algorithm/string/join.hpp> | |||
There was a problem hiding this comment.
<boost/algorithm/string/join.hpp> appears unused after switching toString() to joinWithLimit. Consider removing the Boost include to reduce compile time and avoid unused-include drift.
| #include <boost/algorithm/string/join.hpp> |
| std::string Threshold::toString() const { | ||
| std::string res; | ||
| if (match_exactly) { | ||
| res += "="; | ||
| } else { | ||
| res += ">="; | ||
| } | ||
| for (const auto& child : this->non_negated_children) { | ||
| res += ", " + child->toString(); | ||
| } | ||
| for (const auto& child : this->non_negated_children) { | ||
| res += ", ! " + child->toString(); | ||
| } | ||
| res += fmt::format("{}-of", number_of_matchers); | ||
|
|
||
| res += "non_negated: (" + joinWithLimit(non_negated_children) + ") "; | ||
| res += "negated: (" + joinWithLimit(negated_children) + ") "; | ||
| res += ")"; |
There was a problem hiding this comment.
Threshold::toString() ends with ), but never adds an opening (, and it concatenates "{}-of" directly with "non_negated" without a separator. This produces malformed/unreadable operator strings; adjust the formatting to include balanced parentheses and clear separators (e.g., prefix with ( / Threshold( and add a space or delimiter after -of).
| ) | ||
| }; | ||
|
|
||
| // Profile with explicit mutation C at position 1 (0-indexed: 0), distance=0 |
There was a problem hiding this comment.
The comment says the mutation position is “0-indexed”, but the filter JSON uses 1-based positions (and the code enforces that). Please update the comment to avoid confusion about the indexing convention.
| // Profile with explicit mutation C at position 1 (0-indexed: 0), distance=0 | |
| // Profile with explicit mutation C at 1-based position 1, distance=0 |
| std::string Intersection::toString() const { | ||
| std::string res = "(" + children[0]->toString(); | ||
|
|
||
| for (uint32_t i = 1; i < children.size(); i++) { | ||
| res += " & " + children[i]->toString(); | ||
| } | ||
| for (const auto& child : negated_children) { | ||
| res += " &! " + child->toString(); | ||
| res += joinWithLimit(children, " & "); | ||
| if (!negated_children.empty()) { | ||
| res += " &! " + joinWithLimit(negated_children, " &! "); | ||
| } | ||
|
|
There was a problem hiding this comment.
Intersection::toString() currently prepends children[0]->toString() and then appends joinWithLimit(children, " & "), which reprints the first child and also drops the delimiter between the two parts (resulting in duplicated/concatenated output). Build the string using only joinWithLimit(children, " & ") (or join from the second child) to avoid duplication and ensure correct separators.
| "querySequence": string | ||
| | "sequenceId": string | ||
| | "mutations": {"position": number, "symbol": string}[] |
There was a problem hiding this comment.
| "querySequence": string | |
| | "sequenceId": string | |
| | "mutations": {"position": number, "symbol": string}[] | |
| // exactly one of these | |
| "querySequence": string | |
| "sequenceId": string | |
| "mutations": {"position": number, "symbol": string}[] |
would be easier to understand (at least for me)
| // Assumes items[i] has a toString() method or works with fmt | ||
| res += items[i]->toString(); |
There was a problem hiding this comment.
Maybe overkill here, but actually concepts would be the way to properly type this?
resolves #1179
Summary
This adds a
MutationProfilefilter to silo. The behavior of this filter is outlined in #1179PR Checklist