Skip to content

feat(silo): add a MutationProfile filter#1189

Open
Taepper wants to merge 1 commit intomainfrom
1179-mutation-profile
Open

feat(silo): add a MutationProfile filter#1189
Taepper wants to merge 1 commit intomainfrom
1179-mutation-profile

Conversation

@Taepper
Copy link
Copy Markdown
Collaborator

@Taepper Taepper commented Mar 2, 2026

resolves #1179

Summary

This adds a MutationProfile filter to silo. The behavior of this filter is outlined in #1179

PR Checklist

  • All necessary documentation has been adapted or there is an issue to do so.
  • The implemented feature is covered by an appropriate test.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 2, 2026

This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog.

Note that this might contain changes that are on main, but not yet released.

Changelog:

0.11.2 (2026-04-15)

Features

  • silo: add a MutationProfile filter (3838964)

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new MutationProfile filter expression to SILO (per #1179) and introduces an optimized compilation path for large N-Of expressions over sequence positions to avoid repeated vertical-index lookups. This enables efficient “distance to profile” queries that expand into many per-position symbol conditions.

Changes:

  • Introduces NucleotideMutationProfile / AminoAcidMutationProfile expression that rewrites into an N-Of/Not form.
  • Adds a single-pass vertical-index DP helper (VerticalSequenceIndex::buildNOfDpTable) and a new NOf compile fast-path for SymbolInSet children on the same sequence.
  • Adds docs, integration tests, and performance benchmarks/utilities for profiling the new behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/silo/query_engine/filter/expressions/mutation_profile.h Defines the new MutationProfile expression template and JSON parsing hook.
src/silo/query_engine/filter/expressions/mutation_profile.cpp Implements profile construction (querySequence / sequenceId / mutations) and rewrite to Not(N-Of(...)).
src/silo/query_engine/filter/expressions/expression.cpp Registers new expression types: NucleotideMutationProfile and AminoAcidMutationProfile.
src/silo/query_engine/filter/expressions/symbol_in_set.h Adds getters needed for the new NOf compilation optimization.
src/silo/query_engine/filter/expressions/nof.cpp Adds optimized compile path that batches vertical-index access and inlines the threshold DP.
src/silo/storage/column/vertical_sequence_index.h Declares PositionQuery and buildNOfDpTable DP helper.
src/silo/storage/column/vertical_sequence_index.cpp Implements buildNOfDpTable with a forward scan over vertical_bitmaps.
src/silo/test/mutation_profile.test.cpp Adds integration tests for NucleotideMutationProfile behavior and error cases.
documentation/query_documentation.md Documents NucleotideMutationProfile and AminoAcidMutationProfile JSON formats and semantics.
performance/sequence_generator.h Adds shared benchmark utilities for generating synthetic sequences/reads and initializing DBs.
performance/nof_sequence_filter.cpp Adds a benchmark targeting the large-N-Of optimization via MutationProfile.
performance/many_short_read_filters.cpp Refactors to reuse sequence_generator.h.
performance/CMakeLists.txt Ensures benchmarks can include performance headers and adds the new benchmark target.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/silo/query_engine/filter/expressions/nof.cpp Outdated
Comment thread src/silo/query_engine/filter/expressions/nof.cpp Outdated
Comment thread src/silo/storage/column/vertical_sequence_index.cpp Outdated
Comment thread src/silo/test/mutation_profile.test.cpp
Comment thread performance/sequence_generator.h
@Taepper Taepper force-pushed the 1179-mutation-profile branch from 3c324c7 to d15c000 Compare March 3, 2026 09:30
@Taepper Taepper force-pushed the 1179-mutation-profile branch from d15c000 to 9832a96 Compare March 11, 2026 14:08
Comment thread documentation/query_documentation.md Outdated
Comment thread documentation/query_documentation.md Outdated
Comment thread src/silo/query_engine/filter/expressions/mutation_profile.cpp Outdated
Comment thread src/silo/query_engine/filter/expressions/mutation_profile.cpp
Comment thread src/silo/test/mutation_profile.test.cpp
@Taepper Taepper force-pushed the 1179-mutation-profile branch 2 times, most recently from 71b3438 to bcf53b1 Compare April 15, 2026 13:57
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +35 to +57
template <typename T>
std::string joinWithLimit(
const std::vector<T>& items,
std::string_view delimiter = ", ",
size_t limit = 10
) {
std::string res;
const size_t items_to_print = std::min(items.size(), limit);

for (size_t i = 0; i < items_to_print; ++i) {
if (i > 0) {
res += delimiter;
}
// Assumes items[i] has a toString() method or works with fmt
res += items[i]->toString();
}

if (items.size() > items_to_print) {
res += fmt::format("{}... ({} more)", delimiter, items.size() - items_to_print);
}

return res;
}
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

joinWithLimit is templated as if it supports arbitrary T, but it unconditionally calls items[i]->toString(). Either constrain the template to pointer-like types (e.g., via a requires clause) or update the comment/signature to reflect that it expects T to be a smart/raw pointer to a type with toString().

Copilot uses AI. Check for mistakes.
Comment on lines +130 to +137
const auto& primary_key_column = table.columns.string_columns.at(primary_key_name);
for (uint32_t row_id = 0; row_id < static_cast<uint32_t>(primary_key_column.numValues());
++row_id) {
if (primary_key_column.getValueString(row_id) == seq_id) {
found_row_id = row_id;
break;
}
}
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildProfileFromSequenceId() does an O(n) linear scan over the primary-key column when the PK type is STRING. On large tables this will make sequenceId-based MutationProfile queries expensive even though it’s just a lookup. Consider requiring/using an indexed primary key for this path or adding a lookup/index helper to retrieve the row id without a full scan.

Suggested change
const auto& primary_key_column = table.columns.string_columns.at(primary_key_name);
for (uint32_t row_id = 0; row_id < static_cast<uint32_t>(primary_key_column.numValues());
++row_id) {
if (primary_key_column.getValueString(row_id) == seq_id) {
found_row_id = row_id;
break;
}
}
throw IllegalQueryException(fmt::format(
"{} MutationProfile sequenceId lookup requires an indexed string primary key; "
"primary key '{}' is configured as STRING",
SymbolType::SYMBOL_NAME,
primary_key_name
));

Copilot uses AI. Check for mistakes.
Comment on lines +109 to +123
std::vector<std::string_view> current_gen = {reference};
std::bernoulli_distribution survives(1.0 - death_rate);

for (size_t gen = 0; gen < generations; ++gen) {
std::vector<std::string_view> next_gen;
for (const auto& seq : current_gen) {
for (size_t child = 0; child < children_per_node; ++child) {
if (survives(rng)) {
all_sequences.push_back(mutateSequence(seq));
next_gen.push_back(all_sequences.back());
}
}
}
if (next_gen.empty()) {
next_gen.push_back(all_sequences.back());
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SequenceTreeGenerator::generateEvolvedSequences() stores std::string_view entries that refer to strings inside all_sequences, but all_sequences.push_back(...) can reallocate and invalidate all existing views. This is undefined behavior and can lead to corrupted sequences/crashes in the benchmarks. Use stable storage (e.g., std::deque<std::string>), reserve enough capacity up front, or store indices/owning std::string values in current_gen/next_gen instead of string_view.

Suggested change
std::vector<std::string_view> current_gen = {reference};
std::bernoulli_distribution survives(1.0 - death_rate);
for (size_t gen = 0; gen < generations; ++gen) {
std::vector<std::string_view> next_gen;
for (const auto& seq : current_gen) {
for (size_t child = 0; child < children_per_node; ++child) {
if (survives(rng)) {
all_sequences.push_back(mutateSequence(seq));
next_gen.push_back(all_sequences.back());
}
}
}
if (next_gen.empty()) {
next_gen.push_back(all_sequences.back());
std::vector<size_t> current_gen = {0};
std::bernoulli_distribution survives(1.0 - death_rate);
for (size_t gen = 0; gen < generations; ++gen) {
std::vector<size_t> next_gen;
for (size_t seq_index : current_gen) {
for (size_t child = 0; child < children_per_node; ++child) {
if (survives(rng)) {
all_sequences.push_back(mutateSequence(all_sequences.at(seq_index)));
next_gen.push_back(all_sequences.size() - 1);
}
}
}
if (next_gen.empty()) {
next_gen.push_back(all_sequences.size() - 1);

Copilot uses AI. Check for mistakes.
"You need to provide the sequence name with the AminoAcid MutationProfile filter."
};

// No sequenceName provided and no default AA sequence in the config → error
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment describes the “no sequenceName provided” error case, but the scenario below is actually testing an out-of-bounds mutation position with sequenceName: "gene1". Update the comment to match what the test is asserting.

Suggested change
// No sequenceName provided and no default AA sequence in the config → error
// Mutation position is outside the bounds of the specified AA sequence → error

Copilot uses AI. Check for mistakes.
Comment on lines 7 to 11
#include <boost/algorithm/string/join.hpp>
#include <nlohmann/json.hpp>

#include "silo/common/string_utils.h"
#include "silo/query_engine/filter/expressions/expression.h"
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<boost/algorithm/string/join.hpp> appears unused after switching toString() to joinWithLimit. Consider removing the Boost include to reduce compile time and avoid unused-include drift.

Copilot uses AI. Check for mistakes.
@@ -11,6 +11,7 @@
#include <boost/algorithm/string/join.hpp>
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<boost/algorithm/string/join.hpp> appears unused after switching toString() to joinWithLimit. Consider removing the Boost include to reduce compile time and avoid unused-include drift.

Suggested change
#include <boost/algorithm/string/join.hpp>

Copilot uses AI. Check for mistakes.
Comment on lines 45 to 56
std::string Threshold::toString() const {
std::string res;
if (match_exactly) {
res += "=";
} else {
res += ">=";
}
for (const auto& child : this->non_negated_children) {
res += ", " + child->toString();
}
for (const auto& child : this->non_negated_children) {
res += ", ! " + child->toString();
}
res += fmt::format("{}-of", number_of_matchers);

res += "non_negated: (" + joinWithLimit(non_negated_children) + ") ";
res += "negated: (" + joinWithLimit(negated_children) + ") ";
res += ")";
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Threshold::toString() ends with ), but never adds an opening (, and it concatenates "{}-of" directly with "non_negated" without a separator. This produces malformed/unreadable operator strings; adjust the formatting to include balanced parentheses and clear separators (e.g., prefix with ( / Threshold( and add a space or delimiter after -of).

Copilot uses AI. Check for mistakes.
)
};

// Profile with explicit mutation C at position 1 (0-indexed: 0), distance=0
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the mutation position is “0-indexed”, but the filter JSON uses 1-based positions (and the code enforces that). Please update the comment to avoid confusion about the indexing convention.

Suggested change
// Profile with explicit mutation C at position 1 (0-indexed: 0), distance=0
// Profile with explicit mutation C at 1-based position 1, distance=0

Copilot uses AI. Check for mistakes.
Comment on lines 45 to +52
std::string Intersection::toString() const {
std::string res = "(" + children[0]->toString();

for (uint32_t i = 1; i < children.size(); i++) {
res += " & " + children[i]->toString();
}
for (const auto& child : negated_children) {
res += " &! " + child->toString();
res += joinWithLimit(children, " & ");
if (!negated_children.empty()) {
res += " &! " + joinWithLimit(negated_children, " &! ");
}

Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intersection::toString() currently prepends children[0]->toString() and then appends joinWithLimit(children, " & "), which reprints the first child and also drops the delimiter between the two parts (resulting in duplicated/concatenated output). Build the string using only joinWithLimit(children, " & ") (or join from the second child) to avoid duplication and ensure correct separators.

Copilot uses AI. Check for mistakes.
Comment on lines +94 to +96
"querySequence": string
| "sequenceId": string
| "mutations": {"position": number, "symbol": string}[]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"querySequence": string
| "sequenceId": string
| "mutations": {"position": number, "symbol": string}[]
// exactly one of these
"querySequence": string
"sequenceId": string
"mutations": {"position": number, "symbol": string}[]

would be easier to understand (at least for me)

Comment on lines +48 to +49
// Assumes items[i] has a toString() method or works with fmt
res += items[i]->toString();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe overkill here, but actually concepts would be the way to properly type this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly

Comment thread src/silo/query_engine/filter/expressions/nof.cpp
Comment thread src/silo/query_engine/filter/expressions/mutation_profile.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add MutationProfile filter

3 participants