
Compute Anomaly Data Treatments #62

Open

MattsonCam wants to merge 2 commits into main from compute_agg_anomaly

Conversation

@MattsonCam (Member)

Computes anomaly data after aggregating to the treatment level in the pilot dataset. Treatment profiles are aggregated per plate, since the single-cell profiles for a single plate already take a large amount of memory. The anomaly data is stored locally, like much of the other JUMP data, for consistency.
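For context, a minimal sketch of this per-plate, treatment-level aggregation, assuming one parquet file of single-cell profiles per plate; the paths and the Metadata_treatment column name are hypothetical placeholders, not the PR's actual identifiers:

import pandas as pd

# Aggregate each plate's single-cell profiles to one row per treatment,
# then stack the per-plate results (placeholder names throughout).
treatment_col_name = "Metadata_treatment"
plate_paths = ["plate1.parquet", "plate2.parquet"]

agg_plates = []
for path in plate_paths:
    morph_platedf = pd.read_parquet(path)
    feat_cols = [c for c in morph_platedf.columns if not c.startswith("Metadata")]
    agg_mapping = dict.fromkeys(feat_cols, "sum")
    agg_plates.append(
        morph_platedf.groupby(treatment_col_name).agg(agg_mapping).reset_index()
    )

agg_platedf = pd.concat(agg_plates, ignore_index=True)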

Copilot AI review requested due to automatic review settings February 12, 2026 16:57

Copilot AI left a comment

Pull request overview

Adds treatment-level anomaly scoring/feature-contribution computation on aggregated (per-plate) profiles, and refactors the Isolation Forest feature-importance utility to use split-gain contributions.

Changes:

  • Refactors IsoforestFeatureImportance to compute per-sample, per-feature split-gain contributions (with expected path-length normalization); see the sketch after this list.
  • Adds a new notebook + nbconverted script to aggregate profiles to treatment-level and compute anomaly scores/contributions.
  • Extends the single-cell anomaly pipeline shell script to run the new aggregate-treatment job.
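A minimal, self-contained sketch of one plausible split-gain scheme of this kind, as mentioned in the list above: at each split on a sample's root-to-leaf path, credit the drop in expected path length c(parent) - c(child) to the split feature. This illustrates the general technique only; it is not this PR's implementation, and all names are placeholders.

import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def c(n):
    # Expected path length of an unsuccessful BST search over n points,
    # the standard iForest normalizer (Liu et al., 2008).
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (np.log(n - 1.0) + EULER_GAMMA) - 2.0 * (n - 1.0) / n

def split_gain_contributions(fitted_tree, x, n_features):
    # Walk one sample x down one fitted sklearn tree, crediting each
    # split's drop in expected path length to the feature that split.
    t = fitted_tree.tree_
    contrib = np.zeros(n_features)
    node = 0
    while t.children_left[node] != -1:  # -1 marks a leaf in sklearn trees
        feat = t.feature[node]
        child = (t.children_left[node] if x[feat] <= t.threshold[node]
                 else t.children_right[node])
        contrib[feat] += c(t.n_node_samples[node]) - c(t.n_node_samples[child])
        node = child
    return contrib

Summed over all trees (after mapping each tree's local feature indices back through estimators_features_, as discussed below) and divided by the forest's expected path length c(n_samples), this would yield normalized per-sample, per-feature contributions.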

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.

File / Description:

  • 2.evaluate_data/utils/isolation_forest_data_feature_importance.py: Replaces prior depth-based importance logic with per-sample split-gain contributions and optional per-tree feature-subset support.
  • 2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_aggregate_treatment_anomaly_data.py: New script to aggregate treatment profiles per plate and compute IsolationForest anomaly scores + contributions.
  • 2.evaluate_data/compute_sc_anomalyze_data/compute_aggregate_treatment_anomaly_data.ipynb: Source notebook for the above script (used by nbconvert).
  • 2.evaluate_data/compute_sc_anomalyze_data/compute_all_sc_anomalyze_data.sh: Adds a step to run the aggregate-treatment anomaly script.


" morph_platedf.groupby(treatment_col_name).agg(agg_mapping).reset_index()\n",
" )\n",
"\n",
"feat_cols = list(feat_cols)\n",
Copilot AI Feb 12, 2026

feat_cols comes from a set() and is converted to a list without sorting, which makes feature order non-deterministic (and affects IsolationForest.feature_names_in_ and outputs). Convert to a stable order (e.g., sorted(feat_cols)) before fitting/saving.

Suggested change
"feat_cols = list(feat_cols)\n",
"feat_cols = sorted(feat_cols)\n",

}
},
"outputs": [],
"source": [
Copilot AI Feb 12, 2026

The notebook computes per-feature contributions in result but saves agg_platedf, so the feature importances are not persisted. Save result (or merge the contribution columns into what you write) so downstream analysis can access them.

Suggested change
"source": [
"source": [
"# Merge per-feature contribution columns from `result` into `agg_platedf`\n",
"contrib_cols = [col for col in result.columns if col not in agg_platedf.columns]\n",
"if contrib_cols:\n",
" agg_platedf = agg_platedf.join(result[contrib_cols], how=\"left\")\n",
"\n",

Comment on lines 51 to +52
/usr/bin/time -v python3 "$pypath/compute_sc_feature_importance_data.py"
/usr/bin/time -v python3 "$pypath/compute_aggregate_treatment_anomaly_data.py"
Copilot AI Feb 12, 2026

py_path is defined (line 11), but the script uses $pypath when invoking the Python jobs. This will cause the script to fail with an undefined variable error and skip both feature-importance and aggregate-treatment runs. Use the same variable name consistently (e.g., replace $pypath with $py_path on these invocations).

Suggested change
/usr/bin/time -v python3 "$pypath/compute_sc_feature_importance_data.py"
/usr/bin/time -v python3 "$pypath/compute_aggregate_treatment_anomaly_data.py"
/usr/bin/time -v python3 "$py_path/compute_sc_feature_importance_data.py"
/usr/bin/time -v python3 "$py_path/compute_aggregate_treatment_anomaly_data.py"

if "__file__" in globals()
else pathlib.Path.cwd()
)
utils_dir = (script_dir.parent / "utils").resolve(strict=True)
Copilot AI Feb 12, 2026

When executed from nbconverted/, script_dir.parent / "utils" points to compute_sc_anomalyze_data/utils (which doesn’t exist) and resolve(strict=True) will raise. Build the utils path relative to 2.evaluate_data (e.g., script_dir.parents[1] / "utils") or reuse the git-root discovery logic used in other scripts.

Suggested change
utils_dir = (script_dir.parent / "utils").resolve(strict=True)
utils_dir = (script_dir.parents[1] / "utils").resolve(strict=True)

Comment on lines +34 to +37
root_dir = pathlib.Path("../../").resolve(strict=True)
big_drive_path = pathlib.Path("/mnt/big_drive").resolve(strict=True)

agg_anomaly_data_path = (big_drive_path / "feature_selected_sc_qc_data").resolve(
Copilot AI Feb 12, 2026

big_drive_path is hard-coded to /mnt/big_drive, but the rest of the pipeline uses <git_root>/big_drive. This mismatch will break runs on systems without that mount. Consider deriving big_drive_path from root_dir / "big_drive" (and optionally allow an override via an env var/CLI arg).
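For example, a minimal sketch of that derivation, assuming the root_dir defined in the diff above; the BIG_DRIVE_PATH environment variable is a hypothetical name for this illustration, not an existing convention in the repo:

import os
import pathlib

root_dir = pathlib.Path("../../").resolve(strict=True)

# Default to <git_root>/big_drive, with an optional env-var override
# ("BIG_DRIVE_PATH" is a hypothetical name).
big_drive_path = pathlib.Path(
    os.environ.get("BIG_DRIVE_PATH", root_dir / "big_drive")
).resolve(strict=True)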

Comment on lines +73 to +76
"root_dir = pathlib.Path(\"../../\").resolve(strict=True)\n",
"big_drive_path = pathlib.Path(\"/mnt/big_drive\").resolve(strict=True)\n",
"\n",
"agg_anomaly_data_path = (big_drive_path / \"feature_selected_sc_qc_data\").resolve(\n",
Copilot AI Feb 12, 2026

big_drive_path is hard-coded to /mnt/big_drive, but other pipeline scripts use <git_root>/big_drive. This will break runs on systems without that mount. Prefer deriving it from root_dir / "big_drive" (or make it configurable).

"metadf = agg_platedf.copy().filter(regex=\"Metadata|Result\")\n",
"\n",
"result = IsoforestFeatureImportance(\n",
" estimators=isofor.estimators_,\n",
Copilot AI Feb 12, 2026

IsoforestFeatureImportance should be given estimators_features=isofor.estimators_features_ to correctly map each tree’s split feature indices back to the global feature columns (sklearn fits each estimator on X[:, estimators_features_[i]]). Without it, gains can be attributed to the wrong features.

Suggested change
" estimators=isofor.estimators_,\n",
" estimators=isofor.estimators_,\n",
" estimators_features=isofor.estimators_features_,\n",

morph_platedf.groupby(treatment_col_name).agg(agg_mapping).reset_index()
)

feat_cols = list(feat_cols)
Copilot AI Feb 12, 2026

feat_cols is built as a set and later converted to a list without sorting. This makes column order (and therefore IsolationForest.feature_names_in_ and downstream outputs) non-deterministic across runs/processes. Convert to a stable order (e.g., sorted(feat_cols)) before fitting/saving.

Suggested change
feat_cols = list(feat_cols)
feat_cols = sorted(feat_cols)

Comment on lines +136 to +139
morphology_data=agg_platedf[list(isofor.feature_names_in_)],
)()


Copilot AI Feb 12, 2026

IsoforestFeatureImportance needs the per-tree feature index mapping (IsolationForest.estimators_features_) to correctly traverse trees, because each estimator is fit on X[:, estimators_features_[i]] in scikit-learn (even when max_features=1.0, this can be a permutation). Calling it without estimators_features can compute gains using the wrong input feature at each split. Pass estimators_features=isofor.estimators_features_ here (and similarly for other call sites using sklearn IsolationForest).

Suggested change
morphology_data=agg_platedf[list(isofor.feature_names_in_)],
)()
estimators_features=isofor.estimators_features_,
morphology_data=agg_platedf[list(isofor.feature_names_in_)],
)()
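To make the mapping concrete (a hedged sketch, not code from this PR): each tree i in a fitted sklearn bagging ensemble only sees the columns listed in estimators_features_[i], so a split's local feature index must be translated back to a global column before attributing any gain.

import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative only: show how a tree's local split indices map back to
# global feature columns via estimators_features_.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

isofor = IsolationForest(max_features=0.5, random_state=0).fit(X)

tree0 = isofor.estimators_[0]
local_to_global = isofor.estimators_features_[0]  # e.g., array of 3 column indices

root_local = tree0.tree_.feature[0]        # index into the tree's feature subset
root_global = local_to_global[root_local]  # index into X's columns
print(root_local, "->", root_global)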

" if \"__file__\" in globals()\n",
" else pathlib.Path.cwd()\n",
")\n",
"utils_dir = (script_dir.parent / \"utils\").resolve(strict=True)\n",
Copilot AI Feb 12, 2026

This utils_dir resolution works in a live notebook (no __file__), but the nbconvert-generated script will have __file__ defined and then script_dir.parent / "utils" points to compute_sc_anomalyze_data/utils (nonexistent), causing resolve(strict=True) to raise. Update the notebook to compute the utils path relative to 2.evaluate_data (e.g., script_dir.parents[1] / "utils") so the converted script runs.

Suggested change
"utils_dir = (script_dir.parent / \"utils\").resolve(strict=True)\n",
"utils_dir = (script_dir.parents[1] / \"utils\").resolve(strict=True)\n",

@MattsonCam requested a review from axiomcura February 12, 2026 18:34
@axiomcura (Member) left a comment


LGTM, left some comments and suggestions

Comment on lines +73 to +76
if not feat_cols:
feat_cols = plate_feat_cols
else:
feat_cols &= plate_feat_cols
Member

I am assuming that the plates here will have the same feature space. If not, this can trigger some unwanted errors where the &= will return an empty set.
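For example, a minimal guard along those lines (a sketch against the variables in the diff, not code from this PR):

# Fail fast if a plate's feature space diverges instead of silently
# shrinking the intersection to an empty set.
if not feat_cols:
    feat_cols = plate_feat_cols
else:
    feat_cols &= plate_feat_cols
    if not feat_cols:
        raise ValueError("Plates share no common feature columns")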

else:
feat_cols &= plate_feat_cols

agg_mapping = dict.fromkeys(feat_cols, "sum")
Member

Is there a possibility where you might get a "0"? What would happen in this code?


metadf = agg_platedf.copy().filter(regex="Metadata|Result")

result = IsoforestFeatureImportance(
Member

I see this variable result, but it doesn't seem to be saved at all. Its last usage here is in meta_resultdf, but it seems that doesn't get saved either.

Comment on lines +117 to +123
isofor = IsolationForest(n_estimators=1_000, random_state=0, n_jobs=-1)
agg_platedf = agg_platedf.assign(
**{
"Result_inlier": isofor.fit_predict(agg_platedf[feat_cols]),
"Result_anomaly_score": isofor.decision_function(agg_platedf[feat_cols]),
}
)
Member

I would recommend adding a check for NaN or inf values if those are something you don't want in the results.
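For instance, a minimal pre-fit check (a sketch using the agg_platedf and feat_cols variables from the diff above):

import numpy as np

# Guard against NaN/inf reaching the model input before fitting.
values = agg_platedf[feat_cols].to_numpy(dtype=float)
if not np.isfinite(values).all():
    raise ValueError("agg_platedf contains NaN or inf feature values")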

# In[6]:


isofor = IsolationForest(n_estimators=1_000, random_state=0, n_jobs=-1)
Member

Just a quick question - my understanding is that IsolationForest typically requires a large number of samples to reliably identify potential outliers. Since you’re working with aggregate profiles, the effective sample size is greatly reduced. I’m wondering whether the model can robustly handle this shift, given how drastically the number of profiles decreases after aggregation.

Comment on lines +15 to +17
Compute expected path length c(n) for an array of node sizes, where c(n)
is the average path length of an unsuccessful Binary Search Tree (BST) search (same as the
iForest remaining-path normalizer and leaf correction term).
Member

Suggested change
Compute expected path length c(n) for an array of node sizes, where c(n)
is the average path length of an unsuccessful Binary Search Tree (BST) search (same as the
iForest remaining-path normalizer and leaf correction term).
Compute expected path length c(n) for an array of node sizes, where c(n)
is the average path length of an unsuccessful Binary Search Tree (BST) search (same
as the iForest remaining-path normalizer and leaf correction term).

line was too long
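For reference, the c(n) this docstring describes is the standard iForest normalizer from Liu et al. (2008): c(n) = 2*H(n-1) - 2*(n-1)/n, where H(i) ≈ ln(i) + γ (the Euler-Mascheroni constant, ≈ 0.5772), with c(2) = 1 and c(n) = 0 for n ≤ 1.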
