Conversation
Pull request overview
Adds treatment-level anomaly scoring/feature-contribution computation on aggregated (per-plate) profiles, and refactors the Isolation Forest feature-importance utility to use split-gain contributions.
Changes:
- Refactors `IsoforestFeatureImportance` to compute per-sample, per-feature split-gain contributions (with expected path-length normalization).
- Adds a new notebook + nbconverted script to aggregate profiles to treatment level and compute anomaly scores/contributions.
- Extends the single-cell anomaly pipeline shell script to run the new aggregate-treatment job.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| 2.evaluate_data/utils/isolation_forest_data_feature_importance.py | Replaces prior depth-based importance logic with per-sample split-gain contributions and optional per-tree feature-subset support. |
| 2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_aggregate_treatment_anomaly_data.py | New script to aggregate treatment profiles per plate and compute IsolationForest anomaly scores + contributions. |
| 2.evaluate_data/compute_sc_anomalyze_data/compute_aggregate_treatment_anomaly_data.ipynb | Source notebook for the above script (used by nbconvert). |
| 2.evaluate_data/compute_sc_anomalyze_data/compute_all_sc_anomalyze_data.sh | Adds a step to run the aggregate-treatment anomaly script. |
| " morph_platedf.groupby(treatment_col_name).agg(agg_mapping).reset_index()\n", | ||
| " )\n", | ||
| "\n", | ||
| "feat_cols = list(feat_cols)\n", |
`feat_cols` comes from a `set()` and is converted to a list without sorting, which makes feature order non-deterministic (and affects `IsolationForest.feature_names_in_` and outputs). Convert to a stable order (e.g., `sorted(feat_cols)`) before fitting/saving.
| "feat_cols = list(feat_cols)\n", | |
| "feat_cols = sorted(feat_cols)\n", |
```
    }
   },
   "outputs": [],
   "source": [
```
The notebook computes per-feature contributions in `result` but saves `agg_platedf`, so the feature importances are not persisted. Save `result` (or merge the contribution columns into what you write) so downstream analysis can access them.
| "source": [ | |
| "source": [ | |
| "# Merge per-feature contribution columns from `result` into `agg_platedf`\n", | |
| "contrib_cols = [col for col in result.columns if col not in agg_platedf.columns]\n", | |
| "if contrib_cols:\n", | |
| " agg_platedf = agg_platedf.join(result[contrib_cols], how=\"left\")\n", | |
| "\n", |
```sh
/usr/bin/time -v python3 "$pypath/compute_sc_feature_importance_data.py"
/usr/bin/time -v python3 "$pypath/compute_aggregate_treatment_anomaly_data.py"
```
`py_path` is defined (line 11), but the script uses `$pypath` when invoking the Python jobs. This will cause these invocations to fail (an empty path under default shell settings, or an unbound-variable error under `set -u`) and skip both the feature-importance and aggregate-treatment runs. Use the same variable name consistently (e.g., replace `$pypath` with `$py_path` in these invocations).
```diff
-/usr/bin/time -v python3 "$pypath/compute_sc_feature_importance_data.py"
-/usr/bin/time -v python3 "$pypath/compute_aggregate_treatment_anomaly_data.py"
+/usr/bin/time -v python3 "$py_path/compute_sc_feature_importance_data.py"
+/usr/bin/time -v python3 "$py_path/compute_aggregate_treatment_anomaly_data.py"
```
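As a defensive follow-up, adding `set -u` (or `set -euo pipefail`) near the top of the shell script would turn this class of typo into an immediate, loud failure instead of a silently empty path.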
| if "__file__" in globals() | ||
| else pathlib.Path.cwd() | ||
| ) | ||
| utils_dir = (script_dir.parent / "utils").resolve(strict=True) |
When executed from `nbconverted/`, `script_dir.parent / "utils"` points to `compute_sc_anomalyze_data/utils` (which doesn't exist) and `resolve(strict=True)` will raise. Build the utils path relative to `2.evaluate_data` (e.g., `script_dir.parents[1] / "utils"`) or reuse the git-root discovery logic used in other scripts.
```diff
-utils_dir = (script_dir.parent / "utils").resolve(strict=True)
+utils_dir = (script_dir.parents[1] / "utils").resolve(strict=True)
```
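For the git-root alternative, a minimal sketch (assuming `git` is on PATH and the repo layout from the file table above; the helper name is ours, not the project's):

```python
# Sketch: locate the repository root via git, then build the utils path from it.
import pathlib
import subprocess

def find_git_root() -> pathlib.Path:
    """Return the repository root reported by `git rev-parse`."""
    top = subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True)
    return pathlib.Path(top.strip())

utils_dir = (find_git_root() / "2.evaluate_data" / "utils").resolve(strict=True)
```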
```python
root_dir = pathlib.Path("../../").resolve(strict=True)
big_drive_path = pathlib.Path("/mnt/big_drive").resolve(strict=True)

agg_anomaly_data_path = (big_drive_path / "feature_selected_sc_qc_data").resolve(
```
`big_drive_path` is hard-coded to `/mnt/big_drive`, but the rest of the pipeline uses `<git_root>/big_drive`. This mismatch will break runs on systems without that mount. Consider deriving `big_drive_path` from `root_dir / "big_drive"` (and optionally allow an override via an env var/CLI arg).
| "root_dir = pathlib.Path(\"../../\").resolve(strict=True)\n", | ||
| "big_drive_path = pathlib.Path(\"/mnt/big_drive\").resolve(strict=True)\n", | ||
| "\n", | ||
| "agg_anomaly_data_path = (big_drive_path / \"feature_selected_sc_qc_data\").resolve(\n", |
Same issue in the notebook: `big_drive_path` is hard-coded to `/mnt/big_drive`, but other pipeline scripts use `<git_root>/big_drive`, which will break runs on systems without that mount. Prefer deriving it from `root_dir / "big_drive"` (or make it configurable).
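One possible shape for that derivation, as a sketch (`BIG_DRIVE_PATH` is a hypothetical environment-variable name, not something the project defines):

```python
# Sketch: derive big_drive_path from the repo root, with an optional override.
import os
import pathlib

root_dir = pathlib.Path("../../").resolve(strict=True)
big_drive_path = pathlib.Path(
    os.environ.get("BIG_DRIVE_PATH", str(root_dir / "big_drive"))
).resolve(strict=True)
```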
| "metadf = agg_platedf.copy().filter(regex=\"Metadata|Result\")\n", | ||
| "\n", | ||
| "result = IsoforestFeatureImportance(\n", | ||
| " estimators=isofor.estimators_,\n", |
`IsoforestFeatureImportance` should be given `estimators_features=isofor.estimators_features_` to correctly map each tree's split feature indices back to the global feature columns (sklearn fits each estimator on `X[:, estimators_features_[i]]`). Without it, gains can be attributed to the wrong features.
| " estimators=isofor.estimators_,\n", | |
| " estimators=isofor.estimators_,\n", | |
| " estimators_features=isofor.estimators_features_,\n", |
```python
    morph_platedf.groupby(treatment_col_name).agg(agg_mapping).reset_index()
)

feat_cols = list(feat_cols)
```
`feat_cols` is built as a set and later converted to a list without sorting. This makes column order (and therefore `IsolationForest.feature_names_in_` and downstream outputs) non-deterministic across runs/processes. Convert to a stable order (e.g., `sorted(feat_cols)`) before fitting/saving.
```diff
-feat_cols = list(feat_cols)
+feat_cols = sorted(feat_cols)
```
```python
    morphology_data=agg_platedf[list(isofor.feature_names_in_)],
)()
```
`IsoforestFeatureImportance` needs the per-tree feature index mapping (`IsolationForest.estimators_features_`) to correctly traverse trees, because each estimator is fit on `X[:, estimators_features_[i]]` in scikit-learn (even when `max_features=1.0`, this can be a permutation). Calling it without `estimators_features` can compute gains using the wrong input feature at each split. Pass `estimators_features=isofor.estimators_features_` here (and similarly for other call sites using sklearn `IsolationForest`).
```diff
+    estimators_features=isofor.estimators_features_,
     morphology_data=agg_platedf[list(isofor.feature_names_in_)],
 )()
```
| " if \"__file__\" in globals()\n", | ||
| " else pathlib.Path.cwd()\n", | ||
| ")\n", | ||
| "utils_dir = (script_dir.parent / \"utils\").resolve(strict=True)\n", |
This `utils_dir` resolution works in a live notebook (no `__file__`), but the nbconvert-generated script will have `__file__` defined, and then `script_dir.parent / "utils"` points to `compute_sc_anomalyze_data/utils` (nonexistent), causing `resolve(strict=True)` to raise. Update the notebook to compute the utils path relative to `2.evaluate_data` (e.g., `script_dir.parents[1] / "utils"`) so the converted script runs.
| "utils_dir = (script_dir.parent / \"utils\").resolve(strict=True)\n", | |
| "utils_dir = (script_dir.parents[1] / \"utils\").resolve(strict=True)\n", |
axiomcura left a comment
LGTM, left some comments and suggestions.
```python
if not feat_cols:
    feat_cols = plate_feat_cols
else:
    feat_cols &= plate_feat_cols
```
I am assuming that the plates here will have the same feature space. If not, this can trigger some unwanted errors where the `&=` will return an empty set.
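One possible guard, as a sketch (raise instead of silently proceeding with an empty intersection):

```python
# Sketch: fail loudly if the plates share no common feature columns.
if not feat_cols:
    raise ValueError(
        "No feature columns are shared across plates; check that all plates "
        "use the same feature space before aggregating."
    )
```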
```python
else:
    feat_cols &= plate_feat_cols

agg_mapping = dict.fromkeys(feat_cols, "sum")
```
Is there a possibility you might get a "0" here? What would happen in this code?
```python
metadf = agg_platedf.copy().filter(regex="Metadata|Result")

result = IsoforestFeatureImportance(
```
I see this variable as `result`, but it doesn't seem to be saved at all. Its last usage is in `meta_resultdf`, but that doesn't appear to get saved either.
```python
isofor = IsolationForest(n_estimators=1_000, random_state=0, n_jobs=-1)
agg_platedf = agg_platedf.assign(
    **{
        "Result_inlier": isofor.fit_predict(agg_platedf[feat_cols]),
        "Result_anomaly_score": isofor.decision_function(agg_platedf[feat_cols]),
    }
)
```
If it matters here, I would recommend adding a check for NaN or inf values, if those are something you don't want in the results.
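A minimal version of such a check, as a sketch (placed before the `fit_predict` call):

```python
# Sketch: reject non-finite values before fitting the IsolationForest.
import numpy as np

features = agg_platedf[feat_cols].to_numpy()
if not np.isfinite(features).all():
    raise ValueError("Aggregated features contain NaN or inf values.")
```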
```python
# In[6]:


isofor = IsolationForest(n_estimators=1_000, random_state=0, n_jobs=-1)
```
Just a quick question - my understanding is that IsolationForest typically requires a large number of samples to reliably identify potential outliers. Since you’re working with aggregate profiles, the effective sample size is greatly reduced. I’m wondering whether the model can robustly handle this shift, given how drastically the number of profiles decreases after aggregation.
```python
    Compute expected path length c(n) for an array of node sizes, where c(n)
    is the average path length of an unsuccessful Binary Search Tree (BST) search (same as the
    iForest remaining-path normalizer and leaf correction term).
```
```diff
     Compute expected path length c(n) for an array of node sizes, where c(n)
-    is the average path length of an unsuccessful Binary Search Tree (BST) search (same as the
-    iForest remaining-path normalizer and leaf correction term).
+    is the average path length of an unsuccessful Binary Search Tree (BST) search (same
+    as the iForest remaining-path normalizer and leaf correction term).
```
line was too long
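For reference, the normalizer this docstring describes is the standard iForest c(n) from Liu et al. (2008); a self-contained sketch of it (our own helper, not the project's implementation):

```python
# Sketch of the iForest expected path length:
#   c(n) = 2 * H(n - 1) - 2 * (n - 1) / n,  H(i) ≈ ln(i) + Euler-Mascheroni constant
# with the conventions c(2) = 1 and c(n) = 0 for n < 2.
import numpy as np

def expected_path_length(n: np.ndarray) -> np.ndarray:
    n = np.asarray(n, dtype=float)
    c = np.zeros_like(n)
    big = n > 2
    c[big] = 2.0 * (np.log(n[big] - 1.0) + np.euler_gamma) - 2.0 * (n[big] - 1.0) / n[big]
    c[n == 2] = 1.0
    return c
```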
Computes anomaly data after aggregating to the treatment level in the pilot dataset. Treatment profiles are aggregated per plate, since the single-cell profiles for a single plate require a large amount of memory. The anomaly data is stored locally, consistent with much of the other JUMP data.