Conversation
Pull request overview
Adds treatment-level anomaly scoring/feature-contribution computation on aggregated (per-plate) profiles, and refactors the Isolation Forest feature-importance utility to use split-gain contributions.
Changes:
- Refactors `IsoforestFeatureImportance` to compute per-sample, per-feature split-gain contributions (with expected path-length normalization).
- Adds a new notebook + nbconverted script to aggregate profiles to treatment level and compute anomaly scores/contributions.
- Extends the single-cell anomaly pipeline shell script to run the new aggregate-treatment job.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| 2.evaluate_data/utils/isolation_forest_data_feature_importance.py | Replaces prior depth-based importance logic with per-sample split-gain contributions and optional per-tree feature-subset support. |
| 2.evaluate_data/compute_sc_anomalyze_data/nbconverted/compute_aggregate_treatment_anomaly_data.py | New script to aggregate treatment profiles per plate and compute IsolationForest anomaly scores + contributions. |
| 2.evaluate_data/compute_sc_anomalyze_data/compute_aggregate_treatment_anomaly_data.ipynb | Source notebook for the above script (used by nbconvert). |
| 2.evaluate_data/compute_sc_anomalyze_data/compute_all_sc_anomalyze_data.sh | Adds a step to run the aggregate-treatment anomaly script. |
| " morph_platedf.groupby(treatment_col_name).agg(agg_mapping).reset_index()\n", | ||
| " )\n", | ||
| "\n", | ||
| "feat_cols = list(feat_cols)\n", |
`feat_cols` comes from a `set()` and is converted to a list without sorting, which makes feature order non-deterministic (and affects `IsolationForest.feature_names_in_` and outputs). Convert to a stable order (e.g., `sorted(feat_cols)`) before fitting/saving.
| "feat_cols = list(feat_cols)\n", | |
| "feat_cols = sorted(feat_cols)\n", |
```
    }
   },
   "outputs": [],
   "source": [
```
The notebook computes per-feature contributions in `result` but saves `agg_platedf`, so the feature importances are not persisted. Save `result` (or merge the contribution columns into what you write) so downstream analysis can access them.
| "source": [ | |
| "source": [ | |
| "# Merge per-feature contribution columns from `result` into `agg_platedf`\n", | |
| "contrib_cols = [col for col in result.columns if col not in agg_platedf.columns]\n", | |
| "if contrib_cols:\n", | |
| " agg_platedf = agg_platedf.join(result[contrib_cols], how=\"left\")\n", | |
| "\n", |
```sh
/usr/bin/time -v python3 "$pypath/compute_sc_feature_importance_data.py"
/usr/bin/time -v python3 "$pypath/compute_aggregate_treatment_anomaly_data.py"
```
`py_path` is defined (line 11), but the script uses `$pypath` when invoking the Python jobs. This will cause these invocations to fail (an empty path under default shell settings, or an unbound-variable error under `set -u`) and skip both the feature-importance and aggregate-treatment runs. Use the same variable name consistently (e.g., replace `$pypath` with `$py_path` in these invocations).
```diff
-/usr/bin/time -v python3 "$pypath/compute_sc_feature_importance_data.py"
-/usr/bin/time -v python3 "$pypath/compute_aggregate_treatment_anomaly_data.py"
+/usr/bin/time -v python3 "$py_path/compute_sc_feature_importance_data.py"
+/usr/bin/time -v python3 "$py_path/compute_aggregate_treatment_anomaly_data.py"
```
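As a defensive follow-up, adding `set -u` (or `set -euo pipefail`) near the top of the shell script would turn this class of typo into an immediate, loud failure instead of a silently empty path.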
| if "__file__" in globals() | ||
| else pathlib.Path.cwd() | ||
| ) | ||
| utils_dir = (script_dir.parent / "utils").resolve(strict=True) |
When executed from `nbconverted/`, `script_dir.parent / "utils"` points to `compute_sc_anomalyze_data/utils` (which doesn't exist) and `resolve(strict=True)` will raise. Build the utils path relative to `2.evaluate_data` (e.g., `script_dir.parents[1] / "utils"`) or reuse the git-root discovery logic used in other scripts.
```diff
-utils_dir = (script_dir.parent / "utils").resolve(strict=True)
+utils_dir = (script_dir.parents[1] / "utils").resolve(strict=True)
```
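For the git-root alternative, a minimal sketch (assuming `git` is on PATH and the repo layout from the file table above; the helper name is ours, not the project's):

```python
# Sketch: locate the repository root via git, then build the utils path from it.
import pathlib
import subprocess

def find_git_root() -> pathlib.Path:
    """Return the repository root reported by `git rev-parse`."""
    top = subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True)
    return pathlib.Path(top.strip())

utils_dir = (find_git_root() / "2.evaluate_data" / "utils").resolve(strict=True)
```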
```python
root_dir = pathlib.Path("../../").resolve(strict=True)
big_drive_path = pathlib.Path("/mnt/big_drive").resolve(strict=True)

agg_anomaly_data_path = (big_drive_path / "feature_selected_sc_qc_data").resolve(
```
`big_drive_path` is hard-coded to `/mnt/big_drive`, but the rest of the pipeline uses `<git_root>/big_drive`. This mismatch will break runs on systems without that mount. Consider deriving `big_drive_path` from `root_dir / "big_drive"` (and optionally allow an override via an env var/CLI arg).
| "root_dir = pathlib.Path(\"../../\").resolve(strict=True)\n", | ||
| "big_drive_path = pathlib.Path(\"/mnt/big_drive\").resolve(strict=True)\n", | ||
| "\n", | ||
| "agg_anomaly_data_path = (big_drive_path / \"feature_selected_sc_qc_data\").resolve(\n", |
Same issue in the notebook: `big_drive_path` is hard-coded to `/mnt/big_drive`, but other pipeline scripts use `<git_root>/big_drive`, which will break runs on systems without that mount. Prefer deriving it from `root_dir / "big_drive"` (or make it configurable).
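One possible shape for that derivation, as a sketch (`BIG_DRIVE_PATH` is a hypothetical environment-variable name, not something the project defines):

```python
# Sketch: derive big_drive_path from the repo root, with an optional override.
import os
import pathlib

root_dir = pathlib.Path("../../").resolve(strict=True)
big_drive_path = pathlib.Path(
    os.environ.get("BIG_DRIVE_PATH", str(root_dir / "big_drive"))
).resolve(strict=True)
```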
| "metadf = agg_platedf.copy().filter(regex=\"Metadata|Result\")\n", | ||
| "\n", | ||
| "result = IsoforestFeatureImportance(\n", | ||
| " estimators=isofor.estimators_,\n", |
`IsoforestFeatureImportance` should be given `estimators_features=isofor.estimators_features_` to correctly map each tree's split feature indices back to the global feature columns (sklearn fits each estimator on `X[:, estimators_features_[i]]`). Without it, gains can be attributed to the wrong features.
| " estimators=isofor.estimators_,\n", | |
| " estimators=isofor.estimators_,\n", | |
| " estimators_features=isofor.estimators_features_,\n", |
```python
    morph_platedf.groupby(treatment_col_name).agg(agg_mapping).reset_index()
)

feat_cols = list(feat_cols)
```
`feat_cols` is built as a set and later converted to a list without sorting. This makes column order (and therefore `IsolationForest.feature_names_in_` and downstream outputs) non-deterministic across runs/processes. Convert to a stable order (e.g., `sorted(feat_cols)`) before fitting/saving.
```diff
-feat_cols = list(feat_cols)
+feat_cols = sorted(feat_cols)
```
```python
    morphology_data=agg_platedf[list(isofor.feature_names_in_)],
)()
```
`IsoforestFeatureImportance` needs the per-tree feature index mapping (`IsolationForest.estimators_features_`) to correctly traverse trees, because each estimator is fit on `X[:, estimators_features_[i]]` in scikit-learn (even when `max_features=1.0`, this can be a permutation). Calling it without `estimators_features` can compute gains using the wrong input feature at each split. Pass `estimators_features=isofor.estimators_features_` here (and similarly for other call sites using sklearn `IsolationForest`).
```diff
+    estimators_features=isofor.estimators_features_,
     morphology_data=agg_platedf[list(isofor.feature_names_in_)],
 )()
```
| " if \"__file__\" in globals()\n", | ||
| " else pathlib.Path.cwd()\n", | ||
| ")\n", | ||
| "utils_dir = (script_dir.parent / \"utils\").resolve(strict=True)\n", |
This `utils_dir` resolution works in a live notebook (no `__file__`), but the nbconvert-generated script will have `__file__` defined, and then `script_dir.parent / "utils"` points to `compute_sc_anomalyze_data/utils` (nonexistent), causing `resolve(strict=True)` to raise. Update the notebook to compute the utils path relative to `2.evaluate_data` (e.g., `script_dir.parents[1] / "utils"`) so the converted script runs.
| "utils_dir = (script_dir.parent / \"utils\").resolve(strict=True)\n", | |
| "utils_dir = (script_dir.parents[1] / \"utils\").resolve(strict=True)\n", |
axiomcura left a comment
LGTM, left some comments and suggestions.
```python
if not feat_cols:
    feat_cols = plate_feat_cols
else:
    feat_cols &= plate_feat_cols
```
I am assuming that the plates here will have the same feature space. If not, this can trigger some unwanted errors where the `&=` will return an empty set.
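One possible guard, as a sketch (raise instead of silently proceeding with an empty intersection):

```python
# Sketch: fail loudly if the plates share no common feature columns.
if not feat_cols:
    raise ValueError(
        "No feature columns are shared across plates; check that all plates "
        "use the same feature space before aggregating."
    )
```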
```python
else:
    feat_cols &= plate_feat_cols

agg_mapping = dict.fromkeys(feat_cols, "sum")
```
Is there a possibility you might get a "0" here? What would happen in this code?
```python
metadf = agg_platedf.copy().filter(regex="Metadata|Result")

result = IsoforestFeatureImportance(
```
I see this variable as `result`, but it doesn't seem to be saved at all. Its last usage is in `meta_resultdf`, but that doesn't appear to get saved either.
```python
isofor = IsolationForest(n_estimators=1_000, random_state=0, n_jobs=-1)
agg_platedf = agg_platedf.assign(
    **{
        "Result_inlier": isofor.fit_predict(agg_platedf[feat_cols]),
        "Result_anomaly_score": isofor.decision_function(agg_platedf[feat_cols]),
    }
)
```
If it matters here, I would recommend adding a check for NaN or inf values, if those are something you don't want in the results.
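A minimal version of such a check, as a sketch (placed before the `fit_predict` call):

```python
# Sketch: reject non-finite values before fitting the IsolationForest.
import numpy as np

features = agg_platedf[feat_cols].to_numpy()
if not np.isfinite(features).all():
    raise ValueError("Aggregated features contain NaN or inf values.")
```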
```python
# In[6]:


isofor = IsolationForest(n_estimators=1_000, random_state=0, n_jobs=-1)
```
Just a quick question - my understanding is that IsolationForest typically requires a large number of samples to reliably identify potential outliers. Since you’re working with aggregate profiles, the effective sample size is greatly reduced. I’m wondering whether the model can robustly handle this shift, given how drastically the number of profiles decreases after aggregation.
```python
    Compute expected path length c(n) for an array of node sizes, where c(n)
    is the average path length of an unsuccessful Binary Search Tree (BST) search (same as the
    iForest remaining-path normalizer and leaf correction term).
```
```diff
     Compute expected path length c(n) for an array of node sizes, where c(n)
-    is the average path length of an unsuccessful Binary Search Tree (BST) search (same as the
-    iForest remaining-path normalizer and leaf correction term).
+    is the average path length of an unsuccessful Binary Search Tree (BST) search (same
+    as the iForest remaining-path normalizer and leaf correction term).
```
line was too long
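For reference, the normalizer this docstring describes is the standard iForest c(n) from Liu et al. (2008); a self-contained sketch of it (our own helper, not the project's implementation):

```python
# Sketch of the iForest expected path length:
#   c(n) = 2 * H(n - 1) - 2 * (n - 1) / n,  H(i) ≈ ln(i) + Euler-Mascheroni constant
# with the conventions c(2) = 1 and c(n) = 0 for n < 2.
import numpy as np

def expected_path_length(n: np.ndarray) -> np.ndarray:
    n = np.asarray(n, dtype=float)
    c = np.zeros_like(n)
    big = n > 2
    c[big] = 2.0 * (np.log(n[big] - 1.0) + np.euler_gamma) - 2.0 * (n[big] - 1.0) / n[big]
    c[n == 2] = 1.0
    return c
```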
Computes anomaly data after aggregating to the treatment level in the pilot dataset. Treatment profiles are aggregated per plate, since the single-cell profiles for a single plate require a large amount of memory. The anomaly data is stored locally, consistent with much of the other JUMP data.