As a tool for cancer subtype prediction, Keraon uses features derived from cell-free DNA (cfDNA) in conjunction with PDX reference models to perform both classification and heterogeneous phenotype fraction estimation.
Keraon (Ceraon) is named for the Greek god of the ritualistic mixing of wine.
Like Keraon, this tool knows what went into the mix.
Keraon is part of The Pantheon, a suite of cfDNA processing and analysis tools. While each tool is useful independently, they are designed to work together:
| Tool | Purpose |
|---|---|
| Triton | cfDNA fragmentomic and phased-nucleosome feature extraction from BAM/CRAM files |
| Proteus | Prediction of tumor gene expression from cfDNA signal profiles |
| Keraon | Cancer subtype classification and mixture fraction estimation from cfDNA features |
Triton processes raw sequencing data into region-level features and signal profiles. Keraon then consumes Triton's tidy feature matrices to classify tumor subtypes and estimate their proportions. Proteus uses Triton signal profiles to predict underlying tumor gene expression.
- Description
- Quick Start
- Feature Recommendations
- NEPC Detection Thresholds
- Example Output
- Usage
- Worked Examples
- Outputs
- Methodology
- Supplied Reference Bases
- Requirements
- Contained Scripts
- Publications
- Citing Keraon
- Contact
- Acknowledgments
- License
Keraon utilizes features derived from cfDNA WGS to perform cancer phenotype classification (ctdPheno-GDA) and heterogeneous/fractional phenotype mixture estimation (Keraon). To do this, Keraon uses a panel of healthy donor and PDX samples which span the subtypes of interest as anchors. Bioinformatically pure circulating tumor DNA (ctDNA) features garnered from the PDX models, in conjunction with matched features from healthy donor cfDNA, are used to construct a latent space on which purity-adjusted predictions are based. Keraon yields both categorical classification scores and mixture estimates, and includes a de novo feature selection option based on simplex volume maximization (SVM) which works synergistically with both methods.
Keraon's primary use case is subtyping late-stage cancers and detecting potential trans-differentiation events. See published results for classifying and estimating fractions of castration-resistant prostate cancer (CRPC) adenocarcinoma (ARPC) from neuroendocrine-like (NEPC) phenotypes.
Note: The example config files and supplied bases in bases/ mirror the data described in the paper.
# 1. Install environment
micromamba create -f keraon_requirements.yml && micromamba activate keraon
# 2. Run inference with the supplied PC-ATAC basis
python Keraon.py \
-r bases/pc-atac_reference_model.pickle \
-i /path/to/your/TritonCompositeFM.tsv \
-t config/tfx_example.tsv \
-p config/palette_example.tsv

Keraon supports any tidy feature matrix (columns: sample, site, feature, value), but two paradigms have been tested:
Composite, differentially open ATAC regions between phenotypes (e.g., ARPC-exclusive vs. NEPC-exclusive chromatin accessibility sites from the paper). The features central-depth and window-depth from Triton capture nucleosome-level occupancy differences at these sites and are the general recommendation. Use these with pre-selected features (-f config/site_features_example.txt) and the supplied pc-atac basis.
TFBS regions from Triton can also be used for phenotyping when phenotype-specific ATAC sites are unavailable. De novo SVM feature selection (--build_reference_model without -f) selects the most discriminative TFBS site/feature combinations automatically. This approach is not fully production-tested but provides a good starting point. Use with the supplied pc-tfbs basis.
Keraon is the primary output; ctdPheno-GDA is essentially a legacy classifier provided for comparison. Internal testing shows better performance with Keraon on both standard-depth and ultra-low-pass (ULP) WGS, with lower resolution at ULP depths.
For calling the presence of NEPC using the supplied reference bases:
| Basis | Score column | Threshold | AUC | Notes |
|---|---|---|---|---|
| PC-ATAC | NEPC_fraction | > 0.041 | 0.807 | Fewer false positives; recommended |
| PC-TFBS | NEPC_fraction | > 0.047 | 0.845 | Higher AUC but more false positives |
A sample is called NEPC-positive when its NEPC_fraction exceeds the threshold.
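As a minimal sketch of how these cutoffs might be applied downstream, the snippet below flags samples in a predictions table whose NEPC_fraction strictly exceeds a basis-specific threshold. The helper `call_nepc`, the `THRESHOLDS` dictionary, and the assumption that the table is tab-separated with a `sample` column are illustrative and not part of Keraon itself.

```python
import csv
import io

# Thresholds copied from the table above; the dictionary itself is illustrative.
THRESHOLDS = {"PC-ATAC": 0.041, "PC-TFBS": 0.047}


def call_nepc(tsv_text, basis="PC-ATAC", sample_col="sample"):
    """Return {sample: bool} NEPC calls from a tab-separated predictions table.

    Assumes a header row containing `sample_col` (hypothetical column name)
    and an NEPC_fraction column.
    """
    threshold = THRESHOLDS[basis]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # A sample is positive only when its fraction strictly exceeds the threshold.
    return {row[sample_col]: float(row["NEPC_fraction"]) > threshold for row in reader}


# Toy table with made-up values, for illustration only.
example = "sample\tTFX\tNEPC_fraction\nS1\t0.20\t0.100\nS2\t0.15\t0.020\n"
calls = call_nepc(example, basis="PC-ATAC")
print(calls)
```

Keeping the thresholds keyed by basis makes it harder to apply the PC-ATAC cutoff to PC-TFBS output by accident.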
The figure below shows Keraon mixture predictions on the UW clinical cohort (standard-depth WGS) using the supplied pc-atac basis. Each column is a patient sample sorted by TFX. Stacked bars show estimated subtype fractions (top: total cfDNA fractions including Healthy; bottom: tumor burden composition).
| Flag | Description |
|---|---|
| -r, --reference_data | Required. Either a single pre-built reference_model.pickle (preferred) or one or more tidy .tsv feature matrices (requires -k). |
| -i, --input_data | Tidy-form test feature matrix .tsv with columns sample, site, feature, value. |
| -t, --tfx | .tsv with test sample names and estimated tumor fractions. Optional third column Truth for calibration/validation. |
| -k, --reference_key | Reference key .tsv (required when building from .tsv files). Three columns: sample, subtype, purity. One subtype must be Healthy with purity=0. |
| Flag | Description |
|---|---|
| --build_reference_model | Build and save a reference_model.pickle from reference .tsv + key. |
| --calibrate | Requires Truth labels; computes Youden thresholds + QC cutoffs; writes to results/calibration/. |
| --positive_label | Specifies which Truth label is the positive class for ROC/calibration. |
| --model_out | Path to write the built model (default: results/feature_analysis/reference_model.pickle). |
| Flag | Description |
|---|---|
| -f, --features | File with pre-selected site_feature combinations (one per line) to bypass SVM feature selection. Format: SiteName_feature-name (underscores in site/feature names are converted to dashes internally). |
| -p, --palette | .tsv mapping subtypes to hex colors. Subtype names must match those in -k and -t. |
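To illustrate the SiteName_feature-name convention used by -f, here is a small sketch that builds such keys and parses a features file. Both helper names are hypothetical, and the exact normalization Keraon applies internally is an assumption based only on the description above (underscores inside site or feature names become dashes, leaving a single underscore as the separator).

```python
def site_feature_key(site, feature):
    """Join site and feature as Site_feature, dashing any inner underscores.

    Illustrative only: mirrors the documented naming rule, not Keraon's code.
    """
    return f"{site.replace('_', '-')}_{feature.replace('_', '-')}"


def load_feature_list(lines):
    """Parse a features file: one key per line, blank lines ignored."""
    return [line.strip() for line in lines if line.strip()]


print(site_feature_key("AR", "central-depth"))
print(site_feature_key("NE_Exclusive", "window_depth"))
print(load_feature_list(["AR_central-depth\n", "\n", "NE-Exclusive_window-depth\n"]))
```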
When you have known differentially accessible regions (e.g., phenotype-specific ATAC sites), pass them with -f to skip SVM feature selection:
python Keraon.py \
-r /path/to/PDX_PC-ATAC_TritonCompositeFM.tsv \
/path/to/HD_PC-ATAC_TritonCompositeFM.tsv \
-k config/reference_key_example.tsv \
-p config/palette_example.tsv \
-f config/site_features_example.txt \
--build_reference_model

This produces results/feature_analysis/reference_model.pickle. Copy it to bases/ for reuse. The supplied bases/pc-atac_reference_model.pickle was generated this way.
When using TFBS or other high-dimensional feature sets without a priori site selection, omit -f and let the SVM stability selection choose features automatically:
python Keraon.py \
-r /path/to/PDX_TFBS_TritonCompositeFM.tsv \
/path/to/HD_TFBS_TritonCompositeFM.tsv \
-k config/reference_key_example.tsv \
-p config/palette_example.tsv \
--build_reference_model

The SVM evaluates a grid of hyperparameter combinations via bootstrapped stability selection (see Methodology), selects stable features, and saves the frozen hyperparameters and feature frequencies. The supplied bases/pc-tfbs_reference_model.pickle was generated this way.
Once you have a .pickle basis, inference requires only test data:
python Keraon.py \
-r bases/pc-atac_reference_model.pickle \
-i /path/to/your/TritonCompositeFM.tsv \
-t config/tfx_example.tsv \
-p config/palette_example.tsv

If Truth labels are present in the TFX file (third column), ROC curves are automatically generated.
Add --calibrate when Truth labels are available to compute optimal thresholds:
python Keraon.py \
-r bases/pc-atac_reference_model.pickle \
-i /path/to/your/TritonCompositeFM.tsv \
-t config/tfx_example.tsv \
-p config/palette_example.tsv \
--calibrate

This saves a reference_model.calibrated.pickle with embedded thresholds, plus full calibration reports and plots in results/calibration/.
To apply Keraon to a different cancer type or feature set:
- Prepare reference data: Run Triton on your PDX/reference samples and healthy donors. Ensure the feature matrix is in tidy form (sample, site, feature, value).
- Prepare a reference key: Create a .tsv with columns sample, subtype, purity. Include at least 3 samples per subtype, and one subtype must be Healthy (purity=0). PDX samples typically have purity=1.
- Build: Run with --build_reference_model (and optionally -f with your pre-selected features).
- Infer: Use the resulting .pickle with -r for all future inference runs.
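Before building a new basis it can help to sanity-check the reference key against the rules above. The following is a minimal sketch; `validate_reference_key` is a hypothetical helper, and it assumes the key has a header row naming the three columns, which may not match your file.

```python
import csv
import io
from collections import Counter


def validate_reference_key(tsv_text, min_per_subtype=3):
    """Return a list of problems found in a reference key (empty if none).

    Assumes a tab-separated file with a header row: sample, subtype, purity.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows or set(rows[0]) != {"sample", "subtype", "purity"}:
        return ["expected columns: sample, subtype, purity"]
    problems = []
    counts = Counter(row["subtype"] for row in rows)
    if "Healthy" not in counts:
        problems.append("no Healthy subtype")
    for row in rows:
        # Healthy reference samples must carry purity 0.
        if row["subtype"] == "Healthy" and float(row["purity"]) != 0.0:
            problems.append(f"{row['sample']}: Healthy must have purity=0")
    for subtype, n in counts.items():
        if n < min_per_subtype:
            problems.append(f"{subtype}: only {n} samples (need {min_per_subtype})")
    return problems


# Toy keys with made-up sample names.
good_key = "sample\tsubtype\tpurity\n" + "".join(
    f"{name}{i}\t{subtype}\t{purity}\n"
    for name, subtype, purity in [("HD", "Healthy", 0), ("A", "ARPC", 1), ("N", "NEPC", 1)]
    for i in range(3)
)
bad_key = "sample\tsubtype\tpurity\nA1\tARPC\t1\nA2\tARPC\t1\n"
print(validate_reference_key(good_key))
print(validate_reference_key(bad_key))
```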
results/
├── feature_analysis/
│ ├── reference_model.pickle # reusable ReferenceModel artifact
│ ├── reference_model.calibrated.pickle # (optional) model with embedded thresholds
│ ├── stability_selection.tsv # per-feature bootstrap selection frequencies (SVM only)
│ ├── svm_hyperparams_frozen.json # frozen SVM objective hyperparameters (SVM only)
│ ├── PCA_final-basis_wTestSamples.pdf # PCA of final feature set with test samples projected
│ ├── inference_plots/ # ROC and score distribution plots (if Truth provided)
│ └── feature_distributions/
│ ├── reference_features/ # per-feature distribution PDFs (reference, post-scaling)
│ ├── test_features/ # per-feature distribution PDFs (test, post-scaling)
│ └── final-basis_site-features/ # per-feature PDFs for reference + test overlay
│
├── ctdPheno_class-predictions/
│ ├── ctdPheno_class-predictions.tsv # NLL-based scoring and posterior class predictions
│ ├── ROC.pdf # (optional) ROC curve if Truth provided
│ └── <subtype>_predictions.pdf # per-subtype stick-and-ball visualization
│
├── keraon_mixture-predictions/
│ ├── Keraon_mixture-predictions.tsv # subtype fractions, burdens, and QC metrics
│ ├── factor_loadings.tsv # basis vectors (V) and off-target axes (U) per feature
│ ├── ROC_fraction.pdf # (optional) ROC curve if Truth provided
│ └── Keraon_mixture-predictions.pdf # stacked-bar visualization of fractions/burdens
│
└── calibration/ # (only with --calibrate)
├── calibration_predictions.tsv # merged predictions with Truth
├── calibration_thresholds.json # Youden thresholds + bootstrap CIs
├── calibration_report.tsv # summary statistics
└── calibration_plots/ # ROC + score distribution plots
| Column | Description |
|---|---|
| TFX | Provided tumor fraction for this sample |
| logp_<subtype> | Log-likelihood of the sample under the TFX-shifted mixture distribution for each subtype |
| post_<subtype> | Posterior probability of each subtype (softmax over log-likelihoods with class priors) |
| predicted_class | Subtype with the highest posterior probability |
Interpretation: post_NEPC close to 1.0 means the sample's feature profile, accounting for its tumor fraction, is most consistent with the NEPC reference. The posterior values across subtypes sum to 1.0 for each sample.
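The relationship between the logp_ and post_ columns can be sketched as a numerically stable softmax over log-likelihoods plus log-priors. The function name `posteriors` and the uniform default prior are assumptions made for illustration; the priors Keraon actually uses are not stated here.

```python
import math


def posteriors(logp, priors=None):
    """Softmax over log-likelihoods plus log-priors, computed stably.

    `logp` maps subtype -> log-likelihood. Uniform priors are assumed
    when none are given (an illustrative default).
    """
    subtypes = list(logp)
    if priors is None:
        priors = {s: 1.0 / len(subtypes) for s in subtypes}
    scores = {s: logp[s] + math.log(priors[s]) for s in subtypes}
    top = max(scores.values())  # subtract the maximum to avoid underflow
    weights = {s: math.exp(v - top) for s, v in scores.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Made-up log-likelihoods for a single sample.
post = posteriors({"ARPC": -12.0, "NEPC": -10.0})
predicted_class = max(post, key=post.get)
print(post, predicted_class)
```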
| Column | Description |
|---|---|
| TFX | Provided tumor fraction |
| <subtype>_burden | Estimated fraction of the tumor signal attributable to this subtype. All burdens (subtype + RA) sum to 1.0. |
| <subtype>_fraction | TFX × <subtype>_burden: the fraction of total cfDNA from this subtype |
| Healthy_fraction | 1 − TFX: the cfDNA fraction from healthy cells |
| RA<i>_coeff | Signed projection coefficient onto residual axis i (off-target variation) |
| RA<i>_energy | Squared magnitude of the RAi component |
| RA<i>_burden | Fraction of tumor burden from off-target axis i |
| RA<i>_fraction | TFX × RA<i>_burden |
| energy_subspace | Squared norm of the sample's projection into the subtype span |
| energy_offtarget | Squared norm of the off-target (RA) component |
| energy_residual_perp | Squared norm of the unexplained residual (orthogonal to both subtype and RA subspaces) |
| residual_perp_fraction | energy_residual_perp / (energy_subspace + energy_offtarget + energy_residual_perp): fraction of total signal unexplained by the model. High values suggest the sample does not conform to the reference subtypes. |
| subtype_cone_misfit | NNLS reconstruction error normalized by subspace energy. Values near 0 mean the sample lies within the simplex cone; higher values indicate the sample direction is not well-described by the reference subtype directions. |
| FS_Region | Simplex if NNLS coefficients sum to ≤ 1 (sample inside the simplex), else Non-Simplex. |
Interpretation: For each sample, all *_fraction columns (disease subtypes + RA axes + Healthy) sum to 1.0. The *_burden columns describe only the tumor compartment and also sum to 1.0. A sample with NEPC_fraction = 0.10 and TFX = 0.20 has 50% of its tumor burden explained by the NEPC direction (NEPC_burden = 0.50).
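The bookkeeping between burdens and fractions is simple enough to check by hand. The sketch below reproduces the worked example (TFX = 0.20, NEPC_burden = 0.50); the ARPC and RA1 burdens are made-up values chosen so that the burdens sum to 1.0, and `to_fractions` is a hypothetical helper.

```python
def to_fractions(tfx, burdens):
    """Convert tumor-compartment burdens into total-cfDNA fractions.

    `burdens` maps a subtype or RA axis name to its burden; they must sum to 1.0.
    """
    if abs(sum(burdens.values()) - 1.0) > 1e-9:
        raise ValueError("burdens must sum to 1.0")
    fractions = {f"{name}_fraction": tfx * b for name, b in burdens.items()}
    fractions["Healthy_fraction"] = 1.0 - tfx
    return fractions


# NEPC_burden = 0.50 at TFX = 0.20 gives NEPC_fraction = 0.10.
fr = to_fractions(0.20, {"ARPC": 0.45, "NEPC": 0.50, "RA1": 0.05})
print(fr)
```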
Each row is a feature (e.g., AR_central-depth); each column is a subtype direction or residual axis. Values are the components of the unit-length basis vectors V (disease subtypes) and U (RA off-target axes) in the original feature space. Features with large absolute loadings on a given axis contribute most to that axis's signal. This is analogous to PCA loadings and can be used to interpret which features drive each subtype's signature.
When --calibrate is used with Truth labels, Keraon computes Youden-optimal thresholds via bootstrap (500 resamples). The calibration_thresholds.json contains the median threshold, 95% CI, and associated Youden J statistic for both ctdPheno and Keraon scores.
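A bootstrap Youden calibration of this kind can be sketched with the standard library alone. This is an illustration of the general technique, not Keraon's implementation: the function names, the percentile-based interval, and the rule of skipping resamples that lack one class are all assumptions.

```python
import random
import statistics


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is called positive when score > threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):  # every observed score is a candidate cutoff
        sens = sum(s > t for s in pos) / len(pos)
        spec = sum(s <= t for s in neg) / len(neg)
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def bootstrap_threshold(scores, labels, n_boot=500, seed=0):
    """Median Youden threshold and a percentile 95% interval over resamples."""
    rng = random.Random(seed)
    n = len(scores)
    thresholds = []
    while len(thresholds) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # skip resamples missing a class (assumption)
        t, _ = youden_threshold([scores[i] for i in idx], ys)
        thresholds.append(t)
    thresholds.sort()
    lo = thresholds[int(0.025 * (n_boot - 1))]
    hi = thresholds[int(0.975 * (n_boot - 1))]
    return statistics.median(thresholds), (lo, hi)


# Made-up scores and Truth labels (1 = positive class).
scores = [0.01, 0.02, 0.03, 0.10, 0.12, 0.20]
labels = [0, 0, 0, 1, 1, 1]
print(youden_threshold(scores, labels))
median_t, (ci_lo, ci_hi) = bootstrap_threshold(scores, labels)
```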
The raw tidy feature matrix undergoes a multi-stage transformation implemented in load_triton_fm():
- Optional point transforms: per-feature monotone transforms (e.g., log(1 + max(x, 0)) for entropy and amplitude features) are applied to improve normality before any centering/scaling.
- Feature-wise standardization: each (site, feature) pair is standardized using parameters estimated across the reference samples. These parameters are stored in reference_model.pickle and applied identically to test data.
- Pivoting: the tidy-form data is pivoted to a (samples × features) matrix where each column is site_feature.
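The three stages can be sketched on plain tuples as follows. This is not `load_triton_fm()` itself: the helper names are hypothetical, and standardizing with the reference mean and sample standard deviation is an assumption (the scaler Keraon uses may differ).

```python
import math
import statistics
from collections import defaultdict


def point_transform(rows, log_features=frozenset()):
    """Apply log(1 + max(x, 0)) to the named features; leave others unchanged."""
    return [
        (s, site, f, math.log1p(max(v, 0.0)) if f in log_features else v)
        for s, site, f, v in rows
    ]


def fit_scalers(ref_rows):
    """Per-(site, feature) mean and SD from reference rows (assumed z-score scaling)."""
    grouped = defaultdict(list)
    for _sample, site, feature, value in ref_rows:
        grouped[(site, feature)].append(value)
    return {k: (statistics.mean(v), statistics.stdev(v)) for k, v in grouped.items()}


def scale_and_pivot(rows, scalers):
    """Scale with reference parameters, then pivot to {sample: {site_feature: z}}."""
    matrix = defaultdict(dict)
    for sample, site, feature, value in rows:
        mean, sd = scalers[(site, feature)]
        matrix[sample][f"{site}_{feature}"] = (value - mean) / (sd or 1.0)
    return dict(matrix)


# Toy rows: (sample, site, feature, value); all values are made up.
reference = [("R1", "AR", "central-depth", 1.0),
             ("R2", "AR", "central-depth", 2.0),
             ("R3", "AR", "central-depth", 3.0)]
test_rows = [("T1", "AR", "central-depth", 4.0)]
scalers = fit_scalers(point_transform(reference))
pivoted = scale_and_pivot(point_transform(test_rows), scalers)
print(pivoted)
```

Fitting the scalers on the reference only, and reusing them for test samples, is what keeps test data on the same scale as the basis.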
When no pre-selected features are provided (-f omitted), Keraon performs de novo feature selection by choosing a subset of features that maximizes the geometric separation of subtype centroids in the feature space.
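For a candidate feature mask, the objective combines the following terms, each evaluated on the selected features:

| Term | Definition |
|---|---|
| Simplex log-volume | Log-volume of the simplex spanned by the subtype centroids |
| Minimum edge | Minimum pairwise Euclidean distance between centroids (margin term) |
| Harmonic-mean edge | Harmonic mean of all pairwise centroid distances (global edge regularity) |
| Within-class scatter | Mean within-class scatter (root-sum of diagonal variances, whitened) |
| Feature correlation | Mean absolute off-diagonal correlation among selected features |
| Term weights ($\alpha$ and the remaining coefficients) | Tunable hyperparameters (see Stability Selection below) |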
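The whitening matrix is computed from the regularized reference covariance (covariance regularization and matrix square roots are implemented in utils/whitening.py).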
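The simplex log-volume at the heart of this objective can be computed from a Gram determinant of the edge vectors. The sketch below shows only that one term, under the assumption that centroids are already whitened and stacked one per row; `simplex_log_volume` is a hypothetical name and the full weighted objective is not reproduced.

```python
import math

import numpy as np


def simplex_log_volume(centroids):
    """Log-volume of the simplex whose vertices are the rows of `centroids`.

    For k+1 vertices, volume = sqrt(det(E E^T)) / k!, where the rows of E are
    the edge vectors from the first vertex to each of the others.
    """
    centroids = np.asarray(centroids, dtype=float)
    edges = centroids[1:] - centroids[0]
    k = edges.shape[0]
    sign, logdet = np.linalg.slogdet(edges @ edges.T)
    if sign <= 0:
        return float("-inf")  # degenerate (zero-volume) simplex
    return 0.5 * logdet - math.lgamma(k + 1)  # lgamma(k + 1) == log(k!)


# Three made-up centroids over three features; the mask keeps the first two.
centroids = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
mask = np.array([True, True, False])
log_vol = simplex_log_volume(centroids[:, mask])
print(log_vol)  # log-area of a unit right triangle
```

Working with the log-determinant rather than the raw volume avoids underflow when many features are selected.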
- Seed: select one feature per non-Healthy subtype based on the highest combined Cohen's $d$ × separation score (Mann-Whitney U tie-breaking).
- Prune: if the initial set exceeds $k = |\text{subtypes}| - 1$ features, iteratively remove the feature whose removal least reduces the objective until exactly $k$ remain.
- Refine: for each selected feature, scan all unselected features for a replacement that improves the objective. Repeat until no improving swap exists.
- Grow: greedily add the single unselected feature yielding the largest objective increase. Stop when the relative gain is $< 0.1\%$, or when the hard feature cap is reached (default: $\max(5 \times |\text{subtypes}|, 50)$).
To avoid overfitting the hyperparameters, Keraon uses bootstrapped stability selection:

- For each combination in the hyperparameter grid (default: $2^5 = 32$ combinations):
  - Draw $B = 100$ stratified bootstrap resamples of the reference data (80% subsample rate per subtype).
  - Run the full SVM greedy algorithm on each bootstrap.
  - Record which features were selected in each bootstrap.
- Compute the selection frequency $\hat{\pi}_f$ for each feature (fraction of bootstraps that selected it).
- Compute mean pairwise Jaccard stability across bootstrap replicates.
- Features with $\hat{\pi}_f > \theta$ (default $\theta = 0.10$) form the stable set.
- The hyperparameter combination with the highest stability, tie-broken by full-data SVM objective and then by smallest feature set, is chosen.
The winning hyperparameters, selection frequencies, and stable feature list are frozen into the reference_model.pickle. Bootstrap resamples are parallelized across CPU cores.
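The two summary statistics used in stability selection, per-feature selection frequency and mean pairwise Jaccard stability, can be sketched as below. The function names are hypothetical and the bootstrap selections are made up; the greedy selection that would produce them is not shown.

```python
from itertools import combinations


def selection_frequencies(selections):
    """Fraction of bootstrap replicates in which each feature was selected."""
    n = len(selections)
    counts = {}
    for chosen in selections:
        for feature in set(chosen):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: c / n for feature, c in counts.items()}


def mean_pairwise_jaccard(selections):
    """Average Jaccard similarity over all pairs of (non-empty) selected sets."""
    sets = [set(s) for s in selections]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


def stable_set(freqs, theta=0.10):
    """Features whose selection frequency strictly exceeds theta."""
    return sorted(f for f, p in freqs.items() if p > theta)


# Four made-up bootstrap replicates.
selections = [["A", "B"], ["A", "C"], ["A", "B"], ["A", "B"]]
freqs = selection_frequencies(selections)
stability = mean_pairwise_jaccard(selections)
print(freqs, stability, stable_set(freqs, theta=0.5))
```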
ctdPheno-GDA performs purity-adjusted Gaussian discriminant analysis in a whitened feature space. (Note: Keraon mixture estimation is now the recommended primary output; ctdPheno is retained for compatibility and as a complementary categorical classifier.)
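The global reference covariance is regularized and used to whiten the feature space.

All class means are expressed in this whitened space.

For a test sample $\mathbf{x}$ with tumor fraction $t$, the expected distribution under subtype $k$ mixes the subtype and Healthy components:

$$\boldsymbol{\mu}_{\text{mix},k} = t\,\boldsymbol{\mu}_k + (1 - t)\,\boldsymbol{\mu}_{\text{Healthy}}, \qquad \boldsymbol{\Sigma}_{\text{mix},k} = t^2\,\boldsymbol{\Sigma}_k + (1 - t)^2\,\boldsymbol{\Sigma}_{\text{Healthy}}$$

The covariance uses squared weights because the healthy and tumor components are treated as independent random variables contributing to the observed cfDNA signal.

The log-likelihood under a multivariate normal is:

$$\log p(\mathbf{x} \mid k) = -\tfrac{1}{2}\left[(\mathbf{x} - \boldsymbol{\mu}_{\text{mix},k})^\top \boldsymbol{\Sigma}_{\text{mix},k}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{\text{mix},k}) + \log\lvert\boldsymbol{\Sigma}_{\text{mix},k}\rvert + d\log 2\pi\right]$$

The posterior probability of class $k$ is:

$$P(k \mid \mathbf{x}) = \frac{\pi_k\,p(\mathbf{x} \mid k)}{\sum_j \pi_j\,p(\mathbf{x} \mid j)}$$

where $\pi_k$ is the prior probability of class $k$ and $d$ is the number of features.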
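A purity-adjusted Gaussian log-likelihood of this kind can be sketched with NumPy. The sketch assumes the mixture mean is the TFX-weighted average of the subtype and Healthy means, and that the mixture covariance uses squared weights as described above; `mixture_logpdf` and its arguments are hypothetical and this is not the code in utils/ctdpheno_gda.py.

```python
import numpy as np


def mixture_logpdf(x, t, mu_k, cov_k, mu_h, cov_h):
    """Log-density of x under a TFX-shifted Gaussian for subtype k.

    Assumed form: mean = t*mu_k + (1-t)*mu_h, cov = t^2*cov_k + (1-t)^2*cov_h.
    """
    x, mu_k, mu_h = (np.asarray(a, dtype=float) for a in (x, mu_k, mu_h))
    cov_k = np.asarray(cov_k, dtype=float)
    cov_h = np.asarray(cov_h, dtype=float)
    mean = t * mu_k + (1.0 - t) * mu_h
    cov = t**2 * cov_k + (1.0 - t) ** 2 * cov_h
    diff = x - mean
    _sign, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    d = x.size
    return float(-0.5 * (maha + logdet + d * np.log(2.0 * np.pi)))


# One-dimensional toy case with made-up parameters.
logp = mixture_logpdf([1.0], 0.5, [2.0], [[1.0]], [0.0], [[1.0]])
print(logp)
```

Solving the linear system instead of inverting the covariance is the usual choice for numerical stability.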
Keraon performs a geometric decomposition of each test sample's feature vector into subtype, off-target, and residual components.
- Subtype mean vectors: For each subtype $i$, compute the mean feature vector $\boldsymbol{\mu}_i$ across reference samples.
- Directional vectors: Subtract the Healthy centroid to obtain disease directions: $\mathbf{v}_i = \boldsymbol{\mu}_i - \boldsymbol{\mu}_{\text{Healthy}}$.
- Orthonormal basis via QR: Stack the $\mathbf{v}_i$ as columns of $\mathbf{V}$ and compute $\mathbf{V} = \mathbf{Q}_V\mathbf{R}$. The orthonormal columns of $\mathbf{Q}_V$ span the subtype subspace. The projection matrices are $\mathbf{P} = \mathbf{Q}_V\mathbf{Q}_V^\top$ and $\mathbf{P}_{\perp} = \mathbf{I} - \mathbf{P}$.
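Reference samples are centered by $\boldsymbol{\mu}_{\text{Healthy}}$ and projected onto the orthogonal complement with $\mathbf{P}_{\perp}$; the off-target residual axes (RA), collected as the orthonormal columns of $\mathbf{U}$, are estimated from this remaining variation.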
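The subtype basis construction can be sketched with NumPy's reduced QR factorization. The function `build_subtype_basis` and the toy centroids are illustrative assumptions; estimating the off-target axes $\mathbf{U}$ is not shown because the procedure is not specified here.

```python
import numpy as np


def build_subtype_basis(means, healthy="Healthy"):
    """Return (subtypes, V, Q, P, P_perp) from a dict of subtype mean vectors.

    V stacks the disease directions (subtype mean minus Healthy mean) as
    columns; Q is an orthonormal basis of their span from a reduced QR.
    """
    mu_h = np.asarray(means[healthy], dtype=float)
    subtypes = [s for s in means if s != healthy]
    V = np.column_stack([np.asarray(means[s], dtype=float) - mu_h for s in subtypes])
    Q, _R = np.linalg.qr(V)            # reduced QR: Q has one column per subtype
    P = Q @ Q.T                        # projector onto the subtype subspace
    P_perp = np.eye(V.shape[0]) - P    # projector onto its orthogonal complement
    return subtypes, V, Q, P, P_perp


# Made-up centroids in a 3-feature space.
means = {"Healthy": [0.0, 0.0, 0.0], "ARPC": [1.0, 0.0, 0.0], "NEPC": [1.0, 1.0, 0.0]}
subtypes, V, Q, P, P_perp = build_subtype_basis(means)
print(subtypes)
print(np.round(P, 6))
```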
For a test sample $\mathbf{x}$ with tumor fraction $t$:

- Center: $\mathbf{z} = \mathbf{x} - \boldsymbol{\mu}_{\text{Healthy}}$
- Subtype projection: $\mathbf{z}_{\parallel} = \mathbf{P}\,\mathbf{z}$, with energy $E_S = \|\mathbf{z}_{\parallel}\|^2$
- Off-target projection: $\mathbf{z}_{\text{off}} = \mathbf{U}\,(\mathbf{U}^\top \mathbf{P}_{\perp}\mathbf{z})$, with energy $E_O = \|\mathbf{z}_{\text{off}}\|^2$
- Residual: $\mathbf{z}_{\text{res}} = \mathbf{P}_{\perp}\mathbf{z} - \mathbf{z}_{\text{off}}$, with energy $E_R = \|\mathbf{z}_{\text{res}}\|^2$

Energy decomposes exactly: $\|\mathbf{z}\|^2 = E_S + E_O + E_R$.

- Modeled energy: $E_M = E_S + E_O$
- Subtype burden: $b_S = E_S / E_M$, off-target burden: $b_O = E_O / E_M$
- Subtype partitioning via NNLS: Solve $\min_{\mathbf{c} \geq 0} \|\mathbf{V}\mathbf{c} - \mathbf{z}_{\parallel}\|^2$. The NNLS coefficients $\mathbf{c}$ are re-weighted by the Gram matrix $\mathbf{G} = \mathbf{V}^\top\mathbf{V}$ to produce Gram-weighted subtype burdens that sum to $b_S$.
- Normalization: All burdens (subtype + RA) are normalized to sum to exactly 1.0.
- Fractions: Each burden is scaled by $t$ (tumor fraction) to get the cfDNA fraction: $f_i = t \cdot b_i$. Healthy fraction is $1 - t$. All fractions sum to 1.0.
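The energy decomposition for a single sample can be sketched as follows, given a subtype projector $\mathbf{P}$ and orthonormal off-target axes $\mathbf{U}$ lying in the orthogonal complement. The function `decompose`, the toy matrices, and the returned key names are hypothetical; the NNLS partitioning of the subtype burden across individual subtypes and its Gram re-weighting are deliberately omitted.

```python
import numpy as np


def decompose(x, mu_healthy, P, U, t):
    """Split a sample into subtype, off-target, and residual energies.

    Returns aggregate burdens and cfDNA fractions only; per-subtype burdens
    (the NNLS step) are not computed in this sketch.
    """
    z = np.asarray(x, dtype=float) - np.asarray(mu_healthy, dtype=float)
    P_perp = np.eye(P.shape[0]) - P
    z_par = P @ z                      # component inside the subtype subspace
    z_off = U @ (U.T @ (P_perp @ z))   # component along the off-target axes
    z_res = P_perp @ z - z_off         # unexplained residual
    E_S, E_O, E_R = (float(v @ v) for v in (z_par, z_off, z_res))
    E_M = E_S + E_O                    # modeled energy
    b_S, b_O = E_S / E_M, E_O / E_M
    return {
        "E_S": E_S, "E_O": E_O, "E_R": E_R,
        "b_S": b_S, "b_O": b_O,
        "residual_perp_fraction": E_R / (E_S + E_O + E_R),
        "subtype_fraction": t * b_S,
        "offtarget_fraction": t * b_O,
        "healthy_fraction": 1.0 - t,
    }


# Toy 4-feature example: subtype span = first two axes, one off-target axis.
P = np.diag([1.0, 1.0, 0.0, 0.0])
U = np.array([[0.0], [0.0], [1.0], [0.0]])
out = decompose([1.0, 1.0, 1.0, 1.0], np.zeros(4), P, U, t=0.3)
print(out)
```

In this toy case the three energies (2, 1, and 1) add up to the squared norm of the centered sample, which is the exact decomposition stated above.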
The bases/ directory contains two pre-built reference models that can be used directly for ARPC vs. NEPC phenotyping:
| File | Features | Sites | Description |
|---|---|---|---|
| pc-atac_reference_model.pickle | central-depth, window-depth | AD-Exclusive, NE-Exclusive composite ATAC sites | Pre-selected phenotype-specific features; matches the paper |
| pc-tfbs_reference_model.pickle | SVM-selected | Transcription factor binding sites | De novo feature selection from TFBS features |
Both were built using the same LuCaP PDX + Healthy Donor reference panel described in the paper (21 ARPC, 8 NEPC, 20 Healthy; see config/reference_key_example.tsv).
Keraon uses standard scientific Python libraries and has been tested on Python 3.10-3.11. To create a tested environment:
# Install Micromamba (if not already installed):
# https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html
# Create and activate the environment:
micromamba create -f keraon_requirements.yml
micromamba activate keraon
# Verify:
micromamba list

Key dependencies: NumPy, Pandas, SciPy, scikit-learn, Matplotlib, Seaborn.
| Script | Description |
|---|---|
| Keraon.py | Primary entry point for model building, inference, and calibration |
| utils/keraon_utils.py | Data loading, scaling, pivoting, and palette management |
| utils/keraon_helpers.py | SVM feature selection, simplex geometry, and stability selection |
| utils/keraon_model.py | Keraon mixture model fitting (basis construction) and prediction (decomposition) |
| utils/ctdpheno_gda.py | ctdPheno-GDA classifier (whitened Gaussian discriminant analysis) |
| utils/whitening.py | Covariance regularization and matrix square-root utilities |
| utils/reference_builder.py | Orchestrates model building (feature selection → model fitting → serialization) |
| utils/reference_model.py | ReferenceModel dataclass, pickle I/O, and TSV/JSON writers |
| utils/calibration.py | Youden threshold computation and bootstrap calibration |
| utils/calibration_plots.py | ROC and score distribution plotting |
| utils/keraon_plotters.py | PCA, ctdPheno stick-and-ball, Keraon stacked-bar, and feature distribution plots |
Nucleosome Patterns in Circulating Tumor DNA Reveal Transcriptional Regulation of Advanced Prostate Cancer Phenotypes
DOI: 10.1158/2159-8290.CD-22-0692
If you use Keraon in your research, please cite:
"Nucleosome Patterns in Circulating Tumor DNA Reveal Transcriptional Regulation of Advanced Prostate Cancer Phenotypes." Cancer Discovery 13(10), 2304–2325 (2023). https://doi.org/10.1158/2159-8290.CD-22-0692
If you have any questions or feedback, please contact me here on GitHub or at:
Email: rpatton@fredhutch.org
Keraon is developed and maintained by Robert D. Patton in the Gavin Ha Lab, Fred Hutchinson Cancer Center.
The MIT License (MIT)
Copyright (c) 2022 Fred Hutchinson Cancer Center
Permission is hereby granted, free of charge, to any government or not-for-profit entity, or to any person employed at one of the foregoing (each, an "Academic Licensee") who obtains a copy of this software and associated documentation files (the "Software"), to deal in the Software purely for non-commercial research and educational purposes, including the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or share copies of the Software, and to permit other Academic Licensees to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
No Academic Licensee shall be permitted to sell or use the Software or derivatives thereof in any service for commercial benefit. For the avoidance of doubt, any use by or transfer to a commercial entity shall be considered a commercial use and will require a separate license with Fred Hutchinson Cancer Center.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


