Submitted discriminator miners are evaluated against a subset of the data sources listed below. Models are evaluated on cloud infrastructure -- miners do not need to host hardware for inference. A portion of the evaluation data comes from generative miners, who are rewarded based on their ability to submit data that both passes validator sanity checks (prompt alignment, etc.) and fools discriminators in benchmark runs.
Each modality (image, video, audio) is scored independently using the sn34_score metric, which combines discrimination accuracy (MCC) with calibration quality (Brier score).
## Evaluation Datasets
Benchmark datasets are regularly expanded. Each modality includes a mix of real, synthetic, and semi-synthetic content from diverse sources (including continuously-updated GAS-Station data from generative miners).
Public datasets (available for training via gasbench):
- Image: image_datasets.yaml
- Video: video_datasets.yaml
- Audio: audio_datasets.yaml
Holdout datasets: In addition to the public datasets above, each benchmark round includes holdout datasets that are not publicly available during the round. Holdout data is critical to ensure models generalize well and to mitigate overfitting. At the end of each round, many of the holdout datasets are released and added to the public gasbench datasets for future training. Some holdout datasets cannot be released publicly due to licensing or other restrictions.
## Generative Models
The following models are run by validators to produce a continual, fresh stream of synthetic and semi-synthetic data. The outputs of these models are uploaded at regular intervals to public datasets in the GAS-Station Hugging Face org for miner training and evaluation.
- stabilityai/stable-diffusion-xl-base-1.0
- SG161222/RealVisXL_V4.0
- Corcelio/mobius
- prompthero/openjourney-v4
- cagliostrolab/animagine-xl-3.1
- runwayml/stable-diffusion-v1-5 + Kvikontent/midjourney-v6 LoRA
- black-forest-labs/FLUX.1-dev
- DeepFloyd/IF
- deepseek-ai/Janus-Pro-7B
- THUDM/CogView4-6B
The generator incentive mechanism combines two components: a base reward for passing data validation checks, and a multiplier based on adversarial performance against discriminators.
Generators receive a base reward based on their data verification pass rate:

$$R_{\text{base}} = p \cdot \frac{\min(n,\ 10)}{10}$$

Where:

- $p$ = pass rate (proportion of generated content that passes validation)
- $n$ = number of verified samples ($\min(n, 10)$ creates a ramp-up of incentive for the first 10 samples)
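The pass-rate ramp described above can be sketched as follows. The exact scaling (pass rate times a linear ramp over the first 10 samples) is an inference from the stated definitions, and the function name is illustrative, not the subnet's actual API:

```python
def base_reward(pass_rate: float, n_verified: int, ramp: int = 10) -> float:
    """Base generator reward: pass rate scaled by a sample-count ramp.

    Assumed form: reward = p * min(n, ramp) / ramp, so incentive ramps
    up linearly over the first `ramp` verified samples and is flat after.
    """
    return pass_rate * min(n_verified, ramp) / ramp

# With fewer than 10 verified samples the reward is discounted:
print(base_reward(0.9, 5))   # 0.45
print(base_reward(0.9, 10))  # 0.9
print(base_reward(0.9, 50))  # 0.9
```

Note that beyond 10 verified samples, only the pass rate matters; the ramp exists to discourage cherry-picking a tiny number of samples.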
Generators earn additional rewards by successfully fooling discriminators. The multiplier is calculated as:

$$M = 1 + f \cdot s$$

Where:

- $f$ = fool rate $= \frac{N_{\text{fooled}}}{N_{\text{fooled}} + N_{\text{not fooled}}}$
- $s$ = sample size multiplier
The sample size multiplier encourages generators to be evaluated on more samples, similar to the sample size ramp used in the base reward:

$$
s =
\begin{cases}
\dfrac{c}{20} & \text{if } c < 20 \\[6pt]
\min\!\left(2.0,\ 1 + \ln\dfrac{c}{20}\right) & \text{if } c \geq 20
\end{cases}
$$

Where:

- $c$ = total evaluation count (fooled + not fooled)
- A reference count of 20 gives a multiplier of 1.0
- Sample sizes below 20 are penalized
- Sample sizes above 20 receive a logarithmic bonus, capped at 2.0x
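A sketch of one implementation consistent with the three properties above (reference count of 20 yields 1.0, linear penalty below it, logarithmic bonus capped at 2.0x above it). The natural-log base and exact piecewise form are assumptions, not the subnet's confirmed formula:

```python
import math

REFERENCE_COUNT = 20  # evaluation count that yields a multiplier of 1.0

def sample_size_multiplier(c: int) -> float:
    """Sample-size multiplier s for a generator evaluated on c samples.

    Assumed form: linear ramp below the reference count, natural-log
    bonus above it, capped at 2.0x. The log base is not specified in
    the source and is an assumption here.
    """
    if c <= 0:
        return 0.0
    if c < REFERENCE_COUNT:
        return c / REFERENCE_COUNT  # penalize small sample sizes
    return min(2.0, 1.0 + math.log(c / REFERENCE_COUNT))  # capped log bonus

print(sample_size_multiplier(10))  # 0.5
print(sample_size_multiplier(20))  # 1.0
```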
The total generator reward combines both components:

$$R_{\text{total}} = R_{\text{base}} \cdot (1 + f \cdot s)$$
This design incentivizes generators to:
- Produce high-quality, valid content (base reward)
- Create adversarially robust content that can fool discriminators (multiplier)
- Participate in more evaluations for sample size bonuses
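Putting the pieces together, a hedged sketch of the total reward. The multiplicative form (base reward times an adversarial multiplier of $1 + f \cdot s$) is an assumption consistent with the description of a base reward plus a multiplier; all names are illustrative:

```python
def total_generator_reward(pass_rate: float, n_verified: int,
                           fool_rate: float, sample_multiplier: float) -> float:
    """Total reward = base reward scaled by the adversarial multiplier.

    Assumed combination: R = R_base * (1 + f * s), so a generator that
    never fools a discriminator (f = 0) still earns its base reward,
    while adversarial success scales it up.
    """
    base = pass_rate * min(n_verified, 10) / 10       # base reward (assumed form)
    multiplier = 1.0 + fool_rate * sample_multiplier  # adversarial multiplier (assumed form)
    return base * multiplier

# Generator with a 95% pass rate, fully ramped, fooling discriminators
# half the time at the reference sample size (s = 1.0):
print(total_generator_reward(0.95, 25, 0.5, 1.0))
```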
Each discriminator model is scored per modality using the sn34_score, which combines two metrics:
- Binary MCC (Matthews Correlation Coefficient) -- measures how well the model discriminates between real and synthetic content. Ranges from -1 (worst) to +1 (perfect).
- Brier Score -- measures calibration quality (how well predicted probabilities match actual outcomes). Ranges from 0 (perfect) to 0.25 (random baseline).
These metrics are combined, with default parameters, into a single sn34_score that rewards both strong discrimination (high MCC) and good calibration (low Brier score).
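The exact combination formula and default parameter values are not reproduced here. The sketch below shows one plausible weighted combination that normalizes both metrics to [0, 1] using their stated ranges; the normalization, weights, and equal defaults are assumptions, not the subnet's confirmed formula:

```python
def sn34_score(mcc: float, brier: float,
               w_mcc: float = 0.5, w_brier: float = 0.5) -> float:
    """Plausible discriminator score combining MCC and Brier score.

    Assumptions (not the confirmed sn34_score formula):
    - MCC in [-1, 1] is rescaled to [0, 1] via (mcc + 1) / 2.
    - Brier in [0, 0.25] is inverted and rescaled via 1 - brier / 0.25.
    - The two terms are mixed with weights w_mcc and w_brier.
    """
    mcc_term = (mcc + 1.0) / 2.0
    brier_term = 1.0 - brier / 0.25
    return w_mcc * mcc_term + w_brier * brier_term

# Perfect discrimination and calibration:
print(sn34_score(mcc=1.0, brier=0.0))   # 1.0
# Random guessing (MCC ~ 0, Brier ~ 0.25):
print(sn34_score(mcc=0.0, brier=0.25))  # 0.25
```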
The discriminator competition is organized into rounds. Each round introduces new benchmark datasets and evaluates all submitted models. Winners are determined per modality (image, video, audio) independently.
- New round begins: Benchmark datasets are updated (new GAS-Station data, potentially new static datasets). All modalities share the same benchmark version number.
- Models are benchmarked: All submitted discriminator models are evaluated against the current round's datasets and scored using sn34_score.
- Winner determined per modality: The highest-scoring model for each modality wins that round.
- Alpha reward: The round winner for each modality receives an alpha reward.
Each round is winner-take-all -- only the top-scoring discriminator for each modality receives the alpha reward for that round. This incentivizes miners to continuously improve their models and push the state of the art in AI-generated content detection.
Rounds progress as benchmark versions are incremented, ensuring that models are always evaluated against fresh, evolving data.