Implement machine-readable source of truth for data constraints and model validation rules #22
Description
Problem
The BETYdb Rails app enforces validation rules at the model layer (site.rb, trait.rb, cultivar.rb, etc.), but there is no consolidated, language-agnostic reference for these rules. Developers building ingestion pipelines, bulk upload tools, or extraction systems must read the Rails source code to learn what data is acceptable before submitting it.
This creates two real problems:
- Uploads fail at the server with cryptic ActiveRecord errors instead of failing early with clear feedback
- Every external tool (Python clients, R packages, custom pipelines) either re-implements these rules independently or skips validation entirely
Proposed Solution
Add a doc/validation_rules.yaml file that extracts all model-level validation rules into a structured, machine-readable format — covering required fields, numeric ranges, format constraints, uniqueness scopes, and conditional rules.
Example of what this enables:
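
    # Any ingestion pipeline can do this before uploading
    # (ValidationError stands for whatever exception type the pipeline defines):
    if not (-90 <= lat <= 90):
        raise ValidationError("lat out of range")

The error is caught locally instead of being discovered only after a failed POST to the server.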
Scope of Initial Extraction
| Model | Rules Covered |
|---|---|
| site.rb | sitename required, lat/lon/masl ranges, soil % constraints, geometry co-specification rule |
| trait.rb | mean + access_level required, per-variable mean range, date/time formats, stat/statname pairing |
| cultivar.rb | name required, uniqueness scoped to specie_id |
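
Two of the rules in the table are not simple per-field checks: the stat/statname pairing and cultivar uniqueness within a species. The sketch below shows how a bulk-upload tool might check both on the client side. It assumes records are plain dicts, reads the pairing as "statname is required when stat is present", and compares names exactly; the function names are hypothetical, and the actual Rails behavior should be confirmed against trait.rb and cultivar.rb.

```python
from collections import Counter
from typing import Any


def check_stat_pairing(trait: dict[str, Any]) -> list[str]:
    """Flag a trait whose stat is set but whose statname is blank."""
    stat = trait.get("stat")
    statname = trait.get("statname")
    if stat not in (None, "") and statname in (None, ""):
        return ["statname is required when stat is given"]
    return []


def find_duplicate_cultivars(
    cultivars: list[dict[str, Any]],
) -> list[tuple[Any, Any]]:
    """Return (name, specie_id) pairs that occur more than once in the batch."""
    counts = Counter((c.get("name"), c.get("specie_id")) for c in cultivars)
    return [key for key, n in counts.items() if n > 1]
```

A batch-level check like this only catches duplicates within the upload itself. Collisions with rows already in the database would still surface at the server.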
Preview of proposed YAML structure:

    sites:
      required_fields:
        - sitename
      numeric_fields:
        lat:
          min: -90
          max: 90
        lon:
          min: -180
          max: 180
        masl:
          min: -418
          max: 8848
      composite_constraints:
        - rule: sand_pct + clay_pct <= 100
    traits:
      required_fields:
        - mean
        - access_level
        - variable
      numeric_fields:
        access_level:
          allowed_values: [1, 2, 3, 4]
      conditional_requirements:
        - field: statname
          required_if: "stat is not blank"
    cultivars:
      required_fields:
        - name
        - specie_id
      unique_constraints:
        - fields: [name, specie_id]
          message: "Cultivar name has already been used for this species"

Why This Matters for the Project
BETYdb is increasingly used as a backend for automated pipelines (TERRA REF, PEcAn, extraction systems). As the data entry surface grows beyond the web UI, pre-validation becomes critical for data quality.
A single maintained YAML file is a low-cost, high-value addition that:
- Any tool in any language can consume directly
- Stays close to the source — easy to keep in sync when models change
- Serves as living documentation for contributors and integrators
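
As a rough illustration of the first point, a client might apply the rules with a small generic function. The sketch below is not part of the proposal: `RULES` is a hypothetical in-memory copy of part of the YAML preview above (a real tool would load doc/validation_rules.yaml with a YAML parser), and `validate_record` is an assumed helper name. It covers only `required_fields` and `numeric_fields`.

```python
from typing import Any

# Hypothetical in-memory form of part of the proposed validation_rules.yaml;
# a real tool would parse the file instead of hard-coding this dict.
RULES: dict[str, dict[str, Any]] = {
    "sites": {
        "required_fields": ["sitename"],
        "numeric_fields": {
            "lat": {"min": -90, "max": 90},
            "lon": {"min": -180, "max": 180},
            "masl": {"min": -418, "max": 8848},
        },
    },
    "traits": {
        "required_fields": ["mean", "access_level", "variable"],
        "numeric_fields": {"access_level": {"allowed_values": [1, 2, 3, 4]}},
    },
}


def validate_record(table: str, record: dict[str, Any]) -> list[str]:
    """Return a list of readable errors; an empty list means the record passes."""
    rules = RULES[table]
    errors = []
    for field in rules.get("required_fields", []):
        if record.get(field) in (None, ""):
            errors.append(f"{field} is required")
    for field, spec in rules.get("numeric_fields", {}).items():
        value = record.get(field)
        if value is None:
            continue  # absence is handled by required_fields
        if "min" in spec and value < spec["min"]:
            errors.append(f"{field} must be >= {spec['min']}")
        if "max" in spec and value > spec["max"]:
            errors.append(f"{field} must be <= {spec['max']}")
        if "allowed_values" in spec and value not in spec["allowed_values"]:
            errors.append(f"{field} must be one of {spec['allowed_values']}")
    return errors
```

Returning every error at once, rather than raising on the first, lets an upload tool report all problems in a record before anything is sent to the server.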
Next Steps
I am happy to submit a PR with:
- doc/validation_rules.yaml: initial extraction covering the three models above
- doc/validation_rules_README.md: short usage guide explaining how external tools can consume the file
Please let me know if there are preferred conventions for the doc/ directory or if this belongs elsewhere in the repo.