Implement machine-readable source of truth for data constraints and model validation rules #22

@Abhishek-Kumar-Rai5

Problem

The BETYdb Rails app enforces validation rules at the model layer (site.rb, trait.rb, cultivar.rb, etc.), but there is no consolidated, language-agnostic reference for these rules. Developers building ingestion pipelines, bulk upload tools, or extraction systems must read the Rails source code by hand to learn what data is acceptable before submitting it.

This creates two real problems:

  • Uploads fail at the server with cryptic ActiveRecord errors instead of failing early with clear feedback
  • Every external tool (Python clients, R packages, custom pipelines) either re-implements these rules independently or skips validation entirely

Proposed Solution

Add a doc/validation_rules.yaml file that captures the model-level validation rules in a structured, machine-readable format. It would cover required fields, numeric ranges, format constraints, uniqueness scopes, and conditional rules.

Example of what this enables:

# Any ingestion pipeline can do this before uploading:
if not (-90 <= lat <= 90):
    raise ValidationError("lat out of range")

The error then surfaces locally, before upload, instead of only after a failed POST to the server.
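The same check can be driven by data instead of being hard-coded per field. Below is a minimal sketch, assuming the range rules have already been parsed into a plain dict of field name to min/max. `NUMERIC_RULES` and `check_ranges` are hypothetical names used only for illustration, and the bounds are the lat/lon values from this proposal.

```python
# Hypothetical in-memory form of the range rules; in practice these would be
# parsed from the proposed rules file rather than written by hand.
NUMERIC_RULES = {
    "lat": {"min": -90, "max": 90},
    "lon": {"min": -180, "max": 180},
}


def check_ranges(record, rules=NUMERIC_RULES):
    """Return a list of human-readable range violations for one record."""
    errors = []
    for field, bounds in rules.items():
        value = record.get(field)
        if value is None:
            continue  # presence is a separate (required-field) check
        if not (bounds["min"] <= value <= bounds["max"]):
            errors.append(
                f"{field}={value} outside [{bounds['min']}, {bounds['max']}]"
            )
    return errors
```

Returning a list rather than raising on the first failure lets a pipeline report every out-of-range field in a record at once.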


Scope of Initial Extraction

| Model | Rules Covered |
| --- | --- |
| site.rb | sitename required, lat/lon/masl ranges, soil % constraints, geometry co-specification rule |
| trait.rb | mean + access_level required, per-variable mean range, date/time formats, stat/statname pairing |
| cultivar.rb | name required, uniqueness scoped to specie_id |
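The scoped-uniqueness rule for cultivars can be partly pre-checked on the client by looking for repeated (name, specie_id) pairs within a single batch. This is only a sketch: `duplicate_cultivars` is a hypothetical helper, it assumes exact (case-sensitive) name matching, and it cannot see rows that already exist in the database, so the server-side check is still needed.

```python
from collections import Counter


def duplicate_cultivars(rows):
    """Return (name, specie_id) pairs that occur more than once in a batch."""
    # Assumption: names are compared exactly, with no case folding or trimming.
    keys = Counter((row["name"], row["specie_id"]) for row in rows)
    return sorted(key for key, count in keys.items() if count > 1)
```

The same name under two different species is not flagged, which matches the "scoped to specie_id" wording above.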

Preview of proposed YAML structure:

sites:
  required_fields:
    - sitename
  numeric_fields:
    lat:
      min: -90
      max: 90
    lon:
      min: -180
      max: 180
    masl:
      min: -418
      max: 8848
  composite_constraints:
    - rule: sand_pct + clay_pct <= 100

traits:
  required_fields:
    - mean
    - access_level
    - variable
  numeric_fields:
    access_level:
      allowed_values: [1, 2, 3, 4]
  conditional_requirements:
    - field: statname
      required_if: "stat is not blank"

cultivars:
  required_fields:
    - name
    - specie_id
  unique_constraints:
    - fields: [name, specie_id]
      message: "Cultivar name has already been used for this species"
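
To illustrate how a client might consume the `traits` block, here is a rough sketch of a validator that handles required fields, allowed values, and the conditional rule. The rules are written as a Python dict that mirrors the preview so the example is self-contained; a real consumer would parse the YAML file instead. `TRAIT_RULES`, `is_blank`, and `validate_trait` are hypothetical names, and the parser for `required_if` assumes the only condition form is "<field> is not blank".

```python
# Hypothetical dict mirroring the `traits` block of the preview above.
TRAIT_RULES = {
    "required_fields": ["mean", "access_level", "variable"],
    "numeric_fields": {"access_level": {"allowed_values": [1, 2, 3, 4]}},
    "conditional_requirements": [
        {"field": "statname", "required_if": "stat is not blank"},
    ],
}


def is_blank(value):
    """Treat None and whitespace-only strings as blank (0 is not blank)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def validate_trait(record, rules=TRAIT_RULES):
    """Collect every violation instead of stopping at the first one."""
    errors = []
    for field in rules["required_fields"]:
        if is_blank(record.get(field)):
            errors.append(f"{field} is required")
    for field, spec in rules["numeric_fields"].items():
        allowed = spec.get("allowed_values")
        value = record.get(field)
        if allowed is not None and not is_blank(value) and value not in allowed:
            errors.append(f"{field} must be one of {allowed}")
    for cond in rules["conditional_requirements"]:
        # Assumption: conditions always read "<field> is not blank".
        trigger, sep, _ = cond["required_if"].partition(" is not blank")
        if not sep:
            raise ValueError(f"unsupported condition: {cond['required_if']}")
        if not is_blank(record.get(trigger)) and is_blank(record.get(cond["field"])):
            errors.append(f"{cond['field']} is required when {trigger} is set")
    return errors
```

One design question this raises: free-text conditions such as "stat is not blank" need a small parser in every consumer, so a more structured form for conditional rules may be worth considering in the final schema.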

Why This Matters for the Project

BETYdb is increasingly used as a backend for automated pipelines (TERRA-REF, PEcAn, extraction systems). As the data entry surface grows beyond the web UI, pre-validation becomes critical for data quality.

A single maintained YAML file is a low-cost, high-value addition that:

  • Any tool in any language can consume directly
  • Stays close to the source — easy to keep in sync when models change
  • Serves as living documentation for contributors and integrators

Next Steps

I am happy to submit a PR with:

  1. doc/validation_rules.yaml — initial extraction covering the three models above
  2. doc/validation_rules_README.md — short usage guide explaining how external tools can consume the file

Please let me know if there are preferred conventions for the doc/ directory or if this belongs elsewhere in the repo.
