Background
The Postgres implementation of betydb had a number of data constraints. These did a great job of enforcing data integrity (value ranges, foreign keys, uniqueness, etc). When migrating to a CSV based dataset, these database constraints are no longer automatically enforced.
Constraints are
Scope
It is not necessary to replicate all constraints, which would be a lot of work with diminishing returns (there are a lot!).
- translate PostgreSQL constraints to validation within R, and run by GH Actions
- focus on useful constraints only
- only focus on tables included in this repository
⚠️ Do not use the original design and implementation as a checklist.
⚠️ Don't overthink it. the original set of constraints took a lot of time to develop.
Approach
- Prioritize constraints
- Discuss implementation approach here. Some options:
- validation functions, e.g. in data-raw/validation.R, called by data-raw/make-data.R
- testthat tests, e.g. tests/testthat/test-data-constraints.R
- combination of above
- Other
Large chunks of this work, especially translating constraints to R, may be well suited to LLMs because they are formally defined and the most important ones are implemented in SQL.
Priorities
General Approach
Prioritize constraints that:
- prevent real data corruption
- avoid complex cross-table logic
- are easy to understand and maintain
Value constraints
- numeric ranges (e.g., percentages between 0–100, precipitation ≥0)
- positive counts (
n ≥ 0)
- sanity bounds for variables (min, max from variables table)
Uniqueness constraints
- natural keys that prevent duplicate rows
Non-NULL constraints
On critical fields required to interpret measurements; natural keys.
Standardization
- whitespace normalization
- canonical values for units or categorical variables
Deliverables
- R validation functions implementing key constraints
- automated tests ensuring they run in CI
- documentation describing which constraints are enforced
Background
The Postgres implementation of betydb had a number of data constraints. These did a great job of enforcing data integrity (value ranges, foreign keys, uniqueness, etc). When migrating to a CSV based dataset, these database constraints are no longer automatically enforced.
Constraints are
Scope
It is not necessary to replicate all constraints, which would be a lot of work with diminishing returns (there are a lot!).
Approach
Large chunks of this work, especially translating constraints to R, may be well suited to LLMs because they are formally defined and the most important ones are implemented in SQL.
Priorities
General Approach
Prioritize constraints that:
Value constraints
n ≥ 0)Uniqueness constraints
Non-NULL constraints
On critical fields required to interpret measurements; natural keys.
Standardization
Deliverables