Summary
Polars supports arbitrary relational algebra, but Dataframely's Collection currently restricts cross-member relational structure to a single pattern: common primary key.
While workable for representing single "'semantic objects' which cannot be represented in a single data frame", the current design lacks native support for standard relational patterns (e.g., non-identifying relationships). Furthermore, the implicit and monolithic nature of the approach results in validation of referential integrity — a fundamental, structural constraint — being delegated to imperative and opaque @dy.filter methods.
This RFC proposes generalizing the Collection abstraction to be explicitly relational by introducing declarative foreign keys, enabling Dataframely to make Polars data pipelines more robust and readable regardless of relational pattern.
Motivation
Currently, Collection derives cross-member relationships implicitly by computing a _common_primary_key, defined as the intersection of primary-key columns (matching by name) across member Schemas.
This approach has several material limitations. Consider the example from the user guide:

    import dataframely as dy
    import polars as pl

    class Invoice(dy.Schema):
        invoice_id = dy.String(primary_key=True)
        ...

    class Diagnosis(dy.Schema):
        invoice_id = dy.String(primary_key=True)
        diagnosis_code = dy.String(primary_key=True)
        ...

    class HospitalClaims(dy.Collection):
        invoices: dy.LazyFrame[Invoice]
        diagnoses: dy.LazyFrame[Diagnosis]

        @dy.filter()
        def at_least_one_diagnosis_per_invoice(self) -> pl.LazyFrame:
            # Keep only invoices that have at least one matching diagnosis.
            return self.invoices.join(
                self.diagnoses.select(pl.col("invoice_id").unique()),
                on="invoice_id",
                how="inner",
            )

Tightly-Coupled Schemas
In order to participate in a Collection, a Schema must forfeit independent control over its own column names and identity (i.e., primary key).
- Column Names: Invoice and Diagnosis must coordinate column names (see invoice_id) so that HospitalClaims (the Collection) can "see" (i.e., implicitly derive) the relationship.
- Identity (Primary Key): Diagnosis is forced to use a composite primary key that includes invoice_id, precluding the standard relational pattern where Diagnosis has its own independent identity (e.g., a surrogate Diagnosis.id key) and references Invoice.id via a Diagnosis.invoice_id foreign key. Currently, the "N-side" Schema in 1-N relationships must be modeled as a weak entity; non-identifying relationships are impossible.
Opaque, Monolithic Relationships
Constraints on inter-member relationships (like at_least_one_diagnosis_per_invoice) are currently expressed via @dy.filter validation rules. This design is problematic for several reasons:
- Opaque: @dy.filter validation rules are effectively black boxes; the Collection does not "know about" the relationship constraints the rule expresses.
- Imperative: The user must manually write the join (or remember to call a functional helper).
- Violates Separation of Concerns: The "maximum cardinality" (1-1 vs 1-N) of this relationship is fundamentally determined by whether the Diagnosis.invoice_id foreign key has a unique constraint, which is a Schema (i.e., Diagnosis) concern.
- Conflates Validation Failures: The symmetrical inner join fails to make an important semantic distinction: if a LHS row is dropped, there exists an Invoice that doesn't have any Diagnosis "children" (a domain rule failure), but if a RHS row is dropped, there exists a Diagnosis that references an Invoice that doesn't exist (a referential integrity failure). These distinct failures are both reported as failing the same monolithic validation rule.
- Induces False-Positive Validation Failures: Given that the purpose of @dy.filter rules is (ostensibly) to define arbitrary cross-member domain rules, validating them requires that members first be "pruned" (i.e., filtered) of rows that violate their respective Schema constraints (validate() calls filter() in its implementation). When a structural relationship is expressed inside a @dy.filter rule, this destructive pruning phase induces false-positive validation failures (see Example below).
Example: "False Orphan" Validation Errors
If a row in invoices fails its Invoice.amount column's min_exclusive=0 constraint, it is removed.
But now any rows in diagnoses that reference the removed row (via Diagnosis.invoice_id) must be removed as well, as they reference a record that "doesn't exist".
Consequently, validate() reports that diagnoses failed validation w.r.t. the at_least_one_diagnosis_per_invoice rule (for each row that references the invalid row in invoices).
This is not useful information in a validation context:
- The rule name suggests a failure of an "at-least-one-child" rule w.r.t. invoices, whereas the actual "culprit" is a violation of referential integrity in the other direction.
- There is no violation of referential integrity! The "false orphans" are an artifact of the destructive filtering process.
The failure of a node (invoices) w.r.t. its internal invariants (~(Invoice.amount.col > 0)) is leaking, resulting in additional, false-positive relationship failures.
Imperative Data Generation
As noted in the data generation guide, dataframely cannot automatically infer 1-N relationships because they are expressed via opaque @dy.filter methods. Therefore, the user must write low-level, imperative Python code (overriding _preprocess_sample()) to sample Collections such as HospitalClaims.
Proposed API
The proposed API is designed to declare and leverage relational structure and to respect the separation of concerns outlined in the current documentation:
- Schema is responsible for invariants within frames.
- Collection is responsible for invariants across frames.
Foreign Keys
This RFC proposes introducing a foreign_keys declarative class attribute and a dy.ForeignKey primitive.

    class HospitalClaims(dy.Collection):
        invoices: dy.LazyFrame[Invoice]
        diagnoses: dy.LazyFrame[Diagnosis]

        foreign_keys = [
            dy.ForeignKey(Diagnosis.invoice_id, references=Invoice.id),
        ]

The following properties of a dy.ForeignKey relationship now emerge from properties of the foreign key Column(s) that define them (e.g., Diagnosis.invoice_id):
- Maximum Cardinality (1-1 vs 1-N): Determined by the unique= property.
- Minimum Cardinality (Optionality): Determined by the nullable= property.
Whether each Invoice must have at least one Diagnosis "child" is actually a domain rule (i.e., not required for referential integrity). But this property can be elegantly declared via an optional parameter to dy.ForeignKey:

    dy.ForeignKey(
        Diagnosis.invoice_id,
        references=Invoice.id,
        require_at_least_one_child=True,
    )

Composite keys are naturally supported; the user simply passes a Sequence[Column].

    class ForeignKey:
        def __init__(
            self,
            columns: Column | Sequence[Column],
            /,
            *,
            references: Column | Sequence[Column],
            require_at_least_one_child: bool = False,
        ): ...

Note
Pythonic precedent for declarative class attributes:
- Pydantic's model_config
- SQLAlchemy's declarative table configuration
Note
Self-referential relationships (e.g., Employee.manager_id references Employee.id) are out-of-scope for this API, as they constitute an intra-frame invariant and are therefore a Schema responsibility.
Unique Constraints
Given that the "maximum cardinality" (1-N vs 1-1) of a foreign key relationship is fundamentally determined by whether there is a unique constraint on the foreign key column(s), this RFC proposes making unique constraints a first-class, declarative property. This requires:
- a unique: bool = False parameter for the Columns API
- a composite_unique_constraints declarative class attribute for Schema
The latter enables Schemas to know about their composite unique constraints, which currently must be expressed via custom (i.e., opaque) @dy.rules. This is important given that foreign keys can be composite.
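As a rough illustration of what a composite unique constraint has to check, the plain-Python sketch below finds key tuples that occur more than once across a set of columns. The helper name `duplicated_keys` and the sample rows are hypothetical; a real implementation would presumably express this as a Polars expression.

```python
from collections import Counter
from typing import Sequence


def duplicated_keys(rows: Sequence[dict], columns: Sequence[str]) -> set:
    """Return the key tuples that occur more than once across `columns`."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {key for key, n in counts.items() if n > 1}


# Illustrative rows: the pair (order_id, leg) is meant to be unique.
shipments = [
    {"order_id": "o1", "leg": 1},
    {"order_id": "o1", "leg": 2},
    {"order_id": "o1", "leg": 2},  # duplicate (order_id, leg) pair
]

composite_violations = duplicated_keys(shipments, ["order_id", "leg"])
single_violations = duplicated_keys(shipments, ["order_id"])
```

A single-column unique constraint is the one-column special case of the same check, which is why both forms can share one mechanism.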
Cross-Member Rules
Explicit relational structure enables a declarative API for expressing domain rules across members connected by dy.ForeignKey relationships.
This RFC proposes a @dy.cross_member_rule API that abstracts away the construction of the relevant cross-member context.

    class Order(dy.Schema):
        id = dy.String(primary_key=True)
        placed_at = dy.Datetime()
        ...

    class Shipment(dy.Schema):
        id = dy.String(primary_key=True)
        order_id = dy.String()
        dispatched_at = dy.Datetime()
        ...

    class Fulfillments(dy.Collection):
        orders: dy.LazyFrame[Order]
        shipments: dy.LazyFrame[Shipment]

        foreign_keys = [
            dy.ForeignKey(Shipment.order_id, references=Order.id)
        ]

        @dy.cross_member_rule()
        def valid_dispatch_time(cls) -> pl.Expr:
            return Shipment.dispatched_at.col >= Order.placed_at.col

If member Schemas have overlapping column names, rule expression is unaffected; the chore of namespacing is abstracted away.

    ...
    # fully-qualified names used under the hood
    return Shipment.timestamp.col >= Order.timestamp.col

When multiple paths (of dy.ForeignKey relationships) exist between the members involved in a @dy.cross_member_rule, then "path qualification" is required to define a rule unambiguously. The user must therefore provide a paths= argument to the decorator, which defines a mapping from "path aliases" to relevant paths. Within the rule body, the fluent Column.via("alias") method enables declarative path qualification.

    class Account(dy.Schema):
        id = dy.String(primary_key=True)
        currency_code = dy.String()
        ...

    class Transfer(dy.Schema):
        id = dy.String(primary_key=True)
        source_account_id = dy.String()
        dest_account_id = dy.String()
        amount = dy.Decimal(min_exclusive=0)
        exchange_rate = dy.Decimal(min_exclusive=0)
        ...

    class Payments(dy.Collection):
        accounts: dy.LazyFrame[Account]
        transfers: dy.LazyFrame[Transfer]

        foreign_keys = [
            dy.ForeignKey(Transfer.source_account_id, references=Account.id),
            dy.ForeignKey(Transfer.dest_account_id, references=Account.id),
        ]

        @dy.cross_member_rule(paths={
            "source": Transfer.source_account_id,
            "dest": Transfer.dest_account_id,
        })
        def valid_intra_currency_fx_rate(cls) -> pl.Expr:
            """
            For same-currency transfers, the Foreign Exchange (FX) rate must
            be exactly `1.0`.
            """
            is_same_currency = (
                Account.currency_code.via("source").col
                == Account.currency_code.via("dest").col
            )
            return ~is_same_currency | (Transfer.exchange_rate.col == 1.0)

The distinct "roles" that the Account.currency_code column plays within the cross-member context are defined by the distinct relational paths (selected with via()) by which it is joined into that context.
Note
Progressive Disclosure: The user only needs to care about paths when the unambiguous expression of a rule requires it. If the user fails to provide a paths argument when required, Collection fails fast with a context-rich, actionable error message that calls out the specific topological ambiguity.
Note
No group_by= parameter is required; Polars window-function expressions perform aggregations on groups while preserving the grain of the context.

    @dy.cross_member_rule()
    def total_matches_lines(cls) -> pl.Expr:
        return (
            Invoice.total.col
            == InvoiceLine.amount.col.sum().over(Invoice.id.name)
        )

Note
Handles "multi-hop" paths
Because a path alias binds to the entire relational path, any member Column along a "multi-hop" route can be qualified using the exact same alias.
A dy.ForeignKeyPath primitive eliminates the risk of conflating a "single-hop" path defined by a composite foreign key (Sequence[dy.Column]) with a multi-hop path (a contiguous Sequence of foreign keys). In the standard "single-hop" case, the user can simply pass a dy.Column or Sequence[dy.Column].

    EdgeDef: TypeAlias = dy.Column | Sequence[dy.Column]
    PathDef: TypeAlias = EdgeDef | dy.ForeignKeyPath

Cross-Member Filters
A lower-level API similar to @dy.filter is still required for expressing cross-member invariants involving more dynamic types of relationships. Use cases include expressing conditional participation constraints and constraints on relationships determined by temporal proximity (via join_asof).
This RFC proposes a slightly modified version, @dy.cross_member_filter, which receives a collection as input and must return a dict mapping the name of each relevant (i.e., filtered) member to a data frame where:
- the columns set contains the primary key of the member's Schema
- the rows represent the valid subset (w.r.t. the rule) of the member's original rows

    class Fulfillments(dy.Collection):
        orders: dy.LazyFrame[Order]  # has categorical `status` column
        shipments: dy.LazyFrame[Shipment]

        foreign_keys = [
            dy.ForeignKey(Shipment.order_id, references=Order.id)
        ]

        # ... cross-member rules ...

        @dy.cross_member_filter()
        def shipped_orders_have_shipments(self) -> dict[str, pl.LazyFrame]:
            shipped_orders = self.orders.filter(
                Order.status.col.is_in(["SHIPPED", "DELIVERED"])
            )
            invalid_orders = shipped_orders.join(
                self.shipments,
                left_on=Order.id.name,
                right_on=Shipment.order_id.name,
                how="anti",
            )
            valid_orders = self.orders.join(
                invalid_orders,
                on=Order.id.name,
                how="anti",
            )
            return {"orders": valid_orders}

Declarative Data Generation
Explicit relational structure enables declarative data generation for Collections with 1-N dy.ForeignKey relationships.

    HospitalClaims.sample(counts={
        "invoices": 10,
        "diagnoses": dy.random.Poisson(lam=1),
    })

Lightweight distribution objects from dy.random can be used to control the "cardinality behavior" of 1-N dy.ForeignKey relationships.

    from dataclasses import dataclass

    @dataclass
    class Poisson:
        lam: float

    @dataclass
    class Uniform:
        upper_bound: int
        lower_bound: int = 1

Alternatively, the user can pass:
- an int, which specifies a deterministic child count.
- nothing, in which case sample() uses a sensible default (e.g., dy.random.Uniform(upper_bound=3)).
When a member is the "child" of multiple dy.ForeignKey relationships, then the user must specify which will act as the "driver" of that child's row count. This is done via the optional count_from parameter, which maps member names to foreign-key column(s).

    Payments.sample(
        counts={
            "accounts": 10,
            "transfers": dy.random.Poisson(lam=5),
        },
        count_from={"transfers": Transfer.source_account_id},
    )

If the user passes configuration to counts that is incoherent with respect to the defined structural properties of the Collection, sample() raises a ValueError detailing the structural mismatch. "Minimum cardinality" (optionality) constraints are enforced automatically under the hood.
Design Details
The declarative foreign keys API defines a DAG where member frames are nodes and dy.ForeignKey relationships are directed edges.
Collection Definition
Integrity Checks
Collection fails fast at class-creation time (raising an ImplementationError) if any of the following properties do not hold:
- Valid Columns: all Columns passed to dy.ForeignKeys belong to registered members (i.e., their Schemas)
- Valid Foreign Keys: all dy.ForeignKeys reference Columns that are the primary key of their respective Schemas (or are otherwise declared as unique)
- Unique Schemas: Schemas are unique across members (see callout below)
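The three checks above could look roughly like the following plain-Python sketch. `Col`, `FK`, and `check_collection` are hypothetical, simplified stand-ins (single-column keys only, schemas identified by name), and `ImplementationError` is redefined locally so the sketch runs on its own.

```python
from dataclasses import dataclass


class ImplementationError(Exception):
    """Local stand-in for the error type named in the text."""


@dataclass(frozen=True)
class Col:  # hypothetical stand-in for a dy.Column
    schema: str
    name: str
    unique: bool = False  # True for primary keys or declared-unique columns


@dataclass(frozen=True)
class FK:  # hypothetical stand-in for dy.ForeignKey (single-column case)
    column: Col
    references: Col


def check_collection(members: dict, foreign_keys: list) -> None:
    """Validate a collection definition; `members` maps member name -> schema name."""
    schemas = list(members.values())
    # Unique Schemas: no schema may back more than one member.
    if len(set(schemas)) != len(schemas):
        raise ImplementationError("schemas must be unique across members")
    for fk in foreign_keys:
        # Valid Columns: both ends must belong to a registered member.
        for col in (fk.column, fk.references):
            if col.schema not in schemas:
                raise ImplementationError(f"{col.schema}.{col.name} is not a member column")
        # Valid Foreign Keys: the referenced column must be unique.
        if not fk.references.unique:
            raise ImplementationError(
                f"{fk.references.schema}.{fk.references.name} is not unique"
            )


members = {"invoices": "Invoice", "diagnoses": "Diagnosis"}
invoice_id = Col("Invoice", "id", unique=True)
diagnosis_invoice_id = Col("Diagnosis", "invoice_id")
check_collection(members, [FK(diagnosis_invoice_id, invoice_id)])  # passes silently
```

Because schemas are unique across members, a column's schema is enough to identify its member, which is what makes the column-based `dy.ForeignKey` signature unambiguous.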
Note
Why enforce unique Schemas across members?
Given that Collections are used to define and validate multi-frame data contracts for data-pipeline functions, wrapping a Collection around fragmented pieces of the exact same entity (e.g., defining historic_invoices and current_invoices as separate members) is arguably a fundamental misuse of the abstraction, reflecting either a deferred concat operation or an attempt to use the name of a member to encode a dimension.
By enforcing this constraint, we are rewarded with a strictly-typed API; the user can pass actual Columns (i.e., Schema attributes) to dy.ForeignKey, and Collection (behind the scenes) can unambiguously resolve which members they belong to. The user enjoys full editor support (e.g., refactor safety).
(If a user genuinely needs multiple members with the exact same Schema, they can do so by subclassing, e.g., class HistoricInvoice(Invoice): pass.)
Foreign Keys Registration
The foreign_keys are parsed into a directed graph and topologically sorted (failing fast if a cycle is detected). This pre-computed topological metadata powers the filter() and sample() methods, which must execute in the correct dependency order.
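One way to sketch this registration step is with the standard library's `graphlib`. The function name `member_order` and the `(child, parent)` tuple encoding are assumptions made for illustration; the RFC does not prescribe an implementation.

```python
from graphlib import CycleError, TopologicalSorter


def member_order(foreign_keys: list) -> list:
    """Order members so every referenced ("parent") member precedes its children.

    Each foreign key is given as a (child_member, parent_member) pair.
    """
    graph: dict = {}
    for child, parent in foreign_keys:
        graph.setdefault(child, set()).add(parent)  # the child depends on the parent
        graph.setdefault(parent, set())
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception's args.
        raise ValueError(f"foreign keys form a cycle: {exc.args[1]}") from exc


hospital_order = member_order([("diagnoses", "invoices")])
```

`filter()` and `sample()` could then iterate over this order (or its reverse) instead of recomputing dependencies on every call.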
Cross-Member Rule Resolution
At class-creation time, the metaclass extracts the polars.Exprs from the @dy.cross_member_rule functions. It performs this extraction within a strictly-scoped ambient context (via Python's contextvars.ContextVar). This ephemeral state toggles the behavior of Column's .col and .name properties such that they perform:
- Source Qualification: Emit "source-qualified" column expressions and names (formatted as "{SchemaName}.{column_name}").
- Member Discovery: Register the column's "owner" (i.e., Schema) into the "trace" context upon property access.
A context manager guarantees that this behavior is hermetically sealed within the @dy.cross_member_rule registration process.
The metaclass then uses the discovered members and the foreign_keys DAG to resolve the (how="inner") join plans. Treating the graph as undirected, it seeks the minimum connecting tree for the required members. If no such tree exists, or if multiple exist and the paths argument to the decorator is not provided, it raises a descriptive error. The resolved join plans are saved as metadata in the Collection alongside the extracted, qualified rule expressions.
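For the two-member case, path resolution and ambiguity detection could be sketched as below. This is one reading of the text: it treats each foreign key as an undirected edge, keeps only the shortest paths, and raises if none or several remain. The tuple encoding and function names are assumptions, and the general multi-member (tree) case is not covered.

```python
from collections import defaultdict


def connecting_paths(edges: list, start: str, goal: str) -> list:
    """Enumerate simple paths from `start` to `goal` as lists of FK column names.

    `edges` holds (child_member, parent_member, fk_column_name) tuples.
    """
    adjacency = defaultdict(list)
    for child, parent, fk in edges:
        adjacency[child].append((parent, fk))
        adjacency[parent].append((child, fk))

    paths = []

    def walk(node, visited, trail):
        if node == goal:
            paths.append(trail)
            return
        for neighbour, fk in adjacency[node]:
            if neighbour not in visited:
                walk(neighbour, visited | {neighbour}, trail + [fk])

    walk(start, frozenset({start}), [])
    return paths


def resolve_join_path(edges: list, start: str, goal: str) -> list:
    """Return the unique shortest path, or raise if none or several exist."""
    found = connecting_paths(edges, start, goal)
    if not found:
        raise ValueError(f"no foreign-key path connects {start!r} and {goal!r}")
    shortest = min(len(path) for path in found)
    candidates = [path for path in found if len(path) == shortest]
    if len(candidates) > 1:
        raise ValueError(
            f"ambiguous paths between {start!r} and {goal!r}: {candidates}; "
            "pass paths= to disambiguate"
        )
    return candidates[0]


fulfillments = [("shipments", "orders", "order_id")]
payments = [
    ("transfers", "accounts", "source_account_id"),
    ("transfers", "accounts", "dest_account_id"),
]
fulfillment_path = resolve_join_path(fulfillments, "shipments", "orders")
```

Calling `resolve_join_path(payments, "transfers", "accounts")` raises, because both foreign keys connect the same two members; this is the situation in which the `paths=` argument is required.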
Note
Safety: The @dy.cross_member_rule decorator (like @dy.rule) is a registration mechanism, used to bind a rule name to a static polars.Expr. The Collection completely owns the execution lifecycle of the decorated "expression factory" function.
Validation
In order for Collection.validate() to deliver precise diagnostic information, it must validate orthogonal constraints independently.
Structural Constraints
Validation of structural dy.ForeignKey constraints requires only that member rows are type-safe w.r.t. their Schema; full domain validity (as enforced by Schema.filter()) is an orthogonal concern. The following constraints are therefore validated for all type-safe member rows (which are isolated via Schema's DtypeCastRule mechanism).
- Referential Integrity: For each dy.ForeignKey in foreign_keys, perform an anti join from the referencing member to the referenced member; record any surviving rows as ReferentialIntegrityFailures.
- At-Least-One-Child: For each dy.ForeignKey in foreign_keys where require_at_least_one_child=True, perform an anti join from the referenced ("parent") member to the referencing ("child") member; record any surviving rows as AtLeastOneChildFailures.
Note
Validation of "maximum cardinality" (1-N vs 1-1) is now a Schema responsibility.
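The two structural checks are the same anti join run in opposite directions. The plain-Python sketch below uses lists of dicts in place of Polars frames; the `anti_join` helper and the sample rows are illustrative only.

```python
def anti_join(left: list, right: list, left_on: str, right_on: str) -> list:
    """Rows of `left` whose key has no match in `right` (an anti join)."""
    right_keys = {row[right_on] for row in right}
    return [row for row in left if row[left_on] not in right_keys]


invoices = [{"id": "A"}, {"id": "B"}]
diagnoses = [
    {"id": "d1", "invoice_id": "A"},
    {"id": "d2", "invoice_id": "Z"},  # references a missing invoice
]

# Child -> parent: diagnoses that point at a non-existent invoice.
referential_integrity_failures = anti_join(diagnoses, invoices, "invoice_id", "id")

# Parent -> child: invoices without any diagnosis.
at_least_one_child_failures = anti_join(invoices, diagnoses, "id", "invoice_id")
```

Running the checks separately is what lets the two failure kinds be reported under different names instead of one monolithic rule.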
Domain Constraints
Validation of cross-member domain constraints (rules & filters) requires that member rows fully respect their Schema constraints (to prevent Polars compute errors and ensure coherent domain logic). The following constraints are therefore validated for member rows that survive Schema.filter().
- Cross-Member Rules: Perform the resolved inner join(s) and filter the resulting cross-member context by the negation of the registered rule expression; record any surviving rows as CrossMemberRuleFailures.
- Cross-Member Filters: For each filtered member frame returned by the user's function, perform an anti join from the corresponding "baseline" version to the filtered one; record any surviving rows as CrossMemberFilterFailures. The "baseline" version has been pruned of rows that violate Schema constraints (via Schema.filter()) as well as rows that (potentially as a consequence) violate structural dy.ForeignKey constraints, which prevents "double reporting" of these failures. ("Cascade" pruning for structural constraints is discussed in the Filtering section below.)
Filtering
Once member rows that violate either their respective Schema constraints or any cross-member domain constraints (rules & filters) have been filtered out, a two-phase, topologically-aware "filter cascade" guarantees integrity w.r.t. structural dy.ForeignKey relationship constraints:
- Downward Sweep (Referential Integrity): Traversing the DAG in topological order, perform semi joins from referenced ("parent") frames to the referencing ("child") frames, removing any child rows that were (or have become) orphaned.
- Upward Sweep (At-Least-One-Child): Reversing the traversal (for relationships where require_at_least_one_child=True), perform semi joins from referencing ("child") frames to the referenced ("parent") frames, removing any parent rows that were (or have become) childless.
The entire process compiles to a single lazy query plan. The ignored_in_filters and propagate_row_failures parameters for CollectionMembers are no longer required.
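A plain-Python sketch of the two sweeps, for a simple three-level chain, might look as follows. The tuple encoding of foreign keys, the `cascade` and `semi_join` helpers, and the sample frames are all assumptions made for illustration; the real implementation would build lazy Polars semi joins instead.

```python
def semi_join(left: list, right: list, left_on: str, right_on: str) -> list:
    """Rows of `left` whose key has a match in `right` (a semi join)."""
    keys = {row[right_on] for row in right}
    return [row for row in left if row[left_on] in keys]


def cascade(frames: dict, foreign_keys: list) -> dict:
    """Apply the downward and upward sweeps.

    `foreign_keys` holds (child, fk_col, parent, pk_col, require_child) tuples,
    listed in topological order (parents before children).
    """
    frames = dict(frames)
    # Downward sweep: drop child rows whose parent row is gone.
    for child, fk_col, parent, pk_col, _ in foreign_keys:
        frames[child] = semi_join(frames[child], frames[parent], fk_col, pk_col)
    # Upward sweep: drop parent rows left without children, where required.
    for child, fk_col, parent, pk_col, require_child in reversed(foreign_keys):
        if require_child:
            frames[parent] = semi_join(frames[parent], frames[child], pk_col, fk_col)
    return frames


frames = {
    "customers": [{"id": "c1"}, {"id": "c2"}],
    "orders": [
        {"id": "o1", "customer_id": "c1"},
        {"id": "o2", "customer_id": "c3"},  # orphan: customer c3 does not exist
        {"id": "o3", "customer_id": "c2"},  # has no shipment
    ],
    "shipments": [
        {"id": "s1", "order_id": "o1"},
        {"id": "s2", "order_id": "o2"},  # becomes orphaned once o2 is dropped
    ],
}
foreign_keys = [
    ("orders", "customer_id", "customers", "id", False),
    ("shipments", "order_id", "orders", "id", True),
]
filtered = cascade(frames, foreign_keys)
```

In this example the downward sweep removes `o2` and then `s2`, and the upward sweep removes `o3`, leaving one order with one shipment and both customers.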
Data Generation
Collection.sample() traverses the DAG of dy.ForeignKey relationships in topological order to automate dependent data generation using vectorized Polars.
First, "root" member frames are generated independently. Then, the passed counts configuration is used to assign a child count to each parent row (which is clipped to respect structural cardinality bounds). The parent's primary key values are then propagated to the child's foreign key via .repeat_by(child_count).explode().
When a member is the "child" of multiple dy.ForeignKey relationships, the count_from parameter dictates which parent acts as the "driver" for the row expansion. "Non-driving" foreign keys are resolved by randomly sampling from the primary keys of their respective parent frames.
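The row-expansion step can be approximated in plain Python as shown below. `expand_children` and its `min_children` / `max_children` parameters are hypothetical names for the clipping described above, and a uniform draw stands in for the distribution objects, since the standard library has no Poisson sampler.

```python
import random
from typing import Callable, Optional, Sequence


def expand_children(
    parent_keys: Sequence[str],
    draw_count: Callable[[], int],
    min_children: int = 0,
    max_children: Optional[int] = None,
) -> list:
    """Repeat each parent key once per sampled child row.

    This mirrors `.repeat_by(child_count).explode()`; counts are clipped to
    the cardinality bounds before the keys are repeated.
    """
    child_fk = []
    for key in parent_keys:
        count = max(draw_count(), min_children)
        if max_children is not None:
            count = min(count, max_children)
        child_fk.extend([key] * count)
    return child_fk


rng = random.Random(0)  # seeded only to make the sketch repeatable
account_ids = ["a1", "a2", "a3"]

# Driving foreign key: between one and three transfers per source account.
source_account_id = expand_children(account_ids, lambda: rng.randint(1, 3), min_children=1)

# Non-driving foreign key: sampled at random from the parent's primary keys.
dest_account_id = [rng.choice(account_ids) for _ in source_account_id]

# A unique foreign key (1-1) caps the count at one child per parent.
one_to_one = expand_children(account_ids, lambda: rng.randint(0, 5), max_children=1)
```

Only the driving foreign key determines the child row count; the non-driving column is filled afterwards to the same length.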
Self-Aware Columns
To allow passing Schema attributes (i.e., Columns) directly to dy.ForeignKey, Column instances must know their parent Schema and name. We can achieve this by implementing __set_name__(self, owner, name) via the Descriptor Protocol (PEP 487). SchemaMeta._get_metadata_recursively will be updated to shallow-copy inherited columns so __set_name__ does not overwrite the owner on shared column instances.
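A minimal sketch of the idea, using simplified stand-in classes rather than Dataframely's real `Column` and `SchemaMeta`: the metaclass copies inherited columns into the subclass namespace so that `__set_name__` binds each copy to the class that actually owns it.

```python
import copy


class Column:  # hypothetical stand-in for dy.Column
    owner = None
    name = None

    def __set_name__(self, owner, name):
        # Called by type.__new__ for every column found in the class namespace.
        self.owner, self.name = owner, name


class SchemaMeta(type):
    def __new__(mcls, clsname, bases, namespace):
        # Give the subclass its own shallow copy of every inherited column ...
        for base in bases:
            for attr, value in vars(base).items():
                if isinstance(value, Column) and attr not in namespace:
                    namespace[attr] = copy.copy(value)
        # ... so that __set_name__ runs on the copies, not on the parent's columns.
        return super().__new__(mcls, clsname, bases, namespace)


class Schema(metaclass=SchemaMeta):
    pass


class Invoice(Schema):
    id = Column()


class HistoricInvoice(Invoice):
    pass
```

With this arrangement `Invoice.id` and `HistoricInvoice.id` are distinct objects with distinct owners, which is what lets a subclassed schema act as a separate member in a foreign key declaration.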
Alternatives Considered
#295 also identifies the lack of support for foreign key relationships. However, the solution proposed there (an ORM-style approach) does not appear to be the right fit for Dataframely.

    # Example from #295
    class Country(dy.Schema):
        country_code = dy.String(primary_key=True)
        capital = dy.String()

    class CountryPair(dy.Schema):
        a_country_code = dy.String(primary_key=True, foreign_key="country.country_code")  # reference Collection attribute
        b_country_code = dy.String(primary_key=True, foreign_key="country.country_code")  # reference Collection attribute
        distance = dy.Float64()

    class MyCollection(dy.Collection):
        country: dy.LazyFrame[Country]
        country_pair: dy.LazyFrame[CountryPair]

A foreign key relationship is inherently a cross-frame concern, making it a responsibility of Collection, not Schema.
Defining foreign key relationships at the Schema level would be problematic for several reasons; a Schema with a foreign key…
- must "know about" the Schema it's referencing (and/or a Collection).
- is no longer reusable (e.g., to compose separate Collections).
- cannot (by itself) validate the foreign key relationship it's declaring.
ORMs are designed for a fundamentally different domain than Dataframely:
- ORM table/model classes are used to define entire database schemas, whereas Collections are used to define and validate data contracts for modular/local data-pipeline functions.
- ORMs are designed for row-oriented OLTP, whereas Polars is designed for columnar OLAP.