RFC: Declarative Foreign Keys for Collections #309

@haydena7

Description

Summary

Polars supports arbitrary relational algebra, but Dataframely's Collection currently restricts cross-member relational structure to a single pattern: common primary key.

While workable for representing a single "semantic object which cannot be represented in a single data frame", the current design lacks native support for standard relational patterns (e.g., non-identifying relationships). Furthermore, because the approach is implicit and monolithic, validation of referential integrity — a fundamental, structural constraint — is delegated to imperative and opaque @dy.filter methods.

This RFC proposes generalizing the Collection abstraction to be explicitly relational by introducing declarative foreign keys, enabling Dataframely to make Polars data pipelines more robust and readable regardless of relational pattern.

Motivation

Currently, Collection derives cross-member relationships implicitly by computing a _common_primary_key, defined as the intersection of primary-key columns (matching by name) across member Schemas.

This approach has several material limitations. Consider the example from the user guide:

class Invoice(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    ...


class Diagnosis(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    diagnosis_code = dy.String(primary_key=True)
    ...


class HospitalClaims(dy.Collection):
    invoices: dy.LazyFrame[Invoice]
    diagnoses: dy.LazyFrame[Diagnosis]

    @dy.filter()
    def at_least_one_diagnosis_per_invoice(self) -> pl.LazyFrame:
        return self.invoices.join(
            self.diagnoses.select(pl.col("invoice_id").unique()),
            on="invoice_id",
            how="inner",
        )

Tightly-Coupled Schemas

In order to participate in a Collection, a Schema must forfeit independent control over its own column names and identity (i.e., primary key).

  • Column Names: Invoice and Diagnosis must coordinate column names (see invoice_id) so that HospitalClaims (the Collection) can "see" (i.e., implicitly derive) the relationship.
  • Identity (Primary Key): Diagnosis is forced to use a composite primary key that includes invoice_id, precluding the standard relational pattern where Diagnosis has its own independent identity (e.g., a surrogate Diagnosis.id key) and references Invoice.id via a Diagnosis.invoice_id foreign key. Currently, the "N-side" Schema in 1-N relationships must be modeled as a weak entity; non-identifying relationships are impossible.

Opaque, Monolithic Relationships

Constraints on inter-member relationships (like at_least_one_diagnosis_per_invoice) are currently expressed via @dy.filter validation rules. This design is problematic for several reasons:

  • Opaque: @dy.filter validation rules are effectively black boxes; the Collection does not "know about" the relationship constraints the rule expresses.
  • Imperative: The user must manually write the join (or remember to call a functional helper).
  • Violates Separation of Concerns: The "maximum cardinality" (1-1 vs 1-N) of this relationship is fundamentally determined by whether the Diagnosis.invoice_id foreign key has a unique constraint, which is a Schema (i.e., Diagnosis) concern.
  • Conflates Validation Failures: The symmetrical inner join fails to make an important semantic distinction: if a LHS row is dropped, there exists an Invoice that doesn't have any Diagnosis "children" (a domain rule failure), but if a RHS row is dropped, there exists a Diagnosis that references an Invoice that doesn't exist (a referential integrity failure). These distinct failures are both reported as failing the same monolithic validation rule.
  • Induces False-Positive Validation Failures: Given that the purpose of @dy.filter rules is (ostensibly) to define arbitrary cross-member domain rules, validating them requires that members first be "pruned" (i.e., filtered) of rows that violate their respective Schema constraints (validate() calls filter() in its implementation). When a structural relationship is expressed inside a @dy.filter rule, this destructive pruning phase induces false-positive validation failures (see Example below).

Example: "False Orphan" Validation Errors

If a row in invoices fails its Invoice.amount column's min_exclusive=0 constraint, it is removed.

But now any rows in diagnoses that reference the removed row (via Diagnosis.invoice_id) must be removed as well, as they reference a record that "doesn't exist".

Consequently, validate() reports that diagnoses failed validation w.r.t. the at_least_one_diagnosis_per_invoice rule (for each row that references the invalid row in invoices).

This is not useful information in a validation context:

  1. The rule name suggests a failure of an "at-least-one-child" rule w.r.t. invoices, whereas the actual "culprit" is a violation of referential integrity in the other direction.
  2. There is no violation of referential integrity! The "false orphans" are an artifact of the destructive filtering process.

The failure of a node (invoices) w.r.t. its internal invariants (~(Invoice.amount.col > 0)) is leaking, resulting in additional, false-positive relationship failures.
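The failure mode can be traced with a toy, pure-Python model of the filtering pipeline (illustrative only; the data and variable names below are not from dataframely):

```python
# Toy model of the "false orphan" failure mode: pruning an invalid invoice
# makes otherwise valid diagnoses look like they fail the cross-member rule.
invoices = [
    {"invoice_id": "A", "amount": 100.0},
    {"invoice_id": "B", "amount": -5.0},  # violates min_exclusive=0
]
diagnoses = [
    {"invoice_id": "A", "diagnosis_code": "X1"},
    {"invoice_id": "B", "diagnosis_code": "X2"},  # valid row, valid reference
]

# Phase 1: prune rows violating their Schema constraints.
valid_invoices = [row for row in invoices if row["amount"] > 0]
surviving_ids = {row["invoice_id"] for row in valid_invoices}

# Phase 2: the @dy.filter inner join now drops the "B" diagnosis, even though
# referential integrity held in the original data.
false_orphans = [row for row in diagnoses if row["invoice_id"] not in surviving_ids]
assert false_orphans == [{"invoice_id": "B", "diagnosis_code": "X2"}]
```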

Imperative Data Generation

As noted in the data generation guide, dataframely cannot automatically infer 1-N relationships because they are expressed via opaque @dy.filter methods. Therefore, the user must write low-level, imperative Python code (overriding _preprocess_sample()) to sample Collections such as HospitalClaims.

Proposed API

The proposed API is designed to declare and leverage relational structure and to respect the separation of concerns outlined in the current documentation:

  • Schema is responsible for invariants within frames.
  • Collection is responsible for invariants across frames.

Foreign Keys

This RFC proposes introducing a foreign_keys declarative class attribute and a dy.ForeignKey primitive.

class HospitalClaims(dy.Collection):
    invoices: dy.LazyFrame[Invoice]
    diagnoses: dy.LazyFrame[Diagnosis]

    foreign_keys = [
        dy.ForeignKey(Diagnosis.invoice_id, references=Invoice.id),
    ]

The following properties of a dy.ForeignKey relationship now emerge from properties of the foreign key Column(s) that define them (e.g., Diagnosis.invoice_id):

  • Maximum Cardinality (1-1 vs 1-N): Determined by the unique= property.
  • Minimum Cardinality (Optionality): Determined by the nullable= property.
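The mapping from FK-column properties to relationship kind can be summarized as follows (a hypothetical helper for illustration, not part of the proposed API):

```python
def relationship_kind(unique: bool, nullable: bool) -> str:
    """Classify a dy.ForeignKey relationship from its FK column's properties."""
    maximum = "1-1" if unique else "1-N"  # unique FK -> at most one child per parent
    minimum = "optional" if nullable else "mandatory"  # nullable FK -> child may not reference
    return f"{maximum} ({minimum})"

assert relationship_kind(unique=False, nullable=False) == "1-N (mandatory)"
assert relationship_kind(unique=True, nullable=True) == "1-1 (optional)"
```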

Whether each Invoice must have at least one Diagnosis "child" is actually a domain rule (i.e., not required for referential integrity). But this property can be elegantly declared via an optional parameter to dy.ForeignKey:

dy.ForeignKey(
    Diagnosis.invoice_id,
    references=Invoice.id,
    require_at_least_one_child=True,
)

Composite keys are naturally supported; the user simply passes a Sequence[Column].

class ForeignKey:
    def __init__(
        self,
        columns: Column | Sequence[Column],
        /,
        *,
        references: Column | Sequence[Column],
        require_at_least_one_child: bool = False,
    ): ...
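A sketch of how the constructor might normalize the positional argument, using plain strings as stand-ins for Column objects (hypothetical helper; not the proposed implementation):

```python
from collections.abc import Sequence

def as_column_tuple(value):
    """Normalize `Column | Sequence[Column]` into a tuple of columns.

    Strings are excluded from the Sequence branch because real Column
    objects, not names, are what the API accepts.
    """
    if isinstance(value, Sequence) and not isinstance(value, str):
        return tuple(value)
    return (value,)

assert as_column_tuple("col_a") == ("col_a",)                      # single column
assert as_column_tuple(["col_a", "col_b"]) == ("col_a", "col_b")   # composite key
```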

Note

There is Pythonic precedent for declarative class attributes (e.g., `__slots__`, dataclass fields, and SQLAlchemy's `__table_args__`).

Note

Self-referential relationships (e.g., Employee.manager_id references Employee.id) are out-of-scope for this API, as they constitute an intra-frame invariant and are therefore a Schema responsibility.

Unique Constraints

Given that the "maximum cardinality" (1-N vs 1-1) of a foreign key relationship is fundamentally determined by whether there is a unique constraint on the foreign key column(s), this RFC proposes making unique constraints a first-class, declarative property. This requires:

  • a unique: bool = False parameter for the Columns API
  • a composite_unique_constraints declarative class attribute for Schema

The latter enables Schemas to know about their composite unique constraints, which currently must be expressed via custom (i.e., opaque) @dy.rules. This is important given that foreign keys can be composite.
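The semantics of a composite unique constraint can be sketched in pure Python over toy row-dicts (illustrative; a real check would be a vectorized Polars expression):

```python
from collections import Counter

def composite_unique_violations(rows, cols):
    """Return rows whose combination of `cols` values appears more than once
    (pure-Python sketch of the proposed constraint's semantics)."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return [row for row in rows if counts[tuple(row[c] for c in cols)] > 1]

rows = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "y"},
    {"a": 1, "b": "x"},  # duplicate (a, b) pair
]
assert len(composite_unique_violations(rows, ["a", "b"])) == 2
```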

Cross-Member Rules

Explicit relational structure enables a declarative API for expressing domain rules across members connected by dy.ForeignKey relationships.

This RFC proposes a @dy.cross_member_rule API that abstracts away the construction of the relevant cross-member context.

class Order(dy.Schema):
	id = dy.String(primary_key=True)
	placed_at = dy.Datetime()
	...


class Shipment(dy.Schema):
	id = dy.String(primary_key=True)
	order_id = dy.String()
	dispatched_at = dy.Datetime()
	...


class Fulfillments(dy.Collection):
	orders: dy.LazyFrame[Order]
	shipments: dy.LazyFrame[Shipment]
	
	foreign_keys = [
		dy.ForeignKey(Shipment.order_id, references=Order.id)
	]
	
	@dy.cross_member_rule()
	def valid_dispatch_time(cls) -> pl.Expr:
		return Shipment.dispatched_at.col >= Order.placed_at.col

If member Schemas have overlapping column names, rule expression is unaffected; the chore of namespacing is abstracted away.

	...
		# fully-qualified names used under the hood
		return Shipment.timestamp.col >= Order.timestamp.col

When multiple paths (of dy.ForeignKey relationships) exist between the members involved in a @dy.cross_member_rule, then "path qualification" is required to define a rule unambiguously. The user must therefore provide a paths= argument to the decorator, which defines a mapping from "path aliases" to relevant paths. Within the rule body, the fluent Column.via("alias") method enables declarative path qualification.

class Account(dy.Schema):
	id = dy.String(primary_key=True)
	currency_code = dy.String()
	...


class Transfer(dy.Schema):
	id = dy.String(primary_key=True)
	source_account_id = dy.String()
	dest_account_id = dy.String()
	amount = dy.Decimal(min_exclusive=0)
	exchange_rate = dy.Decimal(min_exclusive=0)
	...


class Payments(dy.Collection):
	accounts: dy.LazyFrame[Account]
	transfers: dy.LazyFrame[Transfer]
	
	foreign_keys = [
		dy.ForeignKey(Transfer.source_account_id, references=Account.id),
		dy.ForeignKey(Transfer.dest_account_id, references=Account.id),
	]
	
	@dy.cross_member_rule(paths={
		"source": Transfer.source_account_id,
		"dest": Transfer.dest_account_id,
	})
	def valid_intra_currency_fx_rate(cls) -> pl.Expr:
		"""
		For same-currency transfers, the Foreign Exchange (FX) rate must
		be exactly `1.0`.
		"""
		is_same_currency = (
			Account.currency_code.via("source").col
			== Account.currency_code.via("dest").col
		)
		return ~is_same_currency | (Transfer.exchange_rate.col == 1.0)

The distinct "roles" that the Account.currency_code column plays within the cross-member context are defined by the distinct relational paths via which it is joined into that context.

Note

Progressive Disclosure: The user only needs to care about paths when the unambiguous expression of a rule requires it. If the user fails to provide a paths argument when required, Collection fails fast with a context-rich, actionable error message that calls out the specific topological ambiguity.

Note

No group_by= parameter is required; Polars window-function expressions perform aggregations on groups while preserving the grain of the context.

@dy.cross_member_rule()
def total_matches_lines(cls) -> pl.Expr:
	return (
		Invoice.total.col
		== InvoiceLine.amount.col.sum().over(Invoice.id.name)
	)

Note

Handles "multi-hop" paths

Because a path alias binds to the entire relational path, any member Column along a "multi-hop" route can be qualified using the exact same alias.

A dy.ForeignKeyPath primitive eliminates the risk of conflating a "single-hop" path defined by a composite foreign key (Sequence[dy.Column]) with a multi-hop path (a contiguous Sequence of foreign keys). In the standard "single-hop" case, the user can simply pass a dy.Column or Sequence[dy.Column].

EdgeDef: TypeAlias = dy.Column | Sequence[dy.Column]
PathDef: TypeAlias = EdgeDef | dy.ForeignKeyPath

Cross-Member Filters

A lower-level API similar to @dy.filter is still required for expressing cross-member invariants involving more dynamic types of relationships. Use cases include expressing conditional participation constraints and constraints on relationships determined by temporal proximity (via join_asof).

This RFC proposes a slightly modified version, @dy.cross_member_filter, which receives a collection as input and must return a dict mapping the name of each relevant (i.e., filtered) member to a data frame where:

  • the columns set contains the primary key of the member's Schema
  • the rows represent the valid subset (w.r.t. the rule) of the member's original rows

class Fulfillments(dy.Collection):
	orders: dy.LazyFrame[Order]  # has categorical `status` column
	shipments: dy.LazyFrame[Shipment]
	
	foreign_keys = [
		dy.ForeignKey(Shipment.order_id, references=Order.id)
	]
	
	# ... cross-member rules ...
	
	@dy.cross_member_filter()
	def shipped_orders_have_shipments(self) -> dict[str, pl.LazyFrame]:
		shipped_orders = self.orders.filter(
			Order.status.col.is_in(["SHIPPED", "DELIVERED"])
		)
		invalid_orders = shipped_orders.join(
			self.shipments,
			left_on=Order.id.name,
			right_on=Shipment.order_id.name,
			how="anti",
		)
		valid_orders = self.orders.join(
			invalid_orders,
			on=Order.id.name,
			how="anti",
		)
		return {"orders": valid_orders}

Declarative Data Generation

Explicit relational structure enables declarative data generation for Collections with 1-N dy.ForeignKey relationships.

HospitalClaims.sample(counts={
	"invoices": 10,
	"diagnoses": dy.random.Poisson(lam=1),
})

Lightweight distribution objects from dy.random can be used to control the "cardinality behavior" of 1-N dy.ForeignKey relationships.

@dataclass
class Poisson:
    lam: float


@dataclass
class Uniform:
    upper_bound: int
    lower_bound: int = 1

Alternatively, the user can pass:

  • an int, which specifies a deterministic child count.
  • nothing, in which case sample() uses a sensible default (e.g., dy.Uniform(upper_bound=3)).
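A pure-Python sketch of how sample() might resolve a counts entry into per-parent child counts (helper names and the Knuth-style Poisson sampler are illustrative; a real implementation would be vectorized):

```python
import math
import random
from dataclasses import dataclass

@dataclass
class Poisson:
    lam: float

@dataclass
class Uniform:
    upper_bound: int
    lower_bound: int = 1

def poisson_draw(lam: float, rng: random.Random) -> int:
    # Knuth's multiplicative sampler (fine for small lambda).
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def child_counts(spec, n_parents: int, rng: random.Random) -> list[int]:
    """Resolve one `counts` entry into a child count per parent row."""
    if spec is None:
        spec = Uniform(upper_bound=3)  # assumed sensible default
    if isinstance(spec, int):
        return [spec] * n_parents      # deterministic child count
    if isinstance(spec, Uniform):
        return [rng.randint(spec.lower_bound, spec.upper_bound) for _ in range(n_parents)]
    if isinstance(spec, Poisson):
        return [poisson_draw(spec.lam, rng) for _ in range(n_parents)]
    raise ValueError(f"incoherent counts entry: {spec!r}")

rng = random.Random(0)
assert child_counts(4, 3, rng) == [4, 4, 4]
assert all(1 <= c <= 3 for c in child_counts(None, 100, rng))
```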

When a member is the "child" of multiple dy.ForeignKey relationships, then the user must specify which will act as the "driver" of that child's row count. This is done via the optional count_from parameter, which maps member names to foreign-key column(s).

Payments.sample(
    counts={
        "accounts": 10,
        "transfers": dy.random.Poisson(lam=5),
    },
    count_from={"transfers": Transfer.source_account_id},
)

If the user passes configuration to counts that is incoherent with respect to the defined structural properties of the Collection, sample() raises a ValueError detailing the structural mismatch. "Minimum cardinality" (optionality) constraints are enforced automatically under the hood.

Design Details

The declarative foreign keys API defines a DAG where member frames are nodes and dy.ForeignKey relationships are directed edges.

Collection Definition

Integrity Checks

Collection fails fast at class-creation time (raising an ImplementationError) if any of the following properties do not hold:

  • Valid Columns: all Columns passed to dy.ForeignKey belong to registered members (i.e., their Schemas)
  • Valid Foreign Keys: all dy.ForeignKey declarations reference Columns that are the primary key of their respective Schemas (or are otherwise declared as unique)
  • Unique Schemas: Schemas are unique across members (see callout below)

Note

Why enforce unique Schemas across members?

Given that Collections are used to define and validate multi-frame data contracts for data-pipeline functions, wrapping a Collection around fragmented pieces of the exact same entity (e.g., defining historic_invoices and current_invoices as separate members) is arguably a fundamental misuse of the abstraction, reflecting either a deferred concat operation or an attempt to use the name of a member to encode a dimension.

By enforcing this constraint, we are rewarded with a strictly-typed API; the user can pass actual Columns (i.e., Schema attributes) to dy.ForeignKey, and Collection (behind the scenes) can unambiguously resolve which members they belong to. The user enjoys full editor support (e.g., refactor safety).

(If a user genuinely needs multiple members with the exact same Schema, they can do so by subclassing, e.g., class HistoricInvoice(Invoice): pass.)

Foreign Keys Registration

The foreign_keys are parsed into a directed graph and topologically sorted (failing fast if a cycle is detected). This pre-computed topological metadata powers the filter() and sample() methods, which must execute in the correct dependency order.
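The registration step can be sketched with Kahn's algorithm over member names (illustrative; the edge representation is an assumption, not the proposed internal format):

```python
from collections import deque

def topological_order(members, fk_edges):
    """Kahn's algorithm over the foreign-key DAG (sketch). `fk_edges` maps
    each referencing ("child") member to the members it references; parents
    always sort before their children. Fails fast on cycles."""
    indegree = {m: 0 for m in members}
    children = {m: [] for m in members}
    for child, parents in fk_edges.items():
        for parent in parents:
            children[parent].append(child)
            indegree[child] += 1
    queue = deque(m for m in members if indegree[m] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(members):
        raise ValueError("cycle detected in foreign_keys")
    return order

# HospitalClaims: diagnoses references invoices, so invoices sorts first.
assert topological_order(
    ["diagnoses", "invoices"], {"diagnoses": ["invoices"]}
) == ["invoices", "diagnoses"]
```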

Cross-Member Rule Resolution

At class-creation time, the metaclass extracts the polars.Exprs from the @dy.cross_member_rule functions. It performs this extraction within a strictly-scoped ambient context (via Python's contextvars.ContextVar). This ephemeral state toggles the behavior of Column's .col and .name properties such that they perform:

  • Source Qualification: Emit "source-qualified" column expressions and names (formatted as "{SchemaName}.{column_name}").
  • Member Discovery: Register the column's "owner" (i.e., Schema) into the "trace" context upon property access.

A context manager guarantees that this behavior is hermetically sealed within the @dy.cross_member_rule registration process.
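The mechanism can be sketched with a ContextVar-backed trace (a minimal model of the idea; class and function names are illustrative, not dataframely's internals):

```python
from contextvars import ContextVar

# Ephemeral "trace" context, active only during rule extraction.
_trace: ContextVar = ContextVar("cross_member_trace", default=None)

class Column:
    def __set_name__(self, owner, name):
        self._owner, self._name = owner, name

    @property
    def name(self):
        trace = _trace.get()
        if trace is not None:                              # inside extraction
            trace.add(self._owner.__name__)                # member discovery
            return f"{self._owner.__name__}.{self._name}"  # source qualification
        return self._name                                  # normal behavior

def extract(rule):
    """Run a rule body with the ambient trace enabled, hermetically sealed."""
    token = _trace.set(set())
    try:
        return rule(), _trace.get()
    finally:
        _trace.reset(token)

class Order:
    placed_at = Column()

class Shipment:
    dispatched_at = Column()

qualified, members = extract(
    lambda: (Shipment.dispatched_at.name, Order.placed_at.name)
)
assert qualified == ("Shipment.dispatched_at", "Order.placed_at")
assert members == {"Shipment", "Order"}
assert Order.placed_at.name == "placed_at"  # behavior restored outside extraction
```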

The metaclass then uses the discovered members and the foreign_keys DAG to resolve the (how="inner") join plans. Treating the graph as undirected, it seeks the minimum connecting tree for the required members. If no such tree exists, or if multiple exist and the paths argument to the decorator is not provided, it raises a descriptive error. The resolved join plans are saved as metadata in the Collection alongside the extracted, qualified rule expressions.

Note

Safety: The @dy.cross_member_rule decorator (like @dy.rule) is a registration mechanism, used to bind a rule name to a static polars.Expr. The Collection completely owns the execution lifecycle of the decorated "expression factory" function.

Validation

In order for Collection.validate() to deliver precise diagnostic information, it must validate orthogonal constraints independently.

Structural Constraints

Validation of structural dy.ForeignKey constraints requires only that member rows are type-safe w.r.t. their Schema; full domain validity (as enforced by Schema.filter()) is an orthogonal concern. The following constraints are therefore validated for all type-safe member rows (which are isolated via Schema's DtypeCastRule mechanism).

  • Referential Integrity: For each dy.ForeignKey in foreign_keys, perform an anti join from the referencing member to the referenced member; record any surviving rows as ReferentialIntegrityFailures.
  • At-Least-One-Child: For each dy.ForeignKey in foreign_keys where require_at_least_one_child=True, perform an anti join from the referenced ("parent") member to the referencing ("child") member; record any surviving rows as AtLeastOneChildFailures.
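Both structural checks reduce to anti joins; here is a pure-Python sketch over toy rows (the real implementation would use Polars `join(how="anti")`):

```python
invoices = [{"id": "A"}, {"id": "B"}]
diagnoses = [
    {"invoice_id": "A", "code": "X1"},
    {"invoice_id": "Z", "code": "X2"},  # dangling reference
]

parent_keys = {row["id"] for row in invoices}
child_keys = {row["invoice_id"] for row in diagnoses}

# Referential integrity: child rows whose FK matches no parent key.
referential_failures = [r for r in diagnoses if r["invoice_id"] not in parent_keys]
# At-least-one-child: parent rows with no referencing child.
at_least_one_child_failures = [r for r in invoices if r["id"] not in child_keys]

assert referential_failures == [{"invoice_id": "Z", "code": "X2"}]
assert at_least_one_child_failures == [{"id": "B"}]
```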

Note

Validation of "maximum cardinality" (1-N vs 1-1) is now a Schema responsibility.

Domain Constraints

Validation of cross-member domain constraints (rules & filters) requires that member rows fully respect their Schema constraints (to prevent Polars compute errors and ensure coherent domain logic). The following constraints are therefore validated for member rows that survive Schema.filter().

  • Cross-Member Rules: Perform the resolved inner join(s) and filter the resulting cross-member context by the negation of the registered rule expression; record any surviving rows as CrossMemberRuleFailures.
  • Cross-Member Filters: For each filtered member frame returned by the user's function, perform an anti join from the corresponding "baseline" version to the filtered one; record any surviving rows as CrossMemberFilterFailures. The "baseline" version has been pruned of rows that violate Schema constraints (via Schema.filter()) as well as rows that (potentially as a consequence) violate structural dy.ForeignKey constraints, which prevents "double reporting" of these failures. ("Cascade" pruning for structural constraints is discussed in the Filtering section below.)

Filtering

Once member rows that violate either their respective Schema constraints or any cross-member domain constraints (rules & filters) have been filtered out, a two-phase, topologically-aware "filter cascade" guarantees integrity w.r.t. structural dy.ForeignKey relationship constraints:

  1. Downward Sweep (Referential Integrity): Traversing the DAG in topological order, perform semi joins from referenced ("parent") frames to the referencing ("child") frames, removing any child rows that were (or have become) orphaned.
  2. Upward Sweep (At-Least-One-Child): Reversing the traversal (for relationships where require_at_least_one_child=True), perform semi joins from referencing ("child") frames to the referenced ("parent") frames, removing any parent rows that were (or have become) childless.

The entire process compiles to a single lazy query plan. The ignored_in_filters and propagate_row_failures parameters for CollectionMembers are no longer required.
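The two sweeps can be modeled in pure Python over toy frames (the tuple encoding of foreign keys below is illustrative, not the proposed API):

```python
def filter_cascade(frames, fks, topo_order):
    """Two-phase cascade sketch on row-dicts. `fks` is a list of
    (child, fk_col, parent, pk_col, require_at_least_one_child) tuples."""
    # Downward sweep: drop children that were (or have become) orphaned.
    for child, fk, parent, pk, _ in sorted(fks, key=lambda f: topo_order.index(f[0])):
        parent_keys = {r[pk] for r in frames[parent]}
        frames[child] = [r for r in frames[child] if r[fk] in parent_keys]
    # Upward sweep: drop parents that were (or have become) childless.
    for child, fk, parent, pk, required in sorted(
        fks, key=lambda f: topo_order.index(f[0]), reverse=True
    ):
        if required:
            child_keys = {r[fk] for r in frames[child]}
            frames[parent] = [r for r in frames[parent] if r[pk] in child_keys]
    return frames

frames = {
    "invoices": [{"id": "A"}, {"id": "B"}],
    "diagnoses": [{"invoice_id": "A"}, {"invoice_id": "Z"}],  # "Z" is orphaned
}
fks = [("diagnoses", "invoice_id", "invoices", "id", True)]
out = filter_cascade(frames, fks, ["invoices", "diagnoses"])
assert out["diagnoses"] == [{"invoice_id": "A"}]  # orphan removed
assert out["invoices"] == [{"id": "A"}]           # "B" now childless -> removed
```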

Data Generation

Collection.sample() traverses the DAG of dy.ForeignKey relationships in topological order to automate dependent data generation using vectorized Polars.

First, "root" member frames are generated independently. Then, the passed counts configuration is used to assign a child count to each parent row (which is clipped to respect structural cardinality bounds). The parent's primary key values are then propagated to the child's foreign key via .repeat_by(child_count).explode().

When a member is the "child" of multiple dy.ForeignKey relationships, the count_from parameter dictates which parent acts as the "driver" for the row expansion. "Non-driving" foreign keys are resolved by randomly sampling from the primary keys of their respective parent frames.
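The expansion step has a simple pure-Python analogue of `.repeat_by(child_count).explode()`, including resolution of "non-driving" foreign keys (illustrative helper; the real implementation would be vectorized Polars):

```python
import random

def expand_children(parent_keys, counts, other_parents, rng):
    """Propagate each driving parent's key to `count` child rows, then fill
    non-driving FKs by randomly sampling from their parents' primary keys."""
    rows = []
    for pk, n in zip(parent_keys, counts):
        for _ in range(n):                        # repeat_by + explode
            row = {"driver_fk": pk}
            for name, keys in other_parents.items():
                row[name] = rng.choice(keys)      # non-driving FK
            rows.append(row)
    return rows

rng = random.Random(42)
rows = expand_children(["A", "B"], [2, 1], {"dest_account_id": ["P", "Q"]}, rng)
assert [r["driver_fk"] for r in rows] == ["A", "A", "B"]
assert all(r["dest_account_id"] in {"P", "Q"} for r in rows)
```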

Self-Aware Columns

To allow passing Schema attributes (i.e., Columns) directly to dy.ForeignKey, Column instances must know their parent Schema and name. We can achieve this by implementing __set_name__(self, owner, name) via the Descriptor Protocol (PEP 487). SchemaMeta._get_metadata_recursively will be updated to shallow-copy inherited columns so __set_name__ does not overwrite the owner on shared column instances.
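A minimal sketch of the descriptor mechanics and the inherited-column shallow copy (the metaclass here is illustrative; the real SchemaMeta differs):

```python
import copy

class Column:
    def __set_name__(self, owner, name):
        # PEP 487: invoked automatically when the owning class body executes.
        self._owner, self._name = owner, name

    @property
    def name(self):
        return self._name

    @property
    def owner(self):
        return self._owner

class SchemaMeta(type):
    def __new__(mcs, clsname, bases, namespace):
        # Shallow-copy inherited Columns into the subclass namespace so that
        # __set_name__ records the subclass as owner on a fresh instance
        # instead of clobbering the owner on the base class's shared instance.
        for base in bases:
            for attr, value in vars(base).items():
                if isinstance(value, Column) and attr not in namespace:
                    namespace[attr] = copy.copy(value)
        return super().__new__(mcs, clsname, bases, namespace)

class Invoice(metaclass=SchemaMeta):
    id = Column()

class HistoricInvoice(Invoice):
    pass

assert Invoice.id.owner is Invoice and Invoice.id.name == "id"
assert HistoricInvoice.id.owner is HistoricInvoice
assert HistoricInvoice.id is not Invoice.id  # no shared instance
```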

Alternatives Considered

#295 also identifies the lack of support for foreign key relationships. However, this RFC argues that the solution proposed there (an ORM-style approach) is not the right fit for Dataframely.

# Example from #295

class Country(dy.Schema):
    country_code = dy.String(primary_key=True)
    capital = dy.String()

class CountryPair(dy.Schema):
    a_country_code = dy.String(primary_key=True, foreign_key="country.country_code") # reference Collection attribute
    b_country_code = dy.String(primary_key=True, foreign_key="country.country_code") # reference Collection attribute
    distance = dy.Float64()

class MyCollection(dy.Collection):
    country: dy.LazyFrame[Country]
    country_pair: dy.LazyFrame[CountryPair]

A foreign key relationship is inherently a cross-frame concern, making it a responsibility of Collection, not Schema.

Defining foreign key relationships at the Schema level would be problematic for several reasons; a Schema with a foreign key…

  • must "know about" the Schema it's referencing (and/or a Collection).
  • is no longer reusable (e.g., to compose separate Collections).
  • cannot (by itself) validate the foreign key relationship it's declaring.

ORMs are designed for a fundamentally different domain than Dataframely:

  • ORM table/model classes are used to define entire database schemas, whereas Collections are used to define and validate data contracts for modular/local data-pipeline functions.
  • ORMs are designed for row-oriented OLTP, whereas Polars is designed for columnar OLAP.
