Classify dataset vs scalar inputs in DAG Schedule #573

Draft
javihern98 wants to merge 11 commits into main from cr-571

@javihern98
Contributor

Summary

Closes #571

Splits global_inputs into four categories in Schedule (renamed from DatasetSchedule), with global_inputs as their union:

| Field | When |
| --- | --- |
| global_input_datasets | Definite datasets (dataset-only operators, joins, aggregations, identifiers) |
| global_input_scalars | Definite scalars (scalar chain propagation, UDO scalar params, scalar-constrained params) |
| global_input_dataset_or_scalar | Ambiguous top-level VarID (e.g., DS_r := X + 2) |
| global_input_component_or_scalar | Inside clause context (e.g., DS_1[calc Me_2 := Me_1 + X]) |
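
To illustrate the shape described in the table, here is a minimal sketch of how the four categories and their union could fit together. The class below is a hypothetical stand-in, not the actual Schedule from this PR: the list field types, the property, and the order-preserving de-duplication are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Schedule:
    """Hypothetical stand-in; only the four input categories are modelled."""

    global_input_datasets: list[str] = field(default_factory=list)
    global_input_scalars: list[str] = field(default_factory=list)
    global_input_dataset_or_scalar: list[str] = field(default_factory=list)
    global_input_component_or_scalar: list[str] = field(default_factory=list)

    @property
    def global_inputs(self) -> list[str]:
        # Union of the four categories, de-duplicated in first-seen order
        # (the ordering is an assumption of this sketch).
        merged = (
            self.global_input_datasets
            + self.global_input_scalars
            + self.global_input_dataset_or_scalar
            + self.global_input_component_or_scalar
        )
        return list(dict.fromkeys(merged))


schedule = Schedule(
    global_input_datasets=["DS_1"],
    global_input_scalars=["sc_limit"],
    global_input_dataset_or_scalar=["X"],
)
print(schedule.global_inputs)
```

A caller that only needs to load data could then iterate over global_input_datasets and treat the ambiguous categories separately, while existing callers keep reading global_inputs.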

Key changes:

  • StatementDeps: Added has_dataset_op, dataset_inputs, scalar_inputs fields for per-statement tracking
  • DAGAnalyzer: 21 visitor methods classify operands by operator semantics (dataset-only, dual, scalar-constrained)
  • Scalar propagation: Fixed-point algorithm seeds from constants and _resolved_from_unknown, propagates through pure-scalar chains
  • Dual-context fix: Variables in both dataset_inputs and scalar_inputs across statements → dataset_or_scalar
  • Consumers updated: _validate_extra_datasets, _extract_input_datasets, _save_datapoints_efficient
  • Operator coverage: All VTL 2.1 operators classified (dataset-only constants, scalar-constrained params for round/trunc/substr/replace/instr/between/random)
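
The scalar propagation bullet above can be sketched as a small fixed-point loop. This is a simplified illustration under stated assumptions, not the code in this PR: Stmt and propagate_scalars are hypothetical names, and the seeds argument stands in for names already known to be scalar (for example those tracked in _resolved_from_unknown).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """Hypothetical per-statement record: one output and the names it reads."""

    output: str
    inputs: frozenset[str]
    has_dataset_op: bool = False


def propagate_scalars(statements: list[Stmt], seeds: set[str]) -> set[str]:
    """Return every name that can be proven scalar, iterating to a fixed point."""
    scalars = set(seeds)
    changed = True
    while changed:
        changed = False
        for stmt in statements:
            if stmt.output in scalars or stmt.has_dataset_op:
                continue
            # A statement with no dataset operator whose inputs are all known
            # scalars (or that has no inputs, i.e. a constant) yields a scalar.
            if stmt.inputs <= scalars:
                scalars.add(stmt.output)
                changed = True
    return scalars


statements = [
    Stmt("c", frozenset({"b"})),                                  # c := b * 2
    Stmt("b", frozenset({"a"})),                                  # b := a + 1
    Stmt("a", frozenset()),                                       # a := 1
    Stmt("DS_r", frozenset({"DS_1", "c"}), has_dataset_op=True),  # dataset statement
    Stmt("d", frozenset({"X"})),                                  # X is unknown
]
result = propagate_scalars(statements, seeds=set())
```

The loop repeats until a pass adds nothing, so the statement order does not matter; here a, b, and c are resolved over successive passes, while d stays unresolved because X was never proven scalar.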

Checklist

  • Code quality checks pass (ruff format, ruff check, mypy)
  • Tests pass (pytest)
  • Documentation updated (if applicable)

Impact / Risk

  • Breaking changes? DatasetSchedule renamed to Schedule; new fields added (additive). global_inputs behavior unchanged (still the union of all categories).
  • Data/SDMX compatibility concerns? None — DAG analysis only, no runtime behavior change.
  • Notes for release/changelog? New Schedule fields enable precise dataset vs scalar discrimination for callers.

Notes

  • 207 DAG tests: 186 classification + 13 scheduling + 8 topological sort
  • Classification tests cover all VTL 2.1 operators plus 15 edge cases (dual-context variables, scalar chain propagation, unknown variable resolution, nested operators, deeply nested expressions, if-then-else with dataset-only branches)

… propagation

Track dataset operations per statement (has_dataset_op flag) and use
fixed-point propagation to identify scalar outputs. Classify global
inputs into: global_input_datasets, global_input_scalars,
global_input_dataset_or_scalar, and global_input_component_or_scalar.

Add 6 new DAG test cases (11-16) covering scalar chains, RegularAggregation
ambiguity, top-level ambiguity, and mixed contexts.

Set _current_has_dataset_op in visit_ParamOp, visit_UDOCall, and
visit_BinOp (MEMBERSHIP) so UDO dataset parameters and membership
operands are correctly classified as global_input_datasets.

Extend DAG analysis to discriminate between datasets, scalars,
dataset-or-scalar, and component-or-scalar global inputs. Each
dataset-only operator now explicitly marks its direct VarID operands
as dataset_inputs, so sub-expression operands (e.g., SC_1 in
count(DS_1 + SC_1)) are correctly classified as ambiguous rather
than forced to datasets.
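
The direct-operand rule described above can be illustrated with a toy AST. The node classes and the classify function are hypothetical and far simpler than the real DAGAnalyzer visitors; the sketch assumes that only a VarID appearing directly under a dataset-only operator is marked as a dataset.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass
class VarID:
    name: str


@dataclass
class BinOp:
    op: str
    left: Node
    right: Node


@dataclass
class Aggregation:
    """Stands in for a dataset-only operator such as count."""

    op: str
    operand: Node


Node = Union[VarID, BinOp, Aggregation]


@dataclass
class Deps:
    dataset_inputs: set[str] = field(default_factory=set)
    ambiguous_inputs: set[str] = field(default_factory=set)


def classify(node: Node, deps: Deps) -> Deps:
    if isinstance(node, VarID):
        # A bare variable reached through a dual operator stays ambiguous.
        deps.ambiguous_inputs.add(node.name)
    elif isinstance(node, BinOp):
        classify(node.left, deps)
        classify(node.right, deps)
    elif isinstance(node, Aggregation):
        if isinstance(node.operand, VarID):
            # Only a direct VarID operand is forced to be a dataset.
            deps.dataset_inputs.add(node.operand.name)
        else:
            classify(node.operand, deps)
    return deps


direct = classify(Aggregation("count", VarID("DS_1")), Deps())
nested = classify(
    Aggregation("count", BinOp("+", VarID("DS_1"), VarID("SC_1"))), Deps()
)
```

In this sketch count(DS_1) marks DS_1 as a dataset, whereas count(DS_1 + SC_1) leaves both DS_1 and SC_1 ambiguous, since neither is a direct operand of count.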

Split test_dag.py into test_classification.py (154 cases) and
test_scheduling.py (13 non-trivial cases), inline VTL for simple
tests, and delete unused reference/VTL files.

- Add visit_EvalOp to mark eval operands as dataset_inputs
- Classify scalar-only parameters: random index, round/trunc/substr/
  replace/instr params, between from/to bounds
- Add tests 161-171 covering eval, random, and scalar-constrained params
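
One way to picture the scalar-only parameters listed above is a lookup from operator name to the operand positions that can only hold scalars. The table and helper below are hypothetical; the positions follow the usual VTL signatures (for example round(op, numDigit) and between(op, from, to)) and are an assumption of this sketch, not taken from the PR code.

```python
from __future__ import annotations

# Zero-based operand positions assumed to accept scalars only.
SCALAR_PARAM_POSITIONS: dict[str, set[int]] = {
    "round": {1},
    "trunc": {1},
    "substr": {1, 2},
    "replace": {1, 2},
    "instr": {1, 2, 3},
    "between": {1, 2},
    "random": {1},
}


def split_operands(op: str, operands: list[str]) -> tuple[list[str], list[str]]:
    """Split operand names into (dual-context, scalar-only) by position."""
    scalar_positions = SCALAR_PARAM_POSITIONS.get(op, set())
    dual = [n for i, n in enumerate(operands) if i not in scalar_positions]
    scalars = [n for i, n in enumerate(operands) if i in scalar_positions]
    return dual, scalars


dual, scalars = split_operands("between", ["DS_1", "lo", "hi"])
```

Here the first operand of between stays in the dual-context bucket, while the from and to bounds are reported as scalar-only.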
…ests

Fix bug where a variable appearing in both dataset_inputs and
scalar_inputs across statements was classified as dataset instead of
dataset_or_scalar. Add 15 edge case tests (172-186) covering scalar
chain propagation, unknown variable resolution, nested operators,
deeply nested expressions, and if-then-else with dataset-only branches.
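
The dual-context rule from this fix can be sketched as a set computation over per-statement results. The function name and return shape are hypothetical; the sketch only assumes that each statement contributes a dataset_inputs set and a scalar_inputs set, as described for StatementDeps.

```python
from __future__ import annotations


def split_global_inputs(
    statements: list[tuple[set[str], set[str]]],
) -> dict[str, set[str]]:
    """Each tuple is (dataset_inputs, scalar_inputs) for one statement."""
    seen_as_dataset: set[str] = set()
    seen_as_scalar: set[str] = set()
    for dataset_inputs, scalar_inputs in statements:
        seen_as_dataset |= dataset_inputs
        seen_as_scalar |= scalar_inputs

    # A name used as a dataset in one statement and as a scalar in another
    # cannot be decided statically, so it goes to the ambiguous bucket.
    both = seen_as_dataset & seen_as_scalar
    return {
        "datasets": seen_as_dataset - both,
        "scalars": seen_as_scalar - both,
        "dataset_or_scalar": both,
    }


classified = split_global_inputs(
    [
        ({"DS_1", "X"}, set()),   # statement 1 uses X in a dataset position
        (set(), {"X", "sc_1"}),   # statement 2 uses X in a scalar position
    ]
)
```

With these inputs X ends up in dataset_or_scalar rather than datasets, which is the behaviour the fix describes.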
@javihern98
Contributor Author

This one is postponed until further notice, pending clarification of its possible impact, of how we could benefit from these performance advantages, and of whether it could also be used in the DuckDB process. Scheduled for review again in the last days of April.

Development

Successfully merging this pull request may close these issues.

Discriminate between datasets and scalars in DatasetSchedule
