Classify dataset vs scalar inputs in DAG Schedule by javihern98 · Pull Request #573 · Meaningful-Data/vtlengine

javihern98 · 2026-03-05T18:16:33Z

Summary

Closes #571

Splits global_inputs into four categories in Schedule (renamed from DatasetSchedule), with global_inputs as their union:

| Field | When |
| --- | --- |
| `global_input_datasets` | Definite datasets (dataset-only operators, joins, aggregations, identifiers) |
| `global_input_scalars` | Definite scalars (scalar chain propagation, UDO scalar params, scalar-constrained params) |
| `global_input_dataset_or_scalar` | Ambiguous top-level VarID (e.g., `DS_r := X + 2`) |
| `global_input_component_or_scalar` | Inside clause context (e.g., `DS_1[calc Me_2 := Me_1 + X]`) |
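As a rough illustration of the relationship between the four categories and `global_inputs`, the sketch below models them as sets and derives the union. The class name `ScheduleSketch`, the use of `Set[str]`, and the choice of a computed property are assumptions for illustration only; the real `Schedule` class may store these fields differently.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ScheduleSketch:
    """Hypothetical stand-in for Schedule; only the four field names come from the PR."""

    global_input_datasets: Set[str] = field(default_factory=set)
    global_input_scalars: Set[str] = field(default_factory=set)
    global_input_dataset_or_scalar: Set[str] = field(default_factory=set)
    global_input_component_or_scalar: Set[str] = field(default_factory=set)

    @property
    def global_inputs(self) -> Set[str]:
        # Assumption: modelled here as a computed union of the four categories.
        return (
            self.global_input_datasets
            | self.global_input_scalars
            | self.global_input_dataset_or_scalar
            | self.global_input_component_or_scalar
        )


# DS_r := X + 2  -> X is ambiguous at top level
# DS_1[calc Me_2 := Me_1 + Y] -> DS_1 is a dataset, Y is component-or-scalar
sketch = ScheduleSketch(
    global_input_datasets={"DS_1"},
    global_input_dataset_or_scalar={"X"},
    global_input_component_or_scalar={"Y"},
)
print(sorted(sketch.global_inputs))
```

A caller that only needs to know which inputs must be loaded as datasets could then read `global_input_datasets` directly and treat the two ambiguous categories conservatively.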

Key changes:

- StatementDeps: Added has_dataset_op, dataset_inputs, scalar_inputs fields for per-statement tracking
- DAGAnalyzer: 21 visitor methods classify operands by operator semantics (dataset-only, dual, scalar-constrained)
- Scalar propagation: Fixed-point algorithm seeds from constants and _resolved_from_unknown, propagates through pure-scalar chains
- Dual-context fix: Variables in both dataset_inputs and scalar_inputs across statements → dataset_or_scalar
- Consumers updated: _validate_extra_datasets, _extract_input_datasets, _save_datapoints_efficient
- Operator coverage: All VTL 2.1 operators classified (dataset-only constants, scalar-constrained params for round/trunc/substr/replace/instr/between/random)
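The scalar propagation step can be pictured as a small fixed-point loop. The sketch below is not the PR's implementation: `StmtSketch`, `propagate_scalars`, and the `seeds` argument are hypothetical names, and the only ideas taken from the description are the `has_dataset_op` flag, seeding from constants and already-resolved scalars, and propagation through pure-scalar chains.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class StmtSketch:
    """Hypothetical per-statement record (loosely modelled on StatementDeps)."""

    output: str
    inputs: Set[str] = field(default_factory=set)
    has_dataset_op: bool = False


def propagate_scalars(statements: List[StmtSketch], seeds: Iterable[str]) -> Set[str]:
    """Return every name known to be scalar once no further changes occur."""
    scalars = set(seeds)
    changed = True
    while changed:
        changed = False
        for stmt in statements:
            if stmt.output in scalars or stmt.has_dataset_op:
                continue
            # A statement with no inputs is a constant; one whose inputs are
            # all known scalars extends a pure-scalar chain.
            if stmt.inputs <= scalars:
                scalars.add(stmt.output)
                changed = True
    return scalars


# Statements listed in reverse dependency order, so several passes are needed.
statements = [
    StmtSketch("c", {"b", "sc"}),                     # c := b * sc
    StmtSketch("b", {"a"}),                           # b := a + 2
    StmtSketch("a"),                                  # a := 1
    StmtSketch("d", {"DS_1"}, has_dataset_op=True),   # d := count(DS_1)
]
scalar_names = propagate_scalars(statements, seeds={"sc"})
print(sorted(scalar_names))
```

The loop terminates because each pass either adds at least one name to a finite set or stops.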

Checklist

Code quality checks pass (ruff format, ruff check, mypy)
Tests pass (pytest)
Documentation updated (if applicable)

Impact / Risk

Breaking changes? DatasetSchedule renamed to Schedule; new fields added (additive). global_inputs behavior unchanged (still the union of all categories).
Data/SDMX compatibility concerns? None — DAG analysis only, no runtime behavior change.
Notes for release/changelog? New Schedule fields enable precise dataset vs scalar discrimination for callers.

Notes

207 DAG tests: 186 classification + 13 scheduling + 8 topological sort
Classification tests cover all VTL 2.1 operators plus 15 edge cases (dual-context variables, scalar chain propagation, unknown variable resolution, nested operators, deeply nested expressions, if-then-else with dataset-only branches)

… propagation

Track dataset operations per statement (has_dataset_op flag) and use fixed-point propagation to identify scalar outputs. Classify global inputs into: global_input_datasets, global_input_scalars, global_input_dataset_or_scalar, and global_input_component_or_scalar. Add 6 new DAG test cases (11-16) covering scalar chains, RegularAggregation ambiguity, top-level ambiguity, and mixed contexts.

Extend DAG analysis to discriminate between datasets, scalars, dataset-or-scalar, and component-or-scalar global inputs. Each dataset-only operator now explicitly marks its direct VarID operands as dataset_inputs, so sub-expression operands (e.g., SC_1 in count(DS_1 + SC_1)) are correctly classified as ambiguous rather than forced to datasets. Split test_dag.py into test_classification.py (154 cases) and test_scheduling.py (13 non-trivial cases), inline VTL for simple tests, and delete unused reference/VTL files.
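The difference between a direct operand and a sub-expression operand of a dataset-only operator can be shown with a toy AST. Everything here is an assumption made for illustration: the node classes, the `classify` function, and the choice to mark every VarID inside a sub-expression as ambiguous are not taken from the vtlengine code.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass
class VarID:
    name: str


@dataclass
class BinOp:
    op: str
    left: "Node"
    right: "Node"


@dataclass
class Aggregation:
    """Stands in for a dataset-only operator such as count."""

    op: str
    operand: "Node"


Node = Union[VarID, BinOp, Aggregation]


def classify(node: Node, dataset_inputs: Set[str], ambiguous: Set[str]) -> None:
    if isinstance(node, VarID):
        # A bare VarID reached through a dual operator stays ambiguous.
        ambiguous.add(node.name)
    elif isinstance(node, BinOp):
        classify(node.left, dataset_inputs, ambiguous)
        classify(node.right, dataset_inputs, ambiguous)
    elif isinstance(node, Aggregation):
        if isinstance(node.operand, VarID):
            # Only a direct VarID operand is forced to be a dataset.
            dataset_inputs.add(node.operand.name)
        else:
            classify(node.operand, dataset_inputs, ambiguous)


direct_ds: Set[str] = set()
direct_amb: Set[str] = set()
classify(Aggregation("count", VarID("DS_1")), direct_ds, direct_amb)

nested_ds: Set[str] = set()
nested_amb: Set[str] = set()
classify(
    Aggregation("count", BinOp("+", VarID("DS_1"), VarID("SC_1"))),
    nested_ds,
    nested_amb,
)
print(direct_ds, nested_amb)
```

In this sketch `count(DS_1)` marks `DS_1` as a dataset, while in `count(DS_1 + SC_1)` the operand `SC_1` is left ambiguous instead of being forced into the dataset category.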

- Add visit_EvalOp to mark eval operands as dataset_inputs
- Classify scalar-only parameters: random index, round/trunc/substr/replace/instr params, between from/to bounds
- Add tests 161-171 covering eval, random, and scalar-constrained params

…ests

Fix bug where a variable appearing in both dataset_inputs and scalar_inputs across statements was classified as dataset instead of dataset_or_scalar. Add 15 edge case tests (172-186) covering scalar chain propagation, unknown variable resolution, nested operators, deeply nested expressions, and if-then-else with dataset-only branches.
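A minimal sketch of the dual-context rule described in this fix, assuming each statement contributes a pair of sets (dataset_inputs, scalar_inputs). The function name `classify_across_statements` and the returned dictionary shape are hypothetical; only the rule itself (a name seen in both roles becomes dataset_or_scalar) comes from the commit message.

```python
from typing import Dict, Iterable, Set, Tuple


def classify_across_statements(
    per_statement: Iterable[Tuple[Set[str], Set[str]]],
) -> Dict[str, Set[str]]:
    seen_as_dataset: Set[str] = set()
    seen_as_scalar: Set[str] = set()
    for dataset_inputs, scalar_inputs in per_statement:
        seen_as_dataset |= dataset_inputs
        seen_as_scalar |= scalar_inputs

    # A name used in both roles cannot be pinned down, so it is ambiguous.
    both = seen_as_dataset & seen_as_scalar
    return {
        "global_input_datasets": seen_as_dataset - both,
        "global_input_scalars": seen_as_scalar - both,
        "global_input_dataset_or_scalar": both,
    }


result = classify_across_statements(
    [
        ({"DS_1", "X"}, set()),    # statement 1 uses X in a dataset position
        ({"DS_2"}, {"X", "N"}),    # statement 2 uses X in a scalar position
    ]
)
print(result)
```

Before the fix, a name like `X` in this example would have ended up in the dataset category; the sketch places it in `global_input_dataset_or_scalar` instead.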

javihern98 · 2026-03-12T11:00:08Z

This one is postponed until further notice, pending clarification of its possible impact, of how we could benefit from these performance advantages, and of whether it could also be used in the DuckDB process. Scheduled for review again in the last days of April.

javihern98 added 11 commits March 5, 2026 16:06

feat: add has_dataset_op to StatementDeps and four category fields to…

674a1a5

… DatasetSchedule

feat: update consumers to use categorized global input fields

23a2fe6

refactor: extract helper methods to reduce complexity in DAG analysis

5d9405f

feat: fix dataset classification for UDO calls and MEMBERSHIP operator

5aca4ad

Set _current_has_dataset_op in visit_ParamOp, visit_UDOCall, and visit_BinOp (MEMBERSHIP) so UDO dataset parameters and membership operands are correctly classified as global_input_datasets.

chore: add types-networkx to dev dependencies

978f220

docs: add docstrings to DAG models and filter global inputs by all_ou…

a56f90f

…tputs

refactor: rename DatasetSchedule to Schedule

d2f67a1

javihern98 wants to merge 11 commits into main from cr-571
