Classify dataset vs scalar inputs in DAG Schedule #573
Draft
javihern98 wants to merge 11 commits into main
… propagation
Track dataset operations per statement (has_dataset_op flag) and use fixed-point propagation to identify scalar outputs. Classify global inputs into: global_input_datasets, global_input_scalars, global_input_dataset_or_scalar, and global_input_component_or_scalar. Add 6 new DAG test cases (11-16) covering scalar chains, RegularAggregation ambiguity, top-level ambiguity, and mixed contexts.
Set _current_has_dataset_op in visit_ParamOp, visit_UDOCall, and visit_BinOp (MEMBERSHIP) so UDO dataset parameters and membership operands are correctly classified as global_input_datasets.
Extend DAG analysis to discriminate between datasets, scalars, dataset-or-scalar, and component-or-scalar global inputs. Each dataset-only operator now explicitly marks its direct VarID operands as dataset_inputs, so sub-expression operands (e.g., SC_1 in count(DS_1 + SC_1)) are correctly classified as ambiguous rather than forced to datasets. Split test_dag.py into test_classification.py (154 cases) and test_scheduling.py (13 non-trivial cases), inline VTL for simple tests, and delete unused reference/VTL files.
- Add visit_EvalOp to mark eval operands as dataset_inputs
- Classify scalar-only parameters: random index, round/trunc/substr/replace/instr params, between from/to bounds
- Add tests 161-171 covering eval, random, and scalar-constrained params
…ests
Fix bug where a variable appearing in both dataset_inputs and scalar_inputs across statements was classified as dataset instead of dataset_or_scalar. Add 15 edge case tests (172-186) covering scalar chain propagation, unknown variable resolution, nested operators, deeply nested expressions, and if-then-else with dataset-only branches.
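For reviewers, the fixed-point propagation and the cross-statement merge rule described in the commits above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `Stmt`, `scalar_outputs`, and `classify_global_inputs` are hypothetical names, and the logic is an assumption based on the commit messages. Only `has_dataset_op`, `dataset_inputs`, `scalar_inputs`, and the category names come from the PR. The `global_input_component_or_scalar` category is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Stmt:
    """Hypothetical stand-in for the per-statement dependency record."""
    output: str
    inputs: set[str]
    has_dataset_op: bool = False
    dataset_inputs: set[str] = field(default_factory=set)
    scalar_inputs: set[str] = field(default_factory=set)


def scalar_outputs(stmts: list[Stmt]) -> set[str]:
    """Iterate until no new scalar output is found (fixed point)."""
    scalars: set[str] = set()
    changed = True
    while changed:
        changed = False
        for s in stmts:
            if s.output in scalars or s.has_dataset_op:
                continue
            # A statement with no dataset operator whose inputs are all
            # known scalars (or which has no inputs) yields a scalar.
            if all(i in scalars for i in s.inputs):
                scalars.add(s.output)
                changed = True
    return scalars


def classify_global_inputs(stmts: list[Stmt]) -> dict[str, set[str]]:
    """Merge per-statement evidence into global input categories."""
    produced = {s.output for s in stmts}
    ds: set[str] = set()
    sc: set[str] = set()
    used: set[str] = set()
    for s in stmts:
        ds |= s.dataset_inputs - produced
        sc |= s.scalar_inputs - produced
        used |= (s.inputs | s.dataset_inputs | s.scalar_inputs) - produced
    both = ds & sc  # conflicting evidence across statements stays ambiguous
    return {
        "global_input_datasets": ds - both,
        "global_input_scalars": sc - both,
        "global_input_dataset_or_scalar": (used - ds - sc) | both,
    }


stmts = [
    Stmt(output="a", inputs=set()),      # e.g. a := 1
    Stmt(output="b", inputs={"a"}),      # e.g. b := a + 2
    Stmt(output="DS_r", inputs={"X"}),   # e.g. DS_r := X + 2
    Stmt(output="DS_2", inputs={"DS_1", "Y"}, has_dataset_op=True,
         dataset_inputs={"DS_1"}, scalar_inputs={"Y"}),
    Stmt(output="DS_3", inputs={"Y"}, has_dataset_op=True,
         dataset_inputs={"Y"}),
]
print(sorted(scalar_outputs(stmts)))
print(classify_global_inputs(stmts))
```

In this sketch `Y` is marked as a scalar input in one statement and as a dataset input in another, so it ends up in `global_input_dataset_or_scalar` rather than being forced to a dataset, which is the behaviour the last commit describes.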
This one is postponed until further notice, pending clarification of its possible impact, of how we could benefit from these performance advantages, and of whether it could also be used in the DuckDB process. Scheduled for review again in late April.
Summary
Closes #571
Splits `global_inputs` into four categories in `Schedule` (renamed from `DatasetSchedule`), with `global_inputs` as their union:

- `global_input_datasets`
- `global_input_scalars`
- `global_input_dataset_or_scalar` (e.g. `DS_r := X + 2`)
- `global_input_component_or_scalar` (e.g. `DS_1[calc Me_2 := Me_1 + X]`)

Key changes:

- `StatementDeps`: Added `has_dataset_op`, `dataset_inputs`, `scalar_inputs` fields for per-statement tracking
- `DAGAnalyzer`: 21 visitor methods classify operands by operator semantics (dataset-only, dual, scalar-constrained)
- … `_resolved_from_unknown`, propagates through pure-scalar chains
- A variable appearing in both `dataset_inputs` and `scalar_inputs` across statements → `dataset_or_scalar`
- … `_validate_extra_datasets`, `_extract_input_datasets`, `_save_datapoints_efficient`

Checklist

- … (`ruff format`, `ruff check`, `mypy`)
- … (`pytest`)

Impact / Risk

- `DatasetSchedule` renamed to `Schedule`; new fields added (additive).
- `global_inputs` behavior unchanged (still the union of all categories).
- `Schedule` fields enable precise dataset vs scalar discrimination for callers.

Notes
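As a rough illustration of the union invariant stated in the summary, the sketch below models the four categories as sets and derives `global_inputs` from them. `ScheduleSketch` is a hypothetical class, not the `Schedule` in this PR; whether the real `global_inputs` is a stored field or a derived property, and which container types it uses, are assumptions here.

```python
from dataclasses import dataclass, field


@dataclass
class ScheduleSketch:
    """Hypothetical shape; only the four field names and global_inputs
    are taken from the PR description."""
    global_input_datasets: set[str] = field(default_factory=set)
    global_input_scalars: set[str] = field(default_factory=set)
    global_input_dataset_or_scalar: set[str] = field(default_factory=set)
    global_input_component_or_scalar: set[str] = field(default_factory=set)

    @property
    def global_inputs(self) -> set[str]:
        # Assumed to be derived here: the union of the four categories.
        return (
            self.global_input_datasets
            | self.global_input_scalars
            | self.global_input_dataset_or_scalar
            | self.global_input_component_or_scalar
        )


schedule = ScheduleSketch(
    global_input_datasets={"DS_1"},
    global_input_scalars={"SC_1"},
    global_input_dataset_or_scalar={"X"},
    global_input_component_or_scalar={"Y"},
)
print(sorted(schedule.global_inputs))
```

A caller that only needs to know which inputs must be loaded as datasets could read `global_input_datasets` directly, while existing callers that rely on `global_inputs` would see the same overall set as before.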