Skip to content

aql_subquery_implementation

makr-code edited this page Dec 21, 2025 · 1 revision

Subquery & CTE Implementation Summary

Stand: 5. Dezember 2025
Version: 1.0.0
Kategorie: Aql


Feature: Full Subquery and Common Table Expression (CTE) Support
Branch: feature/aql-st-functions
Completion Date: 17. November 2025
Total Effort: ~28 Stunden (Phase 3: 14h + Phase 4: 14h)


Overview

ThemisDB unterstützt jetzt vollständig:

  • WITH-Klausel für Common Table Expressions (CTEs)
  • Scalar Subqueries in LET und RETURN Expressions
  • Correlated Subqueries mit Zugriff auf äußere Variablen
  • ANY/ALL Quantifiers mit Subquery-Support
  • Automatic Memory Management mit Spill-to-Disk für große CTEs
  • Performance Optimization mit Materialization Heuristics

Architecture

1. Parsing Layer (Phase 3)

AST Nodes:

  • WithNode - WITH-Klausel Container
  • CTEDefinition - einzelne CTE Definition (name + subquery)
  • SubqueryExpr - Subquery in Expression
  • AnyExpr / AllExpr - Quantified predicates

Parser Extensions:

  • parseWithClause() - parst WITH name AS (subquery), ...
  • parsePrimaryExpression() - erkennt (FOR ... RETURN ...) als Subquery
  • parseQuantifiedExpression() - parst ANY x IN arr SATISFIES pred

Files:

  • include/query/aql_ast.h - AST node definitions
  • src/query/aql_parser.cpp - parsing logic

2. Translation Layer (Phase 4.1)

CTE Processing:

  • AQLTranslator::translate() sammelt CTEs aus WITH-Klausel
  • countCTEReferences() zählt CTE-Verwendungen rekursiv
  • SubqueryOptimizer::shouldMaterializeCTE() entscheidet Materialisierung
  • attachCTEs() fügt CTE metadata zu TranslationResult hinzu

Data Structures:

  • TranslationResult::CTEExecution - CTE metadata (name, subquery, should_materialize)
  • vector<CTEExecution> ctes - attached to all success results

Files:

  • include/query/aql_translator.h - CTEExecution struct, declarations
  • src/query/aql_translator.cpp - CTE collection and optimization logic

3. Execution Layer (Phase 4.2)

CTE Execution:

  • QueryEngine::executeCTEs() - führt CTE-Liste sequentiell aus
  • Für jede CTE: translate → execute (based on type) → store in context
  • Unterstützt alle Query-Typen: Join, Conjunctive, Disjunctive, VectorGeo, ContentGeo

Subquery Execution:

  • evaluateExpression() SubqueryExpr case - recursive translation & execution
  • Creates child context via ctx.createChild() for correlation
  • Executes CTEs if present, then main subquery
  • Returns scalar (single), null (empty), or array (multiple) results

CTE References in FOR:

  • executeJoin() checks ctx.getCTE(collection) before table scan
  • Nested-loop join iterates CTE results instead of table
  • Hash-join builds/probes from CTE results

Files:

  • include/query/query_engine.h - executeCTEs declaration, parent_context param
  • src/query/query_engine.cpp - executeCTEs, SubqueryExpr, CTE iteration logic

4. Memory Management (Phase 4.4)

CTECache Design:

  • In-memory cache with configurable limit (default 100MB)
  • Automatic spill-to-disk when threshold exceeded
  • Sample-based size estimation (first 10 elements → extrapolate)
  • LRU-style eviction (largest-first)
  • Binary spill format: count + (size + json_data) pairs
  • Transparent loading on access
  • Auto-cleanup on destruction

Integration:

  • EvaluationContext::cte_cache - shared_ptr across contexts
  • storeCTE() / getCTE() - cache-first with fallback to in-memory map
  • createChild() - shares cache pointer with child contexts
  • executeJoin() - initializes cache with default config

Statistics:

  • total_ctes, in_memory_ctes, spilled_ctes
  • memory_usage_bytes, total_results
  • spill_operations, disk_reads

Files:

  • include/query/cte_cache.h - CTECache class (156 lines)
  • src/query/cte_cache.cpp - Implementation (338 lines)

Features

WITH Clause (CTEs)

Basic CTE:

WITH expensive_hotels AS (
    FOR h IN hotels 
    FILTER h.price > 200 
    RETURN h
)
FOR doc IN expensive_hotels
RETURN doc.name

Multiple CTEs:

WITH 
  expensive AS (FOR h IN hotels FILTER h.price > 200 RETURN h),
  berlin AS (FOR e IN expensive FILTER e.city == "Berlin" RETURN e)
FOR doc IN berlin 
RETURN doc

CTE Dependencies: CTEs können vorherige CTEs referenzieren (sequential execution).

Scalar Subqueries

In LET:

FOR user IN users
LET avgAge = (FOR u IN users RETURN AVG(u.age))
RETURN {user: user.name, avgAge: avgAge[0]}

In RETURN:

FOR user IN users
RETURN {
    name: user.name,
    orderCount: LENGTH((FOR o IN orders FILTER o.userId == user._key RETURN o))
}

Correlated Subqueries

LET with Correlation:

FOR user IN users
LET userOrders = (FOR o IN orders FILTER o.userId == user._key RETURN o)
RETURN {user: user.name, orders: userOrders}

FILTER with Correlation:

FOR user IN users
FILTER (FOR o IN orders FILTER o.userId == user._key RETURN o) != []
RETURN user

ANY/ALL Quantifiers

ANY:

FOR doc IN users
FILTER ANY tag IN doc.tags SATISFIES tag == "premium"
RETURN doc

ALL:

FOR order IN orders
FILTER ALL item IN order.items SATISFIES item.price < 100
RETURN order

With Subqueries:

FOR user IN users
FILTER ANY order IN (FOR o IN orders FILTER o.userId == user._key RETURN o)
       SATISFIES order.total > 1000
RETURN user

Nested Subqueries

Nested in LET:

FOR doc IN orders
LET enriched = (
    FOR product IN products
    FILTER product.id == (FOR item IN doc.items RETURN item.productId LIMIT 1)[0]
    RETURN product
)
RETURN {order: doc, product: enriched}

Subqueries with CTEs:

FOR doc IN orders
LET enriched = (
    WITH expensive AS (FOR p IN products FILTER p.price > 100 RETURN p)
    FOR ep IN expensive FILTER ep.id == doc.productId RETURN ep
)
RETURN {order: doc, product: enriched}

Memory Management

Configuration

Default Config:

CTECache::Config config;
config.max_memory_bytes = 100 * 1024 * 1024; // 100MB
config.spill_directory = "./themis_cte_spill";
config.enable_compression = false;  // Future optimization
config.auto_cleanup = true;

Custom Config (Future): Via QueryEngine constructor or configuration file.

Spill Strategy

When:

  • store() estimates CTE size
  • If current_usage + new_cte_size > max_memory_bytes:
    • Call makeRoom(new_cte_size)
    • Find largest in-memory CTE
    • Spill to disk if >= required bytes

Size Estimation:

  • Sample first 10 elements
  • Serialize to JSON
  • Calculate average size
  • Extrapolate: avg_size * total_count + overhead

Binary Format:

[count: uint64_t]
[size1: uint64_t][data1: json bytes]
[size2: uint64_t][data2: json bytes]
...

Automatic Cleanup

On Destruction:

  • Remove all spill files
  • Remove spill directory if empty
  • Reset statistics

Manual Cleanup:

  • cache.clear() - removes all CTEs and spill files
  • cache.remove(name) - removes specific CTE

Performance Optimizations

Materialization Heuristics

SubqueryOptimizer::shouldMaterializeCTE():

  1. Always Materialize:

    • Multiple references (ref_count > 1)
    • Used in aggregate functions
    • Used in GROUP BY or SORT
  2. Consider Inlining:

    • Single reference (ref_count == 1)
    • Simple filter-only queries
    • Small estimated result size

Join Optimization

Hash-Join with CTEs:

  • Build phase checks getCTE() for build table
  • Probe phase checks getCTE() for probe table
  • CTE results bypass table scan

Predicate Pushdown:

  • Single-variable filters pushed down to CTE iteration
  • Multi-variable filters applied after join

Testing

Parser Tests (tests/test_aql_subqueries.cpp)

Phase 3 Tests:

  • ScalarSubqueryInLet - Subquery in LET expression
  • NestedSubquery - Multi-level subquery nesting
  • AnyQuantifier - ANY with array iteration
  • AllQuantifier - ALL with array iteration
  • WithClauseSingleCTE - Single CTE parsing
  • WithClauseMultipleCTEs - Multiple CTE parsing
  • CTEWithFilters - Complex CTE queries

Phase 4 Tests:

  • SubqueryExecution_ScalarResult - Single value return
  • SubqueryExecution_ArrayResult - Multiple value return
  • SubqueryExecution_NestedSubqueries - Subquery in LET + FILTER
  • SubqueryExecution_WithCTE - Subquery containing WITH clause
  • SubqueryExecution_CorrelatedSubquery - Outer variable reference
  • SubqueryExecution_InReturnExpression - Subquery in RETURN object

CTECache Tests (tests/test_cte_cache.cpp)

Basic Operations:

  • BasicStoreAndGet - Store and retrieve CTE
  • MultipleCTEs - Multiple CTEs in cache
  • RemoveCTE - Remove specific CTE

Spill-to-Disk:

  • AutomaticSpillToDisk - Trigger spill with large data
  • MultipleSpills - Multiple CTEs exceed memory
  • SpillFileCleanup - Auto-cleanup on destruction

Memory Management:

  • MemoryUsageTracking - Track memory consumption
  • ClearCache - Clear all CTEs
  • StatsAccumulation - Statistics collection

Edge Cases:

  • EmptyResults - Empty CTE
  • NonExistentCTE - Access non-existent CTE
  • OverwriteCTE - Overwrite existing CTE

Known Limitations

  1. No Compression:

    • Spill files use uncompressed JSON
    • Future: Add zstd compression option
  2. No Query Plan Caching:

    • CTEs are re-translated on every query
    • Future: Cache translation results
  3. No Parallel CTE Execution:

    • CTEs executed sequentially
    • Future: Detect independent CTEs, execute in parallel
  4. Simple Eviction Strategy:

    • Largest-first eviction
    • Future: LRU or access-frequency based
  5. No Distributed Execution:

    • CTEs execute on single node
    • Future: Distribute large CTEs across cluster

Future Enhancements

Phase 5 Options

A. Window Functions (10-14h):

  • ROW_NUMBER(), RANK(), DENSE_RANK()
  • LEAD(), LAG()
  • PARTITION BY, ORDER BY
  • Frame specifications (ROWS/RANGE)

B. Advanced JOINs (16-20h):

  • LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN
  • ON clause syntax
  • JOIN optimization (reordering, statistics)

C. Query Plan Caching (6-8h):

  • Cache TranslationResult by query hash
  • Invalidate on schema change
  • LRU eviction

D. CTE Enhancements (4-6h):

  • RECURSIVE CTEs (tree traversal)
  • Compression in spill files
  • Parallel CTE execution
  • Persistent CTE materialization

E. Subquery Optimizations (8-10h):

  • Subquery to JOIN rewrite
  • IN (subquery) optimization
  • EXISTS optimization
  • Semi-join / Anti-join

Code Statistics

New Files:

  • include/query/cte_cache.h - 156 lines
  • src/query/cte_cache.cpp - 338 lines
  • tests/test_cte_cache.cpp - 330 lines
  • docs/SUBQUERY_IMPLEMENTATION_SUMMARY.md - this file

Modified Files:

  • include/query/aql_ast.h - +80 lines (AST nodes)
  • src/query/aql_parser.cpp - +250 lines (parsing logic)
  • include/query/aql_translator.h - +35 lines (CTEExecution, declarations)
  • src/query/aql_translator.cpp - +180 lines (CTE collection, reference counting)
  • include/query/query_engine.h - +25 lines (executeCTEs, cache integration)
  • src/query/query_engine.cpp - +400 lines (executeCTEs, SubqueryExpr, CTE iteration)
  • tests/test_aql_subqueries.cpp - +150 lines (execution tests)
  • CMakeLists.txt - +2 lines (cte_cache.cpp, test_cte_cache.cpp)

Total: ~1800 lines of new/modified code


Migration Guide

For Existing Queries

No Breaking Changes:

  • All existing queries continue to work
  • CTEs are opt-in via WITH clause
  • Subqueries are opt-in via parenthesized FOR

Performance Considerations

When to Use CTEs:

  • Multiple references to same subquery
  • Complex filtering that should be materialized
  • Readability improvement for complex queries

When to Avoid:

  • Single-use subqueries (inlining may be faster)
  • Very large result sets (consider streaming)
  • Simple filters (better to inline)

Memory Configuration

Default (100MB): Suitable for most workloads.

Large Datasets: Consider increasing max_memory_bytes if:

  • Frequent spill operations (check stats)
  • Fast SSD available for spill directory
  • Memory is abundant

Small Environments: Consider decreasing max_memory_bytes if:

  • Limited RAM
  • Many concurrent queries
  • Small CTEs typical

References

Documentation:

  • docs/PHASE_3_PLAN.md - Parsing & AST design
  • docs/PHASE_4_PLAN.md - Execution & memory management
  • docs/AQL_GRAMMAR.md - Updated grammar with subqueries

Code:

  • include/query/aql_ast.h - AST definitions
  • include/query/aql_translator.h - Translation interface
  • include/query/query_engine.h - Execution interface
  • include/query/cte_cache.h - Memory management

Tests:

  • tests/test_aql_subqueries.cpp - Parser & execution tests
  • tests/test_cte_cache.cpp - Memory management tests

Contributors

  • Implementation: AI Assistant (GitHub Copilot)
  • Design Review: mkrueger
  • Testing: Automated test suite

Last Updated: 17. November 2025
Version: 1.0
Status: Production Ready (pending compilation verification)

ThemisDB Dokumentation

Version: 1.3.0 | Stand: Dezember 2025


📋 Schnellstart


🏗️ Architektur


🗄️ Basismodell


💾 Storage & MVCC


📇 Indexe & Statistiken


🔍 Query & AQL


💰 Caching


📦 Content Pipeline


🔎 Suche


⚡ Performance & Benchmarks


🏢 Enterprise Features


✅ Qualitätssicherung


🧮 Vektor & GNN


🌍 Geo Features


🛡️ Sicherheit & Governance

Authentication

Schlüsselverwaltung

Verschlüsselung

TLS & Certificates

PKI & Signatures

PII Detection

Vault & HSM

Audit & Compliance

Security Audits

Gap Analysis


🚀 Deployment & Betrieb

Docker

Observability

Change Data Capture

Operations


💻 Entwicklung

API Implementations

Changefeed

Security Development

Development Overviews


📄 Publikation & Ablage


🔧 Admin-Tools


🔌 APIs


📚 Client SDKs


📊 Implementierungs-Zusammenfassungen


📅 Planung & Reports


📖 Dokumentation


📝 Release Notes


📖 Styleguide & Glossar


🗺️ Roadmap & Changelog


💾 Source Code Documentation

Main Programs

Source Code Module


🗄️ Archive


🤝 Community & Support


Vollständige Dokumentation: https://makr-code.github.io/ThemisDB/

Clone this wiki locally