
[RFC] PPL cluster Command #5255

@ritvibhatt

Description

Problem Statement

Log analysis frequently requires identifying patterns in event data: what types of errors are occurring, which messages are common versus rare, and what distinct categories of events exist. Manually reviewing thousands of log lines is impractical, so users need a way to automatically group similar log events together to quickly triage issues and understand the shape of their data. Expected use cases involve filtering before clustering: clustering only error logs, grouping similar events by sourcetype to understand what patterns exist, or finding rare events during an incident time window.

Syntax

cluster <field> [t=threshold] [match=mode] [labelfield=name] [countfield=name] [labelonly=bool] [showcount=bool] [delims=chars]

Parameters:

  • field: Field to cluster on (required, no default field to fall back to)
  • t: Similarity threshold between 0.0-1.0 (default: 0.8)
  • match: Vectorization mode - termlist (positional), termset (bag-of-words), or ngramset (character trigrams) (default: termlist)
  • labelfield: Name for cluster label output field (default: cluster_label)
  • countfield: Name for cluster count output field (default: cluster_count)
  • labelonly: If true, only output cluster labels, not counts (default: true)
  • showcount: If true, include cluster counts in output (default: true)
  • delims: Delimiter characters for tokenization (default: everything except [0-9A-Za-z_])
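As a rough illustration of the default `delims` behavior, the sketch below splits on every character outside `[0-9A-Za-z_]` and drops empty tokens. The class and method names are hypothetical, not the actual implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the default tokenization: every character outside
// [0-9A-Za-z_] acts as a delimiter, and empty tokens are discarded.
public class ClusterTokenizer {
    public static List<String> tokenize(String text) {
        return Arrays.stream(text.split("[^0-9A-Za-z_]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Note that "db-01" splits at the hyphen, since '-' is not
        // in the default token character set.
        System.out.println(tokenize("ERROR: connection to db-01 refused"));
        // [ERROR, connection, to, db, 01, refused]
    }
}
```

Passing a custom `delims` value would change which characters trigger a split; underscores survive by default, which keeps identifiers like `x_1` intact.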

Text Similarity Approach

Uses lexical (token-based) similarity rather than semantic similarity. Logs are structurally predictable, making lexical matching sufficient and cheaper. Semantic similarity can also be counterproductive: logs about similar issues on different services would embed closely together yet be operationally unrelated.

Vectorization Modes

  • termlist (default): Ordered token comparison. Fastest. Two events must share tokens in the same positions.
  • termset: Unordered token comparison. Catches similarity when word order varies.
  • ngramset: Character trigram comparison. Significantly slower.
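The three modes can be sketched as feature extractors that feed the similarity computation. This is an illustrative simplification (the positional-key encoding for termlist is an assumption, not the real vectorizer):

```java
import java.util.*;

// Sketch of the three vectorization modes: each maps an event to a
// feature-count map that a cosine similarity can consume.
public class Vectorizer {
    // termlist: position-sensitive — "0:ERROR" and "1:ERROR" are distinct features
    public static Map<String, Integer> termlist(List<String> tokens) {
        Map<String, Integer> v = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++)
            v.merge(i + ":" + tokens.get(i), 1, Integer::sum);
        return v;
    }

    // termset: unordered bag of words
    public static Map<String, Integer> termset(List<String> tokens) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : tokens) v.merge(t, 1, Integer::sum);
        return v;
    }

    // ngramset: character trigrams over the raw text — far more features,
    // hence significantly slower
    public static Map<String, Integer> ngramset(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++)
            v.merge(text.substring(i, i + 3), 1, Integer::sum);
        return v;
    }

    public static void main(String[] args) {
        List<String> a = List.of("disk", "full"), b = List.of("full", "disk");
        System.out.println(termset(a).equals(termset(b)));   // true: order ignored
        System.out.println(termlist(a).equals(termlist(b))); // false: order matters
    }
}
```

The example shows why termset catches reordered messages that termlist treats as different.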

Clustering Algorithm

Greedy single-pass clustering. Each event is compared against existing cluster representatives. If similarity exceeds threshold t, the event joins that cluster. Otherwise a new cluster is created.

  • Time: O(n×k), Space: O(k×d) where k=clusters, d=dimensions
  • Order-dependent results (same data in different order can produce different clusters)
  • Only stores cluster representatives, not full similarity matrix
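The greedy pass above can be sketched as follows. Several choices here are simplifying assumptions rather than the real engine's behavior: the first event seen in a cluster serves as its representative, similarity uses a from-scratch cosine over a bag-of-words, and the threshold comparison is `>=`:

```java
import java.util.*;

// Minimal sketch of greedy single-pass clustering: O(n*k) comparisons
// against one representative per cluster, no pairwise similarity matrix.
public class GreedyClustering {
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static Map<String, Integer> bag(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : text.split("[^0-9A-Za-z_]+"))
            if (!t.isEmpty()) v.merge(t, 1, Integer::sum);
        return v;
    }

    // Returns a cluster id per event; first matching cluster wins.
    public static int[] cluster(List<String> events, double threshold) {
        List<Map<String, Integer>> reps = new ArrayList<>();
        int[] labels = new int[events.size()];
        for (int i = 0; i < events.size(); i++) {
            Map<String, Integer> v = bag(events.get(i));
            int best = -1;
            for (int k = 0; k < reps.size(); k++)
                if (cosine(v, reps.get(k)) >= threshold) { best = k; break; }
            if (best == -1) { reps.add(v); best = reps.size() - 1; }
            labels[i] = best;
        }
        return labels;
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "connection timeout to host a",
            "connection timeout to host b",
            "disk full on volume 1");
        System.out.println(Arrays.toString(cluster(logs, 0.8))); // [0, 0, 1]
    }
}
```

The order dependence is visible in the loop: the first event of each cluster fixes its representative, so feeding the same events in a different order can yield different cluster boundaries.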

Processing Flow

  1. Collect events in batches (following LogPatternAggFunction pattern)
  2. Process each batch against existing global cluster state
  3. Compare new events against cluster representatives only
  4. Assign to existing cluster or create new one based on threshold

Other Algorithm Options Considered

Union-Find Connected Components

  • Compare all document pairs, union those above threshold
  • Find globally optimal connected components
  • Time: O(n²×α(n)), Space: O(n)
  • Order-independent, but much slower for large datasets
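For contrast, the rejected union-find alternative can be sketched as below. The similarity predicate is a stand-in parameter so the O(n²) structure stays visible; the real comparison would be the same cosine check used elsewhere:

```java
import java.util.*;

// Sketch of union-find connected-components clustering: compare all pairs,
// union those the predicate accepts, then read off components as clusters.
public class UnionFindClustering {
    static int[] parent;
    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
    static void union(int a, int b) { parent[find(a)] = find(b); }

    public static int[] cluster(List<String> events,
                                java.util.function.BiPredicate<String, String> similar) {
        int n = events.size();
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int i = 0; i < n; i++)          // O(n^2) pairwise comparisons
            for (int j = i + 1; j < n; j++)
                if (similar.test(events.get(i), events.get(j))) union(i, j);
        // Relabel component roots as dense cluster ids
        Map<Integer, Integer> ids = new LinkedHashMap<>();
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            int root = find(i);
            if (!ids.containsKey(root)) ids.put(root, ids.size());
            labels[i] = ids.get(root);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy predicate for the demo: events are "similar" if they share a first token.
        List<String> logs = List.of("timeout a", "timeout b", "disk full");
        int[] labels = cluster(logs, (x, y) -> x.split(" ")[0].equals(y.split(" ")[0]));
        System.out.println(Arrays.toString(labels)); // [0, 0, 1]
    }
}
```

The result is order-independent (components don't depend on iteration order), but the nested loop is exactly what makes it impractical at log-scale n.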

Hierarchical Agglomerative Clustering

  • Build distance matrix, iteratively merge closest clusters
  • High quality results but O(n²) space and O(n³) time
  • Impractical for large log datasets

Comparison to existing patterns command

patterns uses the Brain log parser algorithm which works fundamentally differently.
Instead of comparing every document against cluster representatives (O(n×k) similarity
comparisons), it:

  1. Tokenizes each log line
  2. Identifies which tokens are variable (numbers, IDs, timestamps) vs. static
  3. Replaces variable tokens with wildcards to produce a pattern template
  4. Groups documents by exact template match (a simple hash map lookup — O(1) per document)

So patterns is O(n) total because each document is processed once, independently. No pairwise comparisons are needed. And the merge across batches is trivial because two documents with the same template always produce the same template string.
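The O(n) property can be made concrete with a toy template extractor. The real Brain parser is far more sophisticated; here "variable token" simply means "contains a digit", which is an assumption for the demo only:

```java
import java.util.*;

// Illustrative sketch of template-based grouping: each line maps
// independently to a template, and a hash map groups identical templates —
// one O(1) lookup per document, no pairwise comparisons.
public class TemplateGrouping {
    static String template(String line) {
        String[] tokens = line.split(" ");
        for (int i = 0; i < tokens.length; i++)
            if (tokens[i].matches(".*\\d.*")) tokens[i] = "<*>"; // variable token -> wildcard
        return String.join(" ", tokens);
    }

    public static Map<String, Integer> group(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines)
            counts.merge(template(line), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "user 42 logged in",
            "user 7 logged in",
            "disk full");
        System.out.println(group(logs)); // {user <*> logged in=2, disk full=1}
    }
}
```

Because two documents with the same template always produce the same template string, merging partial results across batches is a plain map merge, which is what lets patterns parallelize.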

We can't use this approach for cluster because:

  • patterns produces rigid templates — two messages must share the exact same static tokens to group together. cluster groups by fuzzy similarity with a configurable threshold.
  • patterns only works well for structured log messages with clear variable/static parts while cluster works on any text.
  • patterns doesn't support different similarity modes (termlist, termset, ngramset) or configurable thresholds.

Cluster merges batches by comparing representatives against each other using similarity
computation. There's no deterministic key to merge on because cluster membership is fuzzy (threshold-based), not exact. This is why cluster can't parallelize across shards the way patterns can.
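A hedged sketch of that fuzzy merge: representatives from one partial state are folded into another by similarity, with counts riding along. Jaccard over token sets is a stand-in for the demo (the engine uses cosine), and the shapes here are illustrative, not the actual state classes:

```java
import java.util.*;

// Sketch of merging two partial cluster states by comparing
// representatives — there is no exact key to join on.
public class ClusterMerge {
    public static class Cluster {
        public final Set<String> rep;
        public int count;
        public Cluster(Set<String> rep, int count) { this.rep = rep; this.count = count; }
    }

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> uni = new HashSet<>(a); uni.addAll(b);
        return uni.isEmpty() ? 0 : (double) inter.size() / uni.size();
    }

    public static List<Cluster> merge(List<Cluster> left, List<Cluster> right, double t) {
        List<Cluster> out = new ArrayList<>(left);
        for (Cluster r : right) {
            Cluster match = null;
            for (Cluster l : out)
                if (jaccard(l.rep, r.rep) >= t) { match = l; break; }
            if (match != null) match.count += r.count;  // fold into existing cluster
            else out.add(r);                            // genuinely new cluster
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cluster> a = new ArrayList<>(List.of(new Cluster(Set.of("timeout", "host"), 5)));
        List<Cluster> b = List.of(new Cluster(Set.of("timeout", "host"), 3),
                                  new Cluster(Set.of("disk", "full"), 2));
        System.out.println(merge(a, b, 0.8).size() + " clusters"); // 2 clusters
    }
}
```

Every merge is itself O(k_left × k_right) similarity computations, which is the cost patterns avoids entirely with its exact template keys.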

Implementation Plan

Key Components

1. AST Node (core/src/main/java/org/opensearch/sql/ast/tree/Cluster.java)

  • Store command parameters: field, threshold, match mode, output options
  • Integrate with visitor pattern

2. Clustering Engine (common/src/main/java/org/opensearch/sql/common/cluster/TextSimilarityClustering.java)

  • Three vectorization modes: termlist (positional), termset (bag-of-words), ngramset (character trigrams)
  • Cosine similarity computation using commons-text
  • Vector caching for performance

3. Aggregate Function (core/src/main/java/org/opensearch/sql/calcite/udf/udaf/ClusterLabelAggFunction.java)

  • Buffered processing following LogPatternAggFunction pattern
  • Thread-safe accumulator with synchronized collections
  • Incremental cluster state management across batches
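A speculative sketch of the accumulator shape described above: events go into a synchronized buffer that flushes in batches against shared cluster state. The batch size, names, and the placeholder equality check standing in for the similarity comparison are all assumptions, not taken from LogPatternAggFunction:

```java
import java.util.*;

// Speculative sketch of a buffered, synchronized accumulator: add() fills
// a buffer, flush() folds the batch into the shared representative list.
public class BufferedAccumulator {
    private static final int BATCH_SIZE = 1000;  // assumed batch size
    private final List<String> buffer = new ArrayList<>();
    private final List<String> representatives = new ArrayList<>(); // global cluster state

    public synchronized void add(String event) {
        buffer.add(event);
        if (buffer.size() >= BATCH_SIZE) flush();
    }

    public synchronized void flush() {
        for (String e : buffer)
            // Placeholder for the real similarity-vs-threshold check:
            if (representatives.stream().noneMatch(r -> r.equals(e)))
                representatives.add(e);
        buffer.clear();
    }

    public synchronized int clusterCount() { return representatives.size(); }

    // Convenience helper for the demo below.
    public static int clusterCountOf(List<String> events) {
        BufferedAccumulator acc = new BufferedAccumulator();
        for (String e : events) acc.add(e);
        acc.flush();
        return acc.clusterCount();
    }

    public static void main(String[] args) {
        System.out.println(clusterCountOf(List.of("timeout", "timeout", "disk full"))); // 2
    }
}
```

Coarse method-level synchronization like this is simple but serializes all writers, which is the synchronization overhead flagged under Risks below.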

4. Grammar Integration (ppl/src/main/antlr/OpenSearchPPLParser.g4, OpenSearchPPLLexer.g4)

  • Add CLUSTER_CMD token and clusterCommand rule
  • Support parameters: t, match, field, labelfield, countfield, labelonly, showcount, delims

Risks and Considerations

Performance: O(n×k) algorithm may struggle with datasets having many small clusters. Monitor cluster count growth.

Memory: Buffering strategy prevents unbounded growth, but large buffers still require significant memory.

Thread Safety: Synchronization overhead may impact performance in highly concurrent scenarios.

Order Dependency: Results depend on event processing order, which may surprise users expecting deterministic clustering.

Limitations

Required Field

The field parameter is required. Unlike SPL, which falls back to a default _raw text field, OpenSearch has no such default, so users must always specify which field to cluster on.

High Cardinality Performance

The clustering algorithm is O(n×k) where k is the number of distinct clusters. When k is small relative to n, performance scales linearly: 1M events with ~15 clusters completes in ~8s. When k approaches n (most events are unique), performance degrades: 10K events with 10K clusters takes ~23s, and 20K events take ~4 minutes. A max cluster cap of 10,000 is enforced to bound memory usage.

Other Approaches Considered

  1. Fingerprint analyzer + terms aggregation
  • Index the field with a fingerprint analyzer → identical token sets get the same fingerprint → terms agg groups them server-side
  • Pros: fully distributed, zero post-processing for exact-match grouping
  • Cons: only groups exact token-set matches, no fuzzy/threshold-based similarity. Requires index-time setup. Doesn't work for eval'd fields.
  2. min_hash token filter + terms aggregation
  • Index with min_hash filter → similar documents share hash buckets → group by hash bucket
  • Pros: distributed, approximate similarity grouping, tunable via hash count
  • Cons: requires index-time setup. Doesn't work for eval'd fields. Grouping by hash bucket is coarse, not the same as threshold-based clustering.
  3. ML Commons k-means with COSINE distance
  • Vectorize text → store as numeric fields → run k-means
  • Pros: distributed, built-in, supports cosine distance, auto-optimizes centroids
  • Cons: requires fixed k upfront (cluster command discovers k dynamically). Only accepts dense numeric vectors, but our term frequency vectors are sparse, and converting to dense format would be prohibitively expensive in memory.
  4. Text embedding + k-NN
  • Deploy embedding model → generate dense vectors at ingest → use k-NN for similarity
  • Pros: high-quality semantic similarity, distributed search
  • Cons: heavy infrastructure (model deployment, ingest pipeline). k-NN finds neighbors, doesn't cluster. Would need iterative queries to build clusters.
