feat: comprehensive observability instrumentation across Lance subsystems#6118
feat: comprehensive observability instrumentation across Lance subsystems#6118esteban wants to merge 14 commits intolance-format:mainfrom
Conversation
PR Review: Comprehensive Observability InstrumentationOverall this is a well-structured observability addition with consistent patterns (atomic counters, typed stats structs, tracing events). A few items to consider: P1: HNSW progress counter never reports the final batchIn P1:
|
wjones127
left a comment
There was a problem hiding this comment.
Thanks for the PR, I like most of these changes. I've been meaning to migrate away from log to tracing.
In the future, it would be appreciated to keep PRs to be more focused when possible. It makes reviewing them easier. Plus it also can get individual features through review faster--less controversial changes might get held up by refinements to more controversial changes. Some of these could be their own PR:
- Cache statistics
- IO statistics
- Compaction statistics - this one I'm skeptical of it's usefulness without knowing how it will be used.
- Index search metrics
|
Thanks for the review @wjones127!
I completely understand, I've been just piling this up in a local branch as I've been found the need to get better observability. Will try to provide mode atomic and focused PRs in the future! |
…s across all layers Replace the (u64, u64) tuple return from hit_miss_counts() with a named CacheHitMiss struct (hits, misses, hit_ratio(), miss_ratio()). Surface sync cache stat accessors through Dataset and Python bindings so consumers at every layer can read hit/miss counters without async overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add comprehensive cache observability: eviction counter via moka eviction_listener, max_capacity_bytes field, utilization() method, and tracing::debug on miss/clear/invalidate_prefix paths. Expose full CacheStats (including evictions) through Python bindings via new PyCacheStats pyclass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onnection reset tracing - RetryStats/RetrySnapshot on CloudObjectReader for retry/failure counters - IO scheduler: coalesced_iops, split_iops, backpressure_events in ScanStats - IoStats: per-method read/write latency_us and error counters - Object writer: tracing::warn on connection reset retries - Migrated log::warn/debug to tracing::warn/debug with structured targets Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- decode_batch: Instant-based elapsed_us + num_rows tracing at DEBUG level - buffer.rs: tracing::debug for concat, zip_into_one, and misaligned-copy paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ld progress - MetricsCollector: record_bytes_loaded, record_partitions_probed, record_candidates_evaluated, record_search_duration_us (default no-op) - LocalMetricsCollector: atomic fields + dump_into forwarding for new metrics - MetricsSnapshot struct for typed point-in-time reads - HNSW builder: progress logging every 10k vectors, search strategy tracing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Stats - CompactionMetrics: bytes_rewritten, bytes_binary_copied, elapsed_ms fields with #[serde(default)] for backward compatibility - FragmentDistributionStats: count, total/min/max/avg physical_rows, fragments_with_deletions via Dataset::fragment_distribution_stats() - Migrated compaction logging from log::info to tracing::info Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- docs/src/guide/observability.md: signal taxonomy, tracing targets, all stats structs, OTel integration guide, cardinality policy - rust/examples/src/tracing_setup.rs: working example with env-filter, fragment stats, and compaction tracing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tructs, RetryStats tests - commit_transaction: tracing::warn on conflict retry, non-conflict error, and retry exhaustion with structured fields (target: lance::commit) - CacheHitMiss, CacheStats: add #[non_exhaustive] for API stability - RetryStats: add unit tests for snapshot, increment, and equality - Fix session test to avoid struct literal construction of non_exhaustive type Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…crate - execute_with_retry: tracing::warn on conflict retry and exhaustion (target: lance::write::retry) with attempt, elapsed_ms fields - insert.rs: migrate log::warn/info to tracing::warn/info with structured fields - merge_insert.rs: migrate log::info to tracing::info Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- write.rs: use tracing::{debug, error, info, warn} (TaskDispatcher,
RegionWriter, backpressure, shutdown lifecycle)
- manifest.rs: use tracing::{info, warn} (manifest store operations)
- memtable/flush.rs: use tracing::info (flush lifecycle)
- memtable/scanner/builder.rs: log::warn → tracing::warn (8 sites)
- index.rs: log::warn → tracing::warn (test utilities)
- docs: update observability guide with new write/commit targets
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- list()/list_with_offset(): only increment read_iops instead of recording fake reads with zero latency/bytes - Rename backpressure_events → backpressure_warnings to reflect debounced warning count, not actual activation count - Rename elapsed_ms → elapsed_ms_sum to clarify cumulative worker time across parallel compaction tasks; add serde alias for compat - HNSW build: use thread-local counters flushed every 10k to avoid contended global atomic in inner parallel loop Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
list() and list_with_offset() streams are now wrapped with inspect() to increment read_errors on Err items. Previously list-path failures were invisible to IoStats, contradicting the documented "Failed read operations" semantics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- test_list_stream_errors_increment_read_errors: normal list, no errors - test_list_error_items_counted_as_read_errors: ErrorInjectingStore mock exercises IoTrackingStore.list() end-to-end with 3 injected Err items - test_list_with_offset_error_items_counted: list_with_offset IOPS count Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove buffer-level tracing (too low-level, overhead concern) - Replace decode timing event with debug_span (reviewer suggestion) - Remove FragmentDistributionStats (already available via public API) - Add HNSW build completion log after par_iter (final batch unreported) - Clarify latency docs include async scheduling jitter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
58f0493 to
773cdcc
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Motivation
Lance had fragmented observability: mixed
log::/tracing::usage, no structured in-memory stats for IO latency, cache hit rates, or scheduler behavior, and critical paths (write, commit retry, compaction, index build) were unobservable.Solution
Three-layer instrumentation (IO, Encoding, Dataset) using a consistent pattern:
Arc<AtomicU64>+Ordering::Relaxed, ~2ns amortized)snapshot()for point-in-time readstracingevents withtarget:fields for subscriber filtering#[non_exhaustive]on new public structs for forward compatibilityNo direct OpenTelemetry dependency — users bring their own
tracingsubscriber.Crates modified
lance-coreCacheHitMiss,CacheStats, eviction trackinglance-ioIoStats(latency, errors),ScanStats(coalesce/split/backpressure),RetryStats, connection reset tracing,list()error trackinglance-encodinglance-indexMetricsCollectorextension, HNSW build progress,MetricsSnapshotlanceCompactionMetrics(bytes, timing),FragmentDistributionStats, commit tracing, write path tracing, MemWALlog::->tracing::migrationNew stats structs
IoStatsIOTracker::stats()ScanStatsScanScheduler::stats()CacheHitMissLanceCache::cache_hit_miss()hit_ratio()RetryStatsCloudObjectReader::retry_stats()CompactionMetricscompact_files()returnFragmentDistributionStatsDataset::fragment_distribution_stats()MetricsSnapshotLocalMetricsCollector::snapshot()Backward compatibility
#[serde(default)]on new fields,#[serde(alias)]for renamesMetricsCollectortrait extensions#[non_exhaustive]on all new public structsTest plan
RetryStats,CompactionMetricsAddAssign,IoTrackingStorelist() error counting🤖 Generated with Claude Code