feat(cms): add Count-Min Sketch data structure#3405
feat(cms): add Count-Min Sketch data structure#3405Tangruilin wants to merge 1 commit intoapache:unstablefrom
Conversation
ce091bf to
71b0ba8
Compare
71b0ba8 to
a80940f
Compare
|
I noticed that operations like ZINCRBY (ZSet) and HINCRBY (Hash) don't use any locks, even though they are read-modify-write operations. From my testing with concurrent unit tests, I Questions about Kvrocks threading model:
Current locking status across types: | Type | INCRBY has lock | Notes | I've included concurrent tests for CMS in tests/gocase/unit/type/cms/cms_test.go to verify correctness. Would appreciate clarification on whether locks are needed for CMS operations. |
6808595 to
4b0fa17
Compare
|
AI was used in this patch for specific code implementation and approach analysis. The main design was evaluated by humans, discussion refer:#3404 |
tests/cppunit/types/cms_test.cc
Outdated
| } | ||
|
|
||
| void TearDown() override { | ||
| TestBase::SetUp(); |
There was a problem hiding this comment.
error. should be TearDown
|
|
||
| /// CMS.INCRBY - Increment counters for given items | ||
| /// Usage: CMS.INCRBY key item increment [item increment ...] | ||
| rocksdb::Status CMS::IncrBy(engine::Context &ctx, const Slice &key, |
There was a problem hiding this comment.
I assessed that a lock is needed here. However, I ran concurrent tests and also reviewed Kvrocks' HSET, HASH and other modules. I found that their write operations don't use locks. Is this by design? Is concurrency handled at some other level?
any suggestion? @PragmaTwice @git-hulk
There was a problem hiding this comment.
@Tangruilin Thanks for your PR. I will take a look at your proposal first, you could read Guidelines for AI-assisted Contributions.
There was a problem hiding this comment.
There was a problem hiding this comment.
Hi @git-hulk,
Thank you for taking the time to review this PR.
For this patch, you could focus on reviewing the proposal-related parts first. I'll address any feedback you have on the design and implementation approach.
Regarding the concurrency concern I mentioned earlier — I noticed that commands like HSET, HINCRBY, and ZINCRBY in Kvrocks don't use explicit locks, even though they involve
read-modify-write operations. Is this by design? Is thread safety handled at a different level (e.g., command execution model)?
src/types/redis_cms.cc
Outdated
| // Formula from Count-Min Sketch paper: | ||
| // width = ceil(e / error_rate) | ||
| // depth = ceil(ln(1 - probability) / ln(1 - 1/e)) | ||
| auto width = static_cast<uint32_t>(std::ceil(2.718281828 / error_rate)); |
There was a problem hiding this comment.
use std::numbers::e
| return static_cast<uint32_t>(hash % width); | ||
| } | ||
|
|
||
| std::pair<uint32_t, uint32_t> CMS::calcDimFromProb(double error_rate, double probability) { |
There was a problem hiding this comment.
probability use success probability here.
should be fail_probability
refer to: https://redis.io/docs/latest/commands/cms.initbyprob/
src/types/redis_cms.cc
Outdated
| // Formula from Count-Min Sketch paper: | ||
| // width = ceil(e / error_rate) | ||
| // depth = ceil(ln(1 - probability) / ln(1 - 1/e)) | ||
| auto width = static_cast<uint32_t>(std::ceil(2.718281828 / error_rate)); |
There was a problem hiding this comment.
RedisBloom uses ceil(2 / error) for the width calculation. I believe this is because C doesn't have a built-in constant for e (Euler's number). In our C++ implementation, using e would
be more mathematically accurate according to the original CMS paper.
Should we follow the paper's formula with e, or should we fully match Redis's implementation for compatibility?
Implement CMS (Count-Min Sketch) probabilistic data structure with the following commands: - CMS.INITBYDIM key width depth - CMS.INITBYPROB key error probability - CMS.INCRBY key item increment [...] - CMS.QUERY key item [...] - CMS.MERGE dest numkeys src... [WEIGHTS ...] - CMS.INFO key Storage design: - Metadata stored in Metadata CF - Count Matrix stored in PrimarySubkey CF (per-bucket storage) - Uses MurmurHash64 for layer-specific hashing Related: apache#2425
4b0fa17 to
c11c84e
Compare
|
Hi @Tangruilin Kvrocks' write commands have key-level locks at the framework level. |
Thanks for the review! I'll split this PR. The first PR will focus on the basic CMS and INIT commands only. I've marked this PR as draft for now. Once the split is complete, I'll change it back to ready for review. In the meantime, you could review the proposal first: #3404 |
Implement CMS (Count-Min Sketch) probabilistic data structure with the following commands:
Storage design:
Related: #2425