Skip to content

feat: add DocsFilter#371

Open
forshev wants to merge 1 commit intomainfrom
290-docs-filter
Open

feat: add DocsFilter#371
forshev wants to merge 1 commit intomainfrom
290-docs-filter

Conversation

@forshev
Copy link
Copy Markdown
Collaborator

@forshev forshev commented Mar 5, 2026

Description

Added DocsFilter struct. Not in use yet


  • I have read and followed all requirements in CONTRIBUTING.md;
  • I used LLM/AI assistance to make this pull request;

If you have used LLM/AI assistance please provide model name and full prompt:

Model: {{model-name}}
Prompt: {{prompt}}

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 32.35908% with 324 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.23%. Comparing base (1acfc59) to head (98257ad).

Files with missing lines Patch % Lines
docsfilter/docs_filter.go 0.00% 206 Missing ⚠️
docsfilter/encoding.go 78.80% 17 Missing and 15 partials ⚠️
util/fs.go 38.77% 17 Missing and 13 partials ⚠️
docsfilter/filter.go 0.00% 16 Missing ⚠️
asyncsearcher/async_searcher.go 22.22% 2 Missing and 5 partials ⚠️
frac/active.go 25.00% 6 Missing ⚠️
frac/active_index.go 0.00% 6 Missing ⚠️
frac/remote.go 0.00% 6 Missing ⚠️
frac/active_lids_map.go 66.66% 5 Missing ⚠️
frac/sealed.go 0.00% 4 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #371      +/-   ##
==========================================
- Coverage   71.29%   70.23%   -1.06%     
==========================================
  Files         206      210       +4     
  Lines       15156    15581     +425     
==========================================
+ Hits        10805    10943     +138     
- Misses       3565     3837     +272     
- Partials      786      801      +15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 5, 2026

🔴 Performance Degradation

Some benchmarks have degraded compared to the previous run.
Click on Show table button to see full list of degraded benchmarks.

Show table
Name Previous Current Ratio Verdict
AggDeep/size=10000-4 ad0347 e19aad
47818.00 ns/op 54318.00 ns/op 1.14 🔴
AggWide/size=1000-4 ad0347 e19aad
4780.00 ns/op 5615.00 ns/op 1.17 🔴
And/size=1000-4 ad0347 e19aad
4.49 ns/op 5.14 ns/op 1.14 🔴

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 6, 2026

🔴 Performance Degradation

Some benchmarks have degraded compared to the previous run.
Click on Show table button to see full list of degraded benchmarks.

Show table
Name Previous Current Ratio Verdict
AggDeep/size=1000-4 ad0347 6ba4f4
4737.00 ns/op 5455.00 ns/op 1.15 🔴
AggWide/size=1000-4 ad0347 6ba4f4
4780.00 ns/op 5329.00 ns/op 1.11 🔴
FindSequence_Random/medium-4 ad0347 6ba4f4
12292.95 MB/s 9431.80 MB/s 0.77 🔴
86.31 ns/op 108.60 ns/op 1.26 🔴

@eguguchkin eguguchkin added docs_filter feature New feature or request labels Mar 13, 2026
@eguguchkin eguguchkin added this to the v0.70.0 milestone Mar 16, 2026
@eguguchkin eguguchkin requested review from cheb0 and eguguchkin March 16, 2026 09:42
Filtering Filter `config:"filtering"`
DocsFilter struct {
DataDir string `config:"data_dir"`
Concurrency int `config:"concurrency"`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It's not possible to understand what concurrency means without reading code.

Besides, other props are named like reader_workers, replay_workers, etc

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: filter_workers? workers?


for _, p := range params {
f := NewFilter(p)
filtersMap[string(f.Hash())] = f
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cast to string is not needed?


type Params struct {
Query string
From int64
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: when migrating to nanos, I learned the hard way that variables with name From/To/Interval and type i64 are source of potential errors since you don't know if it's millis or nanos or etc. Not to say about additional cognitive load you have when look at them.

So, if nothing prevents it, the faster you convert to seq.MID the better.

type lidsBlockHeader struct {
Codec lidsCodec
Length uint32 // Number of LIDs in block
MinLID uint32
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: haven't looked at other PR yet, but do we really need to export all fields here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we could avoid storing min here and instead calculate it as the max value from the previous block?

"github.com/ozontech/seq-db/seq"
)

type ActiveLIDs struct {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: file named active_lids_map.go

for i := range numberOfBlocks {
dst, src, err = unmarshalLIDsBlock(dst, src, headers[i])
if err != nil {
return dst, src, err
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I noticed you return other variables even if err is not nil. Do we have a reason to do this in Go?

CacheSizeLimit uint64
}

type DocsFilter struct {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would personally named it FilterManager (same like we have FracManager). I think it took me ~10 minutes of scrolling through code to realize there are Filter and DocsFilter structs. They are not the same and have different scope of responsibility.
I mean, this struct manages and updates filters, but it doesn't filter anything itself.

// RefreshFrac replaces frac's tombstone files with newly found results. Used after active frac is sealed.
func (df *DocsFilter) RefreshFrac(fraction frac.Fraction) {
df.fracsMu.RLock()
fracsFiles, has := df.fracs[fraction.Info().Name()]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's say I added a filter and started seq-db. Now, there might be two concurrent processes:

  • running and updating a filter for active frac
  • sealing same frac. Techincally, sealing is slow, but we don't know for sure. After sealing, I assume, this method RefreshFrac will be called and update filter again.

It seems, those processes do not have any synchronization between them. Problems will probably never arise, but, it seems, to me, we need some kind of way to serialize filter update for same frac (within same filter of course).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is actually a design issue in the current MVP. With the active faction in place, we’re facing a number of problems, and we’ll need to figure out how to address this technical debt.

config Config
filters map[string]*Filter

fracs map[string][]string
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you explain why we have a single collections of fracs across multiple Filter? It would be intuitive for me to have such kind of collection for every Filter.

diskUsage = promauto.NewGauge(prometheus.GaugeOpts{
Namespace: "seq_db_store",
Subsystem: "filters",
Name: "disk_usage_bytes",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I’m not mistaken, the Help field must be specified

return nil
}

// createDataDir creates dir data lazily to avoid creating extra folders.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we’re creating the directory immediately at startup — not lazily, and without concurrency. In this case, using sync.Once is unnecessary.

}
}

func MustWriteFileAtomic(fpath string, data []byte, tmpFileExt string) {
Copy link
Copy Markdown
Collaborator

@eguguchkin eguguchkin Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It appears that the atomicWrite function in the faction manager performs the same operations. If so, should we merge these functions?

}

// buildQueue creates a directory for each of unprocessed filters and creates .queue files
func (df *DocsFilter) buildQueue(fracs fracmanager.List) error {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This queue implementation is not fault‑tolerant: if the server crashes mid‑process, we’ll end up with an incomplete queue.

Let’s try the following approach:

  • write all data to a temporary directory within the same folder;
  • then atomically rename (with fsync of parent dir) the directory once the operation is complete.

return nil
}

func (df *DocsFilter) maintenance() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d use a context and a WaitGroup here to ensure all goroutines are properly terminated

See:

func startStatsWorker(ctx context.Context, reg *fractionRegistry, wg *sync.WaitGroup) {

defer lidsBlockBufPool.Put(b)

numberOfBlocks := (len(in) + maxLIDsBlockLen - 1) / maxLIDsBlockLen
dst = binary.BigEndian.AppendUint32(dst, uint32(numberOfBlocks))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically, we’ve been using Little‑Endian across the codebase. As a result, this file is the only one that uses a different encoding, which breaks consistency in the code

Length uint32 // Number of LIDs in block
MinLID uint32
MaxLID uint32
Size uint32 // Size of ids block in bytes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We could also store the size of the uncompressed block — this would allow more efficient memory allocation during decompression.


dst[0] = byte(h.Codec)
dst = dst[1:]
binary.BigEndian.PutUint32(dst, h.Length)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Using AppendUint32 will make the code more compact

type lidsBlockHeader struct {
Codec lidsCodec
Length uint32 // Number of LIDs in block
MinLID uint32
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we could avoid storing min here and instead calculate it as the max value from the previous block?

// RefreshFrac replaces frac's tombstone files with newly found results. Used after active frac is sealed.
func (df *DocsFilter) RefreshFrac(fraction frac.Fraction) {
df.fracsMu.RLock()
fracsFiles, has := df.fracs[fraction.Info().Name()]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is actually a design issue in the current MVP. With the active faction in place, we’re facing a number of problems, and we’ll need to figure out how to address this technical debt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs_filter feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants