refactor(mem_wal): redesign FTS mem index for single-writer multi-reader #6726

Open
touch-of-grey wants to merge 5 commits into lance-format:main from touch-of-grey:MemTableFTSBetter

@touch-of-grey touch-of-grey commented May 10, 2026

Summary

Replaces the SkipMap<(token, row_position)> postings layout in
rust/lance/src/dataset/mem_wal/index/fts.rs with per-term
ArcSwap<TermSlice> slices and an ArcSwap<Snapshot>-published
per-batch visibility watermark. Readers grab the snapshot, walk per-term
chunks filtered by chunk.batch_position < snapshot.visible_count, and
score with snapshot-coupled BM25 stats.

The writer publishes the snapshot only after every term chunk for a
batch is linked, so readers never observe a partial document or BM25
stats out of sync with the postings — per-batch monotonic visibility.
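
The link-then-publish ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `ToyIndex`, the field names, and the helper methods are hypothetical, and `RwLock<Arc<_>>` from the standard library stands in for `ArcSwap` so the sketch has no external dependencies (the PR's design swaps the pointer without a lock).

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Postings one batch contributed for one term (hypothetical shape).
struct TermChunk {
    batch_position: usize,
    row_ids: Vec<u64>,
}

/// Immutable view published by the writer (hypothetical fields).
#[derive(Default)]
struct Snapshot {
    visible_count: usize, // chunks with batch_position < visible_count are visible
    doc_count: u64,       // BM25 `n`, kept consistent with visible_count
    total_tokens: u64,    // numerator of BM25 `avgdl`
}

#[derive(Default)]
struct ToyIndex {
    // Stand-in for the per-term ArcSwap<TermSlice>: chunks are append-only.
    terms: RwLock<HashMap<String, Vec<TermChunk>>>,
    // Stand-in for ArcSwap<Snapshot>.
    snapshot: RwLock<Arc<Snapshot>>,
}

impl ToyIndex {
    /// Reader entry point: grab the current snapshot once.
    fn load(&self) -> Arc<Snapshot> {
        self.snapshot.read().unwrap().clone()
    }

    /// Writer step 1: link every term chunk of the batch. Nothing is visible yet.
    fn link_chunks(&self, batch_position: usize, docs: &[(u64, &str)]) {
        let mut per_term: HashMap<String, Vec<u64>> = HashMap::new();
        for (row_id, text) in docs {
            for tok in text.split_whitespace() {
                let ids = per_term.entry(tok.to_lowercase()).or_default();
                if ids.last() != Some(row_id) {
                    ids.push(*row_id);
                }
            }
        }
        let mut terms = self.terms.write().unwrap();
        for (tok, row_ids) in per_term {
            terms.entry(tok).or_default().push(TermChunk { batch_position, row_ids });
        }
    }

    /// Writer step 2: publish the watermark and the stats together.
    fn publish(&self, batch_position: usize, docs: &[(u64, &str)]) {
        let prev = self.load();
        let added: u64 = docs.iter().map(|(_, t)| t.split_whitespace().count() as u64).sum();
        let next = Snapshot {
            visible_count: batch_position + 1,
            doc_count: prev.doc_count + docs.len() as u64,
            total_tokens: prev.total_tokens + added,
        };
        *self.snapshot.write().unwrap() = Arc::new(next);
    }

    /// Reader: walk a term's chunks, skipping anything past the watermark.
    fn postings(&self, snap: &Snapshot, term: &str) -> Vec<u64> {
        let terms = self.terms.read().unwrap();
        let out = match terms.get(term) {
            Some(chunks) => chunks
                .iter()
                .filter(|c| c.batch_position < snap.visible_count)
                .flat_map(|c| c.row_ids.iter().copied())
                .collect(),
            None => Vec::new(),
        };
        out
    }
}
```

A reader that loaded its snapshot before `publish` keeps seeing the old watermark and the old stats, even though the new chunks are already linked.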

Also in this PR:

  • Tokenize-time tokenizer ownership moves out of a single shared Mutex
    into a small reader pool plus a writer-dedicated slot, so search
    calls no longer serialize against the writer.
  • Utf8View is added to the supported text types alongside Utf8 /
    LargeUtf8; non-string columns now produce a clear error instead of
    being silently skipped.
  • FtsMemIndex::memory_usage() is added for size-based flush triggers.
  • Compound queries (Boolean, Boost) snapshot once at search_query
    and thread the same Arc<Snapshot> through every leaf, so a
    compound query never mixes BM25 stats from different snapshots.
  • The flush path keeps using lance_index::scalar::inverted types
    (TokenSet / DocSet / PostingListBuilder); the on-disk format is
    unchanged. The stale "flush is not yet implemented" header doc is
    corrected.
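
As an illustration of the first bullet, here is a minimal sketch of a reader pool plus a writer-dedicated slot. All names (`Tokenizer`, `TokenizerPool`, `max_idle`) are hypothetical and the whitespace tokenizer is a placeholder; the point is only that searches and the writer never wait on the same lock while tokenizing.

```rust
use std::sync::Mutex;

/// Hypothetical stateful tokenizer: `tokenize` needs `&mut self`,
/// which is why sharing one instance would force a single lock.
struct Tokenizer {
    scratch: String,
}

impl Tokenizer {
    fn new() -> Self {
        Tokenizer { scratch: String::new() }
    }

    fn tokenize(&mut self, text: &str) -> Vec<String> {
        let mut out = Vec::new();
        for word in text.split_whitespace() {
            self.scratch.clear();
            self.scratch.extend(word.chars().flat_map(|c| c.to_lowercase()));
            out.push(self.scratch.clone());
        }
        out
    }
}

struct TokenizerPool {
    writer: Mutex<Tokenizer>,       // only the single writer locks this slot
    readers: Mutex<Vec<Tokenizer>>, // idle tokenizers for search calls
    max_idle: usize,
}

impl TokenizerPool {
    fn new(max_idle: usize) -> Self {
        TokenizerPool {
            writer: Mutex::new(Tokenizer::new()),
            readers: Mutex::new((0..max_idle).map(|_| Tokenizer::new()).collect()),
            max_idle,
        }
    }

    fn tokenize_for_write(&self, text: &str) -> Vec<String> {
        let mut tok = self.writer.lock().unwrap();
        let out = tok.tokenize(text);
        out
    }

    fn tokenize_for_search(&self, text: &str) -> Vec<String> {
        // Hold the pool lock only long enough to take a tokenizer out.
        let pooled = self.readers.lock().unwrap().pop();
        let mut tok = pooled.unwrap_or_else(Tokenizer::new);
        let out = tok.tokenize(text); // tokenizing happens outside any lock
        let mut idle = self.readers.lock().unwrap();
        if idle.len() < self.max_idle {
            idle.push(tok); // return it; surplus tokenizers are dropped
        }
        out
    }

    fn idle_readers(&self) -> usize {
        self.readers.lock().unwrap().len()
    }
}
```

When the pool is empty, this sketch builds a temporary tokenizer instead of blocking; whether the PR blocks, grows, or falls back in that case is not stated in the description.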

Public API (FtsMemIndex constructors, FtsQueryExpr, SearchOptions,
BooleanQueryBuilder, FtsIndexConfig, to_index_builder_reversed)
is preserved. No callers of the removed internal FtsKey /
PostingValue types existed outside the file.

cc @jackye1995 for review.

Replace the SkipMap<(token, row_position)> postings layout with per-term
ArcSwap<TermSlice> slices and an ArcSwap<Snapshot>-published per-batch
visibility watermark. Readers grab a snapshot, walk per-term chunks
filtered by chunk.batch_position < snapshot.visible_count, and score
with snapshot-coupled BM25 stats. Writer publishes the snapshot only
after every term chunk for a batch is linked, so readers never observe
a partial document or BM25 stats out of sync with the postings.

Also: tokenize-time tokenizer ownership moves out of a shared Mutex
into a tokenizer pool plus a writer-dedicated slot, so search calls do
not serialize against the writer; add Utf8View to the supported text
types; add memory_usage() for size-based flush triggers; reuse
lance-index InvertedIndex's TokenSet/DocSet/PostingListBuilder for the
flush path.

Boolean and Boost queries called search_query recursively for each
sub-clause, and every leaf (search / search_phrase / search_fuzzy)
took its own self.snapshot.load_full(). A writer publishing between
sub-queries left the compound result mixing BM25 stats from different
snapshots — n, avgdl, and df disagreed across leaves of the same
query, so the summed score wasn't valid for any point in time.

Snapshot once at search_query / search_with_options and pass the
Arc<Snapshot> through every leaf and every recursive call. Public
search* entry points keep their signatures and snapshot internally.
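
The snapshot-once pattern can be sketched like this. It is a simplified illustration, not the PR's code: `Query`, `ToyIndex`, and `eval` are hypothetical, `RwLock<Arc<_>>` stands in for `ArcSwap`, and the score is reduced to a common BM25-style IDF term so that only `n` and `df` matter.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical snapshot carrying the BM25 statistics.
struct Snapshot {
    doc_count: f32,                 // BM25 `n`
    doc_freq: HashMap<String, f32>, // BM25 `df` per term
}

/// Hypothetical, reduced query tree.
enum Query {
    Match(String),
    Boost { inner: Box<Query>, factor: f32 },
    Boolean { should: Vec<Query> },
}

struct ToyIndex {
    snapshot: RwLock<Arc<Snapshot>>, // stand-in for ArcSwap<Snapshot>
}

impl ToyIndex {
    fn new(initial: Snapshot) -> Self {
        ToyIndex { snapshot: RwLock::new(Arc::new(initial)) }
    }

    fn load(&self) -> Arc<Snapshot> {
        self.snapshot.read().unwrap().clone()
    }

    fn publish(&self, next: Snapshot) {
        *self.snapshot.write().unwrap() = Arc::new(next);
    }

    /// Public entry point: load the snapshot exactly once per query.
    fn search_query(&self, q: &Query) -> f32 {
        let snap = self.load();
        Self::eval(&snap, q)
    }

    /// Every leaf and every recursive call reuses the caller's snapshot,
    /// so all clauses see the same `n` and `df`.
    fn eval(snap: &Arc<Snapshot>, q: &Query) -> f32 {
        match q {
            Query::Match(term) => {
                let df = snap.doc_freq.get(term).copied().unwrap_or(0.0);
                // IDF only; term frequency and length normalization are omitted.
                ((snap.doc_count - df + 0.5) / (df + 0.5) + 1.0).ln()
            }
            Query::Boost { inner, factor } => factor * Self::eval(snap, inner),
            Query::Boolean { should } => {
                should.iter().map(|c| Self::eval(snap, c)).sum::<f32>()
            }
        }
    }
}
```

Because `eval` never reloads, a writer calling `publish` halfway through a compound query cannot change the statistics that the remaining clauses are scored with.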

Also: filter entry_count() by visibility; correct stale doc-comments
on Snapshot::batches, TermChunk::positions, and Snapshot::batch_for;
hoist test-body 'use std::sync::Arc;' to the test module's top.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

New rust/lance/benches/mem_wal_fineweb_fts.rs covers three metrics
across the 12 configs in the design doc:

- write throughput at memtable sizes 100k / 500k / 1M
- MemTable FTS query latency (avg/p50/p95) over 100 high-frequency
  tokens + 50 sampled phrases
- consistency: |memtable_top10 ∩ post_flush_disk_top10| / |union| as a
  user-approved replacement for recall@k
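
The consistency metric in the last bullet is the Jaccard index of the two top-10 result sets. A minimal sketch, assuming results are identified by `u64` row ids and that two empty result lists count as full agreement (both are assumptions; the function name is hypothetical):

```rust
use std::collections::HashSet;

/// |memtable_top ∩ disk_top| / |memtable_top ∪ disk_top|, ignoring rank order.
fn top_k_consistency(memtable_top: &[u64], disk_top: &[u64]) -> f64 {
    let a: HashSet<u64> = memtable_top.iter().copied().collect();
    let b: HashSet<u64> = disk_top.iter().copied().collect();
    let union = a.union(&b).count();
    if union == 0 {
        // Assumption: two empty result lists agree perfectly.
        return 1.0;
    }
    a.intersection(&b).count() as f64 / union as f64
}
```

Unlike recall@k, this needs no ground-truth ranking: it only measures whether the MemTable read path and the flushed on-disk index return the same documents for the same query.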

The bench downloads HuggingFaceFW/fineweb sample/10BT shards, caches
them, and is fully env-driven so a single binary handles every config.
Driver script bench/run_fineweb_fts.sh loops the 12 configs, uploads
each result.json to S3, and prints a summary.

Also: make `dataset::mem_wal::index` public so the bench can call
`FtsMemIndex::search_with_options` directly to time the MemTable read
path.

codecov Bot commented May 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

The previous spin on max_indexed_batch_position never terminated when
max_memtable_rows triggers auto-flushes during ingest: the counter is
reset on each new active memtable, so target_batch_pos = total_batches
- 1 is unreachable from the active generation.

Close the writer instead — it drains the final WAL flush and any
outstanding memtable flush; the inline sync_indexed_write covers the
per-put index updates. close() time is included in the measured
elapsed so configs with different flush cadences are compared
apples-to-apples.