refactor(mem_wal): redesign FTS mem index for single-writer multi-reader by touch-of-grey · Pull Request #6726 · lance-format/lance

touch-of-grey · 2026-05-10T15:42:57Z

Summary

Replaces the SkipMap<(token, row_position)> postings layout in
rust/lance/src/dataset/mem_wal/index/fts.rs with per-term
ArcSwap<TermSlice> slices and an ArcSwap<Snapshot>-published
per-batch visibility watermark. Readers grab the snapshot, walk per-term
chunks filtered by chunk.batch_position < snapshot.visible_count, and
score with snapshot-coupled BM25 stats.

The writer publishes the snapshot only after every term chunk for a
batch is linked, so readers never observe a partial document or BM25
stats out of sync with the postings — per-batch monotonic visibility.
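The publish-last ordering described above can be sketched with the standard library alone. This is an illustrative model, not the PR's code: `RwLock<Arc<_>>` stands in for `ArcSwap`, and the `Chunk`, `Snapshot`, and `MemIndex` shapes, the whitespace tokenizer, and the `total_docs` field are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Postings contributed to one term by one batch (hypothetical shape).
#[derive(Clone, Debug)]
struct Chunk {
    batch_position: usize,
    row_ids: Vec<u64>,
}

/// Immutable view published by the writer: watermark and stats travel together.
#[derive(Debug, Default)]
struct Snapshot {
    visible_count: usize, // batches with position < visible_count are visible
    total_docs: u64,      // stands in for the snapshot-coupled BM25 stats
}

#[derive(Default)]
struct MemIndex {
    // Stand-in for per-term ArcSwap<TermSlice>: each term holds an immutable chunk list.
    terms: RwLock<HashMap<String, Arc<Vec<Chunk>>>>,
    // Stand-in for ArcSwap<Snapshot>.
    snapshot: RwLock<Arc<Snapshot>>,
}

impl MemIndex {
    /// Single writer: link every term chunk first, publish the snapshot last.
    fn insert_batch(&self, docs: &[(u64, &str)]) {
        let prev = self.snapshot.read().unwrap().clone();
        let batch_position = prev.visible_count;

        // Group this batch's row ids by token (naive whitespace tokenizer).
        let mut per_term: HashMap<String, Vec<u64>> = HashMap::new();
        for (row_id, text) in docs {
            for tok in text.split_whitespace() {
                let rows = per_term.entry(tok.to_lowercase()).or_default();
                if rows.last() != Some(row_id) {
                    rows.push(*row_id);
                }
            }
        }

        // Step 1: link one chunk per term. Readers holding the old snapshot
        // may already see these chunks but will filter them out.
        {
            let mut terms = self.terms.write().unwrap();
            for (tok, row_ids) in per_term {
                let slot = terms.entry(tok).or_default();
                let mut chunks = (**slot).clone();
                chunks.push(Chunk { batch_position, row_ids });
                *slot = Arc::new(chunks);
            }
        }

        // Step 2: publish the new watermark and stats in one swap.
        *self.snapshot.write().unwrap() = Arc::new(Snapshot {
            visible_count: batch_position + 1,
            total_docs: prev.total_docs + docs.len() as u64,
        });
    }

    fn snapshot(&self) -> Arc<Snapshot> {
        self.snapshot.read().unwrap().clone()
    }

    /// Reader: walk the term's chunks, keeping only batches below the watermark.
    fn search(&self, snap: &Snapshot, term: &str) -> Vec<u64> {
        let slice = self.terms.read().unwrap().get(term).cloned();
        slice
            .map(|chunks| {
                chunks
                    .iter()
                    .filter(|c| c.batch_position < snap.visible_count)
                    .flat_map(|c| c.row_ids.iter().copied())
                    .collect::<Vec<u64>>()
            })
            .unwrap_or_default()
    }
}
```

A reader that took its snapshot before a later batch was published keeps seeing exactly the earlier batches and the matching document count, even though the newer chunks are already linked into the term slices.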

Also in this PR:

- Tokenize-time tokenizer ownership moves out of a single shared Mutex
  into a small reader pool plus a writer-dedicated slot, so search
  calls no longer serialize against the writer.
- Utf8View is added to the supported text types alongside Utf8 /
  LargeUtf8; non-string columns now produce a clear error instead of
  being silently skipped.
- FtsMemIndex::memory_usage() is added for size-based flush triggers.
- Compound queries (Boolean, Boost) snapshot once at search_query
  and thread the same Arc<Snapshot> through every leaf, so a
  compound query never mixes BM25 stats from different snapshots.
- The flush path keeps using lance_index::scalar::inverted types
  (TokenSet / DocSet / PostingListBuilder); the on-disk format is
  unchanged. The stale "flush is not yet implemented" header doc is
  corrected.
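As a rough illustration of the tokenizer ownership change in the list above, the sketch below keeps one slot for the writer and a small pool for search calls. `Tokenizer`, `TokenizerSlots`, and the method names are hypothetical, and the fallback of blocking on the first slot when the pool is busy is an assumption; the PR's actual pool policy is not shown in this page.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for a stateful tokenizer that needs `&mut self`.
struct Tokenizer {
    calls: usize,
}

impl Tokenizer {
    fn tokenize(&mut self, text: &str) -> Vec<String> {
        self.calls += 1;
        text.split_whitespace().map(|t| t.to_lowercase()).collect()
    }
}

/// One slot reserved for the single writer, plus a small pool for readers.
struct TokenizerSlots {
    writer: Mutex<Tokenizer>,
    readers: Vec<Mutex<Tokenizer>>,
}

impl TokenizerSlots {
    fn new(pool_size: usize) -> Self {
        Self {
            writer: Mutex::new(Tokenizer { calls: 0 }),
            // Always keep at least one reader slot.
            readers: (0..pool_size.max(1))
                .map(|_| Mutex::new(Tokenizer { calls: 0 }))
                .collect(),
        }
    }

    /// Ingest path: only the writer touches this slot, so searches never wait on it.
    fn tokenize_for_write(&self, text: &str) -> Vec<String> {
        self.writer.lock().unwrap().tokenize(text)
    }

    /// Search path: take any idle pool slot; if all are busy, wait on the first.
    fn tokenize_for_search(&self, text: &str) -> Vec<String> {
        for slot in &self.readers {
            if let Ok(mut tokenizer) = slot.try_lock() {
                return tokenizer.tokenize(text);
            }
        }
        self.readers[0].lock().unwrap().tokenize(text)
    }
}
```

With this split, a search still completes while the writer slot is held, which is the property the change is after.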

Public API (FtsMemIndex constructors, FtsQueryExpr, SearchOptions,
BooleanQueryBuilder, FtsIndexConfig, to_index_builder_reversed)
is preserved. No callers of the removed internal FtsKey /
PostingValue types existed outside the file.

cc @jackye1995 for review.

Replace the SkipMap<(token, row_position)> postings layout with per-term ArcSwap<TermSlice> slices and an ArcSwap<Snapshot>-published per-batch visibility watermark. Readers grab a snapshot, walk per-term chunks filtered by chunk.batch_position < snapshot.visible_count, and score with snapshot-coupled BM25 stats. Writer publishes the snapshot only after every term chunk for a batch is linked, so readers never observe a partial document or BM25 stats out of sync with the postings.

Also: tokenize-time tokenizer ownership moves out of a shared Mutex into a tokenizer pool plus a writer-dedicated slot, so search calls do not serialize against the writer; add Utf8View to the supported text types; add memory_usage() for size-based flush triggers; reuse lance-index InvertedIndex's TokenSet/DocSet/PostingListBuilder for the flush path.

Boolean and Boost queries called search_query recursively for each sub-clause, and every leaf (search / search_phrase / search_fuzzy) took its own self.snapshot.load_full(). A writer publishing between sub-queries left the compound result mixing BM25 stats from different snapshots — n, avgdl, and df disagreed across leaves of the same query, so the summed score wasn't valid for any point in time. Snapshot once at search_query / search_with_options and pass the Arc<Snapshot> through every leaf and every recursive call. Public search* entry points keep their signatures and snapshot internally. Also: filter entry_count() by visibility; correct stale doc-comments on Snapshot::batches, TermChunk::positions, and Snapshot::batch_for; hoist test-body 'use std::sync::Arc;' to the test module's top.
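The fix described above amounts to loading the snapshot once at the public entry point and passing it down the recursion. The sketch below models only that threading: `Query`, `Index`, and `score_with` are hypothetical names, `RwLock<Arc<_>>` stands in for `ArcSwap`, and the leaf score is reduced to a common BM25 IDF form (term frequency and avgdl are left out), which may differ from the variant the index actually uses.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Corpus stats frozen at one point in time (hypothetical shape).
struct Snapshot {
    n: f64,                   // number of visible documents
    df: HashMap<String, f64>, // document frequency per term
}

impl Snapshot {
    /// A common BM25 IDF form; assumed here for illustration only.
    fn idf(&self, term: &str) -> f64 {
        let df = self.df.get(term).copied().unwrap_or(0.0);
        (1.0 + (self.n - df + 0.5) / (df + 0.5)).ln()
    }
}

enum Query {
    Match(String),
    Boost { inner: Box<Query>, factor: f64 },
    Boolean(Vec<Query>),
}

struct Index {
    // Stand-in for ArcSwap<Snapshot>.
    snapshot: RwLock<Arc<Snapshot>>,
}

impl Index {
    /// Writer side: publish a new snapshot.
    fn publish(&self, snap: Snapshot) {
        *self.snapshot.write().unwrap() = Arc::new(snap);
    }

    /// Public entry point: load the snapshot exactly once.
    fn search_query(&self, query: &Query) -> f64 {
        let snap = self.snapshot.read().unwrap().clone();
        Self::score_with(&snap, query)
    }

    /// Internal recursion: every leaf and sub-clause reuses the caller's snapshot.
    fn score_with(snap: &Arc<Snapshot>, query: &Query) -> f64 {
        match query {
            Query::Match(term) => snap.idf(term),
            Query::Boost { inner, factor } => factor * Self::score_with(snap, inner),
            Query::Boolean(clauses) => {
                clauses.iter().map(|c| Self::score_with(snap, c)).sum()
            }
        }
    }
}
```

Because `score_with` never reloads from the index, a `publish` that lands between two sub-clauses cannot change `n` or `df` for the query already in flight.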

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

New rust/lance/benches/mem_wal_fineweb_fts.rs covers three metrics across the 12 configs in the design doc:
- write throughput at memtable sizes 100k / 500k / 1M
- MemTable FTS query latency (avg/p50/p95) over 100 high-frequency tokens + 50 sampled phrases
- consistency: |memtable_top10 ∩ post_flush_disk_top10| / |union| as a user-approved replacement for recall@k

The bench downloads HuggingFaceFW/fineweb sample/10BT shards, caches them, and is fully env-driven so a single binary handles every config. Driver script bench/run_fineweb_fts.sh loops the 12 configs, uploads each result.json to S3, and prints a summary.

Also: make `dataset::mem_wal::index` public so the bench can call `FtsMemIndex::search_with_options` directly to time the MemTable read path.
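The consistency metric above is a Jaccard overlap of two top-10 result sets. A minimal sketch follows; the function name is hypothetical, and returning 1.0 when both lists are empty is an assumption, since the bench's handling of that case is not described here.

```rust
use std::collections::HashSet;

/// |memtable_top ∩ disk_top| / |memtable_top ∪ disk_top| over row ids.
/// Assumption: two empty result lists count as fully consistent (1.0).
fn topk_consistency(memtable_top: &[u64], disk_top: &[u64]) -> f64 {
    let a: HashSet<u64> = memtable_top.iter().copied().collect();
    let b: HashSet<u64> = disk_top.iter().copied().collect();
    let union = a.union(&b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(&b).count() as f64 / union as f64
}
```

Unlike recall@k, this ratio is symmetric: a document missing from either side lowers the score, and ranking order within the top 10 is ignored.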

codecov · 2026-05-10T18:47:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

The previous spin on max_indexed_batch_position never terminated when max_memtable_rows triggers auto-flushes during ingest: the counter is reset on each new active memtable, so target_batch_pos = total_batches - 1 is unreachable from the active generation. Close the writer instead — it drains the final WAL flush and any outstanding memtable flush; the inline sync_indexed_write covers the per-put index updates. close() time is included in the measured elapsed so configs with different flush cadences are compared apples-to-apples.
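A toy model shows why the old wait condition could not be met. It assumes, for illustration only, that an auto-flush happens after a fixed number of batches (`batches_per_memtable` is a hypothetical stand-in for the effect of max_memtable_rows) and that the per-memtable position counter restarts at 0 after each flush.

```rust
/// Highest batch position the active memtable's counter ever reports
/// when it restarts at 0 after every auto-flush (simplified model).
fn max_observed_position(total_batches: usize, batches_per_memtable: usize) -> usize {
    let mut max_seen = 0;
    let mut position_in_active = 0;
    for _ in 0..total_batches {
        max_seen = max_seen.max(position_in_active);
        position_in_active += 1;
        if position_in_active == batches_per_memtable {
            // Auto-flush: a new active memtable starts counting from 0.
            position_in_active = 0;
        }
    }
    max_seen
}
```

In this model, 10 batches with a flush every 4 batches never report a position above 3, so a spin waiting for `total_batches - 1 = 9` cannot finish; without any flush the counter does reach 9.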

touch-of-grey added 3 commits May 10, 2026 01:33

build: refresh python/Cargo.lock for arc-swap dependency

bbdf8d3

claude Bot reviewed May 10, 2026

github-actions Bot added the python label May 10, 2026

touch-of-grey wants to merge 5 commits into lance-format:main from touch-of-grey:MemTableFTSBetter

touch-of-grey commented May 10, 2026 • edited by jackye1995
