feat: add Git-style delta encoding for blob columns #6733

Open: beinan wants to merge 4 commits into lance-format:main from beinan:user/beinan/delta-blob-encoding

@beinan (Contributor) commented May 11, 2026

Summary

  • Implements binary delta encoding for LargeBinary blob columns, inspired by Git's packfile delta format
  • When consecutive rows contain similar content (e.g., successive versions of source files), only the differences are stored via copy/insert instructions
  • Builds on top of existing blob encoding infrastructure using external buffers — no changes to page structure
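
To make the copy/insert idea concrete, here is a minimal sketch of applying a delta in Git's documented packfile delta format (two size varints, then copy and insert opcodes). It is not the code from this PR: the names `DeltaError`, `read_size`, and `apply_delta` as written here, and the exact error variants, are assumptions for illustration.

```rust
/// Errors a malformed delta can produce (hypothetical names for this sketch).
#[derive(Debug, PartialEq)]
pub enum DeltaError {
    Truncated,
    SizeOverflow,
    ReservedOpcode,
    CopyOutOfBounds,
    SizeMismatch,
}

/// Read one little-endian base-128 size from the delta header.
fn read_size(delta: &[u8], pos: &mut usize) -> Result<usize, DeltaError> {
    let mut value = 0usize;
    let mut shift = 0u32;
    loop {
        let byte = *delta.get(*pos).ok_or(DeltaError::Truncated)?;
        *pos += 1;
        if shift >= usize::BITS {
            return Err(DeltaError::SizeOverflow);
        }
        value |= ((byte & 0x7f) as usize) << shift;
        if byte & 0x80 == 0 {
            return Ok(value);
        }
        shift += 7;
    }
}

/// Rebuild the target buffer from `base` and a Git-style `delta`.
pub fn apply_delta(base: &[u8], delta: &[u8]) -> Result<Vec<u8>, DeltaError> {
    let mut pos = 0usize;
    let base_size = read_size(delta, &mut pos)?;
    let target_size = read_size(delta, &mut pos)?;
    if base_size != base.len() {
        return Err(DeltaError::SizeMismatch);
    }
    let mut out = Vec::new();
    while pos < delta.len() {
        let cmd = delta[pos];
        pos += 1;
        if cmd & 0x80 != 0 {
            // Copy: bits 0..=3 flag which offset bytes follow,
            // bits 4..=6 flag which size bytes follow.
            let mut offset = 0usize;
            let mut size = 0usize;
            for i in 0..4u32 {
                if cmd & (1u8 << i) != 0 {
                    let b = *delta.get(pos).ok_or(DeltaError::Truncated)?;
                    pos += 1;
                    offset |= (b as usize) << (8 * i);
                }
            }
            for i in 0..3u32 {
                if cmd & (0x10u8 << i) != 0 {
                    let b = *delta.get(pos).ok_or(DeltaError::Truncated)?;
                    pos += 1;
                    size |= (b as usize) << (8 * i);
                }
            }
            if size == 0 {
                size = 0x10000; // Git treats an encoded size of 0 as 64 KiB
            }
            let end = offset.checked_add(size).ok_or(DeltaError::CopyOutOfBounds)?;
            let chunk = base.get(offset..end).ok_or(DeltaError::CopyOutOfBounds)?;
            out.extend_from_slice(chunk);
        } else if cmd != 0 {
            // Insert: the next `cmd` bytes are literal data.
            let end = pos + cmd as usize;
            let chunk = delta.get(pos..end).ok_or(DeltaError::Truncated)?;
            out.extend_from_slice(chunk);
            pos = end;
        } else {
            return Err(DeltaError::ReservedOpcode);
        }
    }
    if out.len() != target_size {
        return Err(DeltaError::SizeMismatch);
    }
    Ok(out)
}
```

For example, with base `hello world`, a delta that copies 6 bytes from offset 0, inserts `Rust `, and copies 5 bytes from offset 6 reconstructs `hello Rust world` while storing only 5 literal bytes.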

Design

  • Delta algorithm: A Rabin rolling hash (16-byte windows) indexes the base buffer; the target is scanned with the same hash to find matching regions. Matches of at least 4 bytes become copy instructions; unmatched regions become insert instructions. The opcode format is Git-compatible.
  • Delta groups: Consecutive values are grouped with configurable chain depth (default 4). First value stored as-is (base), subsequent values as chained deltas against the previous value. A new base is emitted when the chain depth is exceeded or when a delta is larger than the raw value.
  • Descriptor: Extended struct with (position, size, kind, base_offset). kind distinguishes DeltaBase vs Delta; base_offset stores the row distance back to the base.
  • Decoding: DeltaBlobPageScheduler expands requested ranges to include required bases, loads all data, applies delta chains, then returns only the requested rows.
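
The grouping and range-expansion rules above can be sketched as follows. This is an illustration under stated assumptions, not the PR's encoder: `Plan`, `plan_chain`, and `rows_to_load` are hypothetical names, nulls are ignored, delta sizes are taken as precomputed inputs, and "chain depth" is read as the maximum number of deltas that may follow a base. The `threshold` parameter is the delta-to-raw size ratio above which a new base is emitted (1.0 matches "larger than the raw value").

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    DeltaBase,
    Delta,
}

/// Per-row decision: what to store and how far back the base is.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Plan {
    pub kind: Kind,
    /// Row distance back to the base of this row's group (0 for a base).
    pub base_offset: usize,
}

/// `raw_sizes[i]` is the raw size of row i; `delta_sizes[i]` is the size of
/// the delta of row i against row i-1 (ignored for row 0).
pub fn plan_chain(
    raw_sizes: &[usize],
    delta_sizes: &[usize],
    max_chain_depth: usize,
    threshold: f64,
) -> Vec<Plan> {
    assert_eq!(raw_sizes.len(), delta_sizes.len());
    let mut plans = Vec::with_capacity(raw_sizes.len());
    let mut depth = 0usize; // deltas emitted since the current base
    for i in 0..raw_sizes.len() {
        let too_deep = depth >= max_chain_depth;
        let too_big = (delta_sizes[i] as f64) > threshold * (raw_sizes[i] as f64);
        if i == 0 || too_deep || too_big {
            depth = 0;
            plans.push(Plan { kind: Kind::DeltaBase, base_offset: 0 });
        } else {
            depth += 1;
            plans.push(Plan { kind: Kind::Delta, base_offset: depth });
        }
    }
    plans
}

/// Rows that must be loaded to reconstruct `row`: its base through itself.
pub fn rows_to_load(plans: &[Plan], row: usize) -> std::ops::RangeInclusive<usize> {
    (row - plans[row].base_offset)..=row
}
```

With a chain depth of 4, a run of seven similar rows yields one base, four deltas, and then a fresh base; reading row 3 alone requires loading rows 0 through 3, which is the cost the chain depth bounds.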

Files Changed

| Component | Files |
|---|---|
| Delta algorithm | rust/lance-encoding/src/encodings/physical/delta.rs (new) |
| BlobKind variants | rust/lance-core/src/datatypes.rs, field.rs |
| Metadata key | rust/lance-arrow/src/lib.rs |
| Encoder | rust/lance-encoding/src/encodings/logical/blob.rs |
| Decoder | rust/lance-encoding/src/encodings/logical/primitive/blob.rs |
| Protobuf | protos/encodings_v2_1.proto |
| Wiring | encoder.rs, decoder.rs, primitive.rs, format.rs, testing.rs |

Usage

Set field metadata lance-encoding:delta-blob = "true" on a LargeBinary column.

Test plan

  • Delta algorithm unit tests (9 tests): varint roundtrip, identical buffers, small edits, larger files, chained deltas, source code diffs
  • Delta blob round-trip tests (4 tests): similar source code versions, nulls, completely different data, larger files
  • All existing blob tests pass (9 tests) — no regressions

🤖 Generated with Claude Code

Implement binary delta encoding for LargeBinary blob columns, inspired
by Git's packfile delta format. When consecutive rows contain similar
content (e.g., successive versions of source files), only the
differences are stored, significantly reducing storage.

Key components:
- Rabin rolling hash delta algorithm (copy/insert instructions)
- DeltaBlobStructuralEncoder with configurable chain depth
- DeltaBlobPageScheduler for decoding delta chains
- New DeltaBlobLayout protobuf variant
- DeltaBase/Delta variants in BlobKind

Enable via field metadata: lance-encoding:delta-blob = "true"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement New feature or request label May 11, 2026
beinan and others added 3 commits May 12, 2026 05:34
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix decoder null/empty chain: track last reconstructed row instead of
  assuming row-1, so nulls between values don't break delta chains
- Add bounds checking to decode_varint (Truncated on short input)
- Add overflow protection to decode_varint (VarIntOverflow after 10 bytes)
- Add bounds checks to all copy instruction byte reads in apply_delta
- Change delta threshold from 1.0 to 0.8 — require 20% savings
- Remove unnecessary source.to_vec() in DeltaIndex — borrow with lifetime
- Use [u8; 7] stack array in encode_copy instead of Vec heap allocation
- Implement true O(1) Rabin rolling hash (rabin_hash_roll) with
  precomputed RABIN_POW_WINDOW constant
- Add tests for varint truncation, overflow, and rolling hash consistency
- Add doc comment noting data should be sorted for best compression
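
Two of the fixes listed above, bounds-checked varint decoding and an O(1) rolling hash update, can be sketched as follows. This is an assumption-laden illustration rather than the PR's code: the signatures of `decode_varint` and `rabin_hash_roll`, the `VarIntError` type, the multiplier `RABIN_BASE`, and the use of a wrapping polynomial (Rabin-Karp-style) hash are all guesses; only the names `decode_varint`, `rabin_hash_roll`, `RABIN_POW_WINDOW`, `Truncated`, `VarIntOverflow`, and the 16-byte window come from the PR text.

```rust
#[derive(Debug, PartialEq)]
pub enum VarIntError {
    Truncated,
    VarIntOverflow,
}

/// Decode a little-endian base-128 varint into a u64, advancing `pos`.
/// Fails with `Truncated` on short input and `VarIntOverflow` if the
/// value does not fit in 64 bits (at most 10 encoded bytes).
pub fn decode_varint(data: &[u8], pos: &mut usize) -> Result<u64, VarIntError> {
    let mut value = 0u64;
    for i in 0..10u32 {
        let byte = *data.get(*pos).ok_or(VarIntError::Truncated)?;
        *pos += 1;
        let bits = (byte & 0x7f) as u64;
        // The 10th byte may only contribute the single remaining bit.
        if i == 9 && bits > 1 {
            return Err(VarIntError::VarIntOverflow);
        }
        value |= bits << (7 * i);
        if byte & 0x80 == 0 {
            return Ok(value);
        }
    }
    Err(VarIntError::VarIntOverflow)
}

pub const WINDOW: usize = 16;
/// Hypothetical multiplier; the PR's actual constant is not shown.
const RABIN_BASE: u64 = 257;
/// RABIN_BASE^(WINDOW - 1), precomputed with wrapping arithmetic.
const RABIN_POW_WINDOW: u64 = RABIN_BASE.wrapping_pow(WINDOW as u32 - 1);

/// Hash a full window from scratch: O(WINDOW).
pub fn rabin_hash(window: &[u8]) -> u64 {
    window
        .iter()
        .fold(0u64, |h, &b| h.wrapping_mul(RABIN_BASE).wrapping_add(b as u64))
}

/// Slide the window one byte to the right in O(1): remove the outgoing
/// byte's contribution, shift, then add the incoming byte.
pub fn rabin_hash_roll(hash: u64, outgoing: u8, incoming: u8) -> u64 {
    hash.wrapping_sub((outgoing as u64).wrapping_mul(RABIN_POW_WINDOW))
        .wrapping_mul(RABIN_BASE)
        .wrapping_add(incoming as u64)
}
```

The rolled hash should equal a from-scratch hash of the same window at every position, which is the property a rolling-hash consistency test would check.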

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
// kind indicates whether the value is a base (stored as-is) or a delta
// (stored as copy/insert instructions relative to the previous value in the chain).
// base_offset is the row distance back to the base value in the group.
message DeltaBlobLayout {

Collaborator

To add a new encoding layout, a vote from the PMC is required. Could you start a discussion about this first?

Contributor Author

Started a PMC vote discussion: #6736
