feat: add Git-style delta encoding for blob columns #6733
Open
beinan wants to merge 4 commits into
Implement binary delta encoding for LargeBinary blob columns, inspired by Git's packfile delta format. When consecutive rows contain similar content (e.g., successive versions of source files), only the differences are stored, significantly reducing storage.

Key components:
- Rabin rolling hash delta algorithm (copy/insert instructions)
- DeltaBlobStructuralEncoder with configurable chain depth
- DeltaBlobPageScheduler for decoding delta chains
- New DeltaBlobLayout protobuf variant
- DeltaBase/Delta variants in BlobKind

Enable via field metadata: lance-encoding:delta-blob = "true"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
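To make the copy/insert model concrete, here is a minimal sketch of replaying delta instructions against a base blob. The `DeltaOp` enum and `apply_ops` function are hypothetical names used only for illustration. The PR encodes instructions in a compact Git-style byte format, which this sketch does not attempt to reproduce.

```rust
/// One delta instruction. Hypothetical simplified form: an enum rather than
/// the compact byte encoding used by the PR.
#[derive(Debug, Clone, PartialEq)]
pub enum DeltaOp {
    /// Copy `len` bytes starting at `offset` in the base blob.
    Copy { offset: usize, len: usize },
    /// Insert literal bytes that have no match in the base blob.
    Insert(Vec<u8>),
}

/// Rebuild the target blob by replaying `ops` against `base`.
/// Returns `None` if a copy instruction would read outside the base.
pub fn apply_ops(base: &[u8], ops: &[DeltaOp]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for op in ops {
        match op {
            DeltaOp::Copy { offset, len } => {
                // checked_add guards against offset + len overflowing usize.
                let end = offset.checked_add(*len)?;
                // `get` returns None when the range exceeds the base length.
                out.extend_from_slice(base.get(*offset..end)?);
            }
            DeltaOp::Insert(bytes) => out.extend_from_slice(bytes),
        }
    }
    Some(out)
}
```

With a base of `b"hello world"`, copying the first six bytes and then inserting `b"there"` reconstructs `b"hello there"`. Only the five inserted bytes and one small copy instruction would need to be stored for that row.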
- Fix decoder null/empty chain: track last reconstructed row instead of assuming row-1, so nulls between values don't break delta chains
- Add bounds checking to decode_varint (Truncated on short input)
- Add overflow protection to decode_varint (VarIntOverflow after 10 bytes)
- Add bounds checks to all copy instruction byte reads in apply_delta
- Change delta threshold from 1.0 to 0.8, requiring 20% savings
- Remove unnecessary source.to_vec() in DeltaIndex; borrow with lifetime instead
- Use [u8; 7] stack array in encode_copy instead of Vec heap allocation
- Implement true O(1) Rabin rolling hash (rabin_hash_roll) with precomputed RABIN_POW_WINDOW constant
- Add tests for varint truncation, overflow, and rolling hash consistency
- Add doc comment noting data should be sorted for best compression

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
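The varint hardening described above can be sketched as follows. This is a minimal illustration that assumes an LEB128-style encoding (7 payload bits per byte, high bit as continuation flag). The function name and the `Truncated` and `VarIntOverflow` error names come from the commit message, but the signature, the error enum layout, and the exact overflow rule are assumptions rather than the PR's code.

```rust
/// Error cases named in the commit message; the enum itself is hypothetical.
#[derive(Debug, PartialEq)]
pub enum VarIntError {
    /// Input ended before the final byte of the varint.
    Truncated,
    /// The varint does not fit in a u64 (more than 10 bytes, or excess bits).
    VarIntOverflow,
}

/// Decode an unsigned LEB128-style varint from the front of `buf`.
/// Returns the decoded value and the number of bytes consumed.
pub fn decode_varint(buf: &[u8]) -> Result<(u64, usize), VarIntError> {
    let mut value: u64 = 0;
    // A u64 needs at most 10 groups of 7 bits.
    for i in 0..10 {
        // Bounds check: never index past the end of the input.
        let byte = *buf.get(i).ok_or(VarIntError::Truncated)?;
        // The 10th byte may only contribute the single remaining bit.
        if i == 9 && byte > 1 {
            return Err(VarIntError::VarIntOverflow);
        }
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Ok((value, i + 1));
        }
    }
    Err(VarIntError::VarIntOverflow)
}
```

Returning the consumed length alongside the value lets the caller advance its cursor without re-scanning the input.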
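The O(1) rolling update with a precomputed power constant can be illustrated with a short sketch. It uses a Rabin-Karp-style polynomial hash with wrapping 64-bit arithmetic, which is a simplification and may differ from the hash in the PR. `WINDOW`, `BASE`, `POW_WINDOW`, `hash_window`, and `hash_roll` are hypothetical names and values chosen for illustration.

```rust
/// Window length in bytes (illustrative value, not taken from the PR).
pub const WINDOW: usize = 16;
/// Polynomial multiplier (illustrative value, not taken from the PR).
pub const BASE: u64 = 257;
/// BASE^(WINDOW - 1), precomputed at compile time so that removing the
/// outgoing byte costs a single multiplication.
pub const POW_WINDOW: u64 = BASE.wrapping_pow((WINDOW - 1) as u32);

/// Hash a full window from scratch: O(window length).
/// Must be called with exactly WINDOW bytes to stay consistent with hash_roll.
pub fn hash_window(window: &[u8]) -> u64 {
    window
        .iter()
        .fold(0u64, |h, &b| h.wrapping_mul(BASE).wrapping_add(u64::from(b)))
}

/// Slide the window one byte to the right in O(1): remove the contribution
/// of `outgoing`, shift the remaining terms up by one power, add `incoming`.
pub fn hash_roll(hash: u64, outgoing: u8, incoming: u8) -> u64 {
    hash.wrapping_sub(u64::from(outgoing).wrapping_mul(POW_WINDOW))
        .wrapping_mul(BASE)
        .wrapping_add(u64::from(incoming))
}
```

A consistency test of the kind the commit mentions would slide the window across a buffer and check that each rolled hash equals the hash recomputed from scratch for the same window.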
Xuanwo reviewed on May 12, 2026
// kind indicates whether the value is a base (stored as-is) or a delta
// (stored as copy/insert instructions relative to the previous value in the chain).
// base_offset is the row distance back to the base value in the group.
message DeltaBlobLayout {
Collaborator
To add a new encoding layout, a vote from the PMC is required. Could you start a discussion about this first?
Summary
- Implement binary delta encoding for LargeBinary blob columns, inspired by Git's packfile delta format

Design
- (position, size, kind, base_offset): kind distinguishes DeltaBase vs Delta, and base_offset stores the distance back to the base.
- DeltaBlobPageScheduler expands requested ranges to include required bases, loads all data, applies delta chains, then returns only the requested rows.

Files Changed
- rust/lance-encoding/src/encodings/physical/delta.rs (new)
- rust/lance-core/src/datatypes.rs, field.rs
- rust/lance-arrow/src/lib.rs
- rust/lance-encoding/src/encodings/logical/blob.rs
- rust/lance-encoding/src/encodings/logical/primitive/blob.rs
- protos/encodings_v2_1.proto
- encoder.rs, decoder.rs, primitive.rs, format.rs, testing.rs

Usage
Set field metadata lance-encoding:delta-blob = "true" on a LargeBinary column.

Test plan
🤖 Generated with Claude Code
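The range expansion described under Design can be sketched as follows. This is an illustration only: `Kind`, `Descriptor`, and `expand_range` are hypothetical names, the position and size fields are omitted, and the sketch assumes `base_offset` is the row distance from a delta row back to its base, as the protobuf comment states.

```rust
use std::ops::Range;

/// Whether a row is stored in full or as a delta (names follow the PR's
/// DeltaBase/Delta variants; this enum itself is a hypothetical stand-in).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    DeltaBase,
    Delta,
}

/// Per-row descriptor, reduced to the two fields needed for range expansion.
#[derive(Debug, Clone, Copy)]
pub struct Descriptor {
    pub kind: Kind,
    /// Row distance back to the base; ignored for base rows.
    pub base_offset: usize,
}

/// Widen `range` so that it starts at the earliest base any requested row
/// depends on. Panics if `range` extends past the end of `descs`.
pub fn expand_range(descs: &[Descriptor], range: Range<usize>) -> Range<usize> {
    let start = range
        .clone()
        .map(|row| match descs[row].kind {
            Kind::DeltaBase => row,
            // saturating_sub avoids underflow on malformed offsets.
            Kind::Delta => row.saturating_sub(descs[row].base_offset),
        })
        .min()
        // An empty request needs no extra rows.
        .unwrap_or(range.start);
    start..range.end
}
```

A scheduler built this way would load the expanded range, reconstruct each row from its base forward, and then discard the leading rows that were fetched only to resolve the chain.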