feat: HuggingFace Hub storage backend and CDC table properties#2375
Open
kszucs wants to merge 2 commits into
Conversation
Force-pushed from `0f6b02e` to `ca61a05`.
kszucs (Member, Author) commented Apr 27, 2026:

Sure, should be released soon enough. I also need to set up some additional testing.
kszucs (Member, Author):

Pulled out the opendal 0.56 bump into a separate PR #2401.
Force-pushed from `d7550d9` to `c7d0b6f`.
kszucs commented May 11, 2026:
Force-pushed from `22afdec` to `a82fbd8`.
Adds two opt-in capabilities for storing Iceberg tables on HuggingFace Hub with content-defined chunking for efficient deduplication.

## HuggingFace Hub storage

New `opendal-hf` feature on `iceberg-storage-opendal` (off by default, included in `opendal-all`) that wires HuggingFace's OpenDAL service into `FileIO`. Paths use the form:

```
hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>
```

where `repo_type` must be one of `models`, `datasets`, `spaces`, or `buckets` (XET-backed object storage). The prefix is mandatory; there is no implicit default.

Configuration is passed via `FileIOBuilder` properties:

- `hf.token`: API token (required for private repos / writes)
- `hf.endpoint`: Hub endpoint, defaults to https://huggingface.co
- `hf.revision`: fallback revision when a path has no `@<revision>`

`OpenDalResolvingStorage` recognises the `hf` scheme and lazily constructs a per-scheme storage instance. `delete_stream` groups paths by `<repo_type>/<repo_id>` so that bucket and dataset paths to the same repo do not share an operator.

## CDC (content-defined chunking) table properties

New table properties under the `write.parquet.content-defined-chunking.*` namespace (matching PyIceberg convention):

- `write.parquet.content-defined-chunking.enabled` (bool, default false)
- `write.parquet.content-defined-chunking.min-chunk-size` (bytes, default 256 KiB)
- `write.parquet.content-defined-chunking.max-chunk-size` (bytes, default 1 MiB)
- `write.parquet.content-defined-chunking.norm-level` (i32, default 0)

CDC is opt-in: it activates only when `enabled = "true"` is set explicitly. Size/level properties without the enabled flag are parsed and stored but have no effect. Defaults match parquet's own `CdcOptions` defaults so the Iceberg layer stays in sync. CDC options are applied directly in the DataFusion physical write plan.
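The opt-in CDC resolution described above could be sketched roughly as follows. This is a minimal illustration, not the PR's code; `CdcConfig` and `cdc_config_from_props` are hypothetical names:

```rust
use std::collections::HashMap;

// Hypothetical sketch of resolving the CDC table properties described above.
#[derive(Debug, PartialEq)]
struct CdcConfig {
    min_chunk_size: u64, // bytes, default 256 KiB
    max_chunk_size: u64, // bytes, default 1 MiB
    norm_level: i32,     // default 0
}

const PREFIX: &str = "write.parquet.content-defined-chunking";

fn cdc_config_from_props(props: &HashMap<String, String>) -> Option<CdcConfig> {
    // Opt-in: without an explicit `enabled = "true"`, CDC stays off and the
    // size/level properties have no effect.
    if props.get(&format!("{PREFIX}.enabled")).map(String::as_str) != Some("true") {
        return None;
    }
    let get_u64 = |key: &str, default: u64| {
        props
            .get(&format!("{PREFIX}.{key}"))
            .and_then(|v| v.parse().ok())
            .unwrap_or(default)
    };
    Some(CdcConfig {
        min_chunk_size: get_u64("min-chunk-size", 256 * 1024),
        max_chunk_size: get_u64("max-chunk-size", 1024 * 1024),
        norm_level: props
            .get(&format!("{PREFIX}.norm-level"))
            .and_then(|v| v.parse().ok())
            .unwrap_or(0),
    })
}

fn main() {
    // Size properties alone do not enable CDC.
    let mut props = HashMap::new();
    props.insert(format!("{PREFIX}.min-chunk-size"), "131072".to_string());
    assert_eq!(cdc_config_from_props(&props), None);

    // With the enabled flag, unset properties fall back to the stated defaults.
    props.insert(format!("{PREFIX}.enabled"), "true".to_string());
    let cfg = cdc_config_from_props(&props).unwrap();
    assert_eq!(cfg.min_chunk_size, 131072);
    assert_eq!(cfg.max_chunk_size, 1024 * 1024);
    assert_eq!(cfg.norm_level, 0);
}
```

The point of returning `Option<CdcConfig>` is that the caller can treat `None` as "write without CDC", matching the opt-in semantics above.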
## Testing

Two CI jobs gated on `HF_TOKEN`: Rust `opendal-hf` integration tests and Python CDC + HF tests. The Python HF test writes a table via PyIceberg and reads it back via `IcebergDataFusionTable` using the `opendal-hf` backend. Env vars: `HF_TOKEN`, `HF_BUCKET`, `HF_DATASET`.
kszucs commented May 11, 2026:
```yaml
name: HuggingFace Hub integration tests
runs-on: ubuntu-latest
# Skip the job entirely when HF secrets are not available (e.g. PRs from forks).
if: ${{ secrets.HF_TOKEN != '' }}
```
kszucs (Member, Author):

HF doesn't have a MinIO-like setup, so we should configure a free HuggingFace account for the CI.
kszucs (Member, Author):

@Xuanwo could you please take a look?
kszucs commented May 11, 2026:
```rust
/// Only the fields required by this crate are stored; revision is consumed
/// during parsing but not retained.
#[derive(Debug, Clone, PartialEq, Eq)]
pub(crate) struct HfUri {
```
kszucs (Member, Author):

Almost identical to `HfUri` in opendal, just not exposed yet.
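As a rough illustration of the parsing such a struct supports, here is a self-contained sketch of the `hf://` path form from the description. The names `ParsedHfPath` and `parse_hf_path` are hypothetical; the actual `HfUri` implementation differs (e.g. it drops the revision after parsing):

```rust
// Sketch of parsing: hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>
#[derive(Debug, PartialEq)]
struct ParsedHfPath {
    repo_type: String,
    owner: String,
    repo: String,
    revision: Option<String>,
    path_in_repo: String,
}

fn parse_hf_path(uri: &str) -> Option<ParsedHfPath> {
    let rest = uri.strip_prefix("hf://")?; // the hf:// prefix is mandatory
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next()?.to_string();
    // Only these four repo types are accepted per the description.
    if !["models", "datasets", "spaces", "buckets"].contains(&repo_type.as_str()) {
        return None;
    }
    let owner = parts.next()?.to_string();
    let repo_seg = parts.next()?;
    let path_in_repo = parts.next().unwrap_or("").to_string();
    // An optional @<revision> suffix on the repo segment overrides hf.revision.
    let (repo, revision) = match repo_seg.split_once('@') {
        Some((r, rev)) => (r.to_string(), Some(rev.to_string())),
        None => (repo_seg.to_string(), None),
    };
    Some(ParsedHfPath { repo_type, owner, repo, revision, path_in_repo })
}

fn main() {
    let p = parse_hf_path("hf://datasets/acme/events@main/data/part-0.parquet").unwrap();
    assert_eq!(p.repo_type, "datasets");
    assert_eq!(p.revision.as_deref(), Some("main"));
    assert_eq!(p.path_in_repo, "data/part-0.parquet");
    assert!(parse_hf_path("datasets/acme/events").is_none()); // no implicit scheme
}
```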
## Are these changes tested?

- Unit tests for `HfUri` parsing and CDC property parsing.
- `file_io_hf_test.rs` guarded on `HF_OPENDAL_TOKEN`, `HF_OPENDAL_BUCKET`, `HF_OPENDAL_DATASET`; tests skip gracefully when the env vars are unset.
- `test_huggingface_and_cdc.py` covering CDC property persistence, PyIceberg writes with CDC, DataFusion read-back, and HF credentials end-to-end (skipped without `HF_OPENDAL_TOKEN` / `HF_OPENDAL_TABLE_METADATA`).
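The "skip gracefully" pattern can be sketched as below. This is an illustration of the idea, not the PR's test code; `hf_credentials` is a hypothetical helper, and only the env var names come from the description:

```rust
use std::env;

// Bail out early when the required environment variables are not set,
// so the integration test becomes a no-op instead of a failure.
fn hf_credentials() -> Option<(String, String)> {
    let token = env::var("HF_OPENDAL_TOKEN").ok()?;
    let bucket = env::var("HF_OPENDAL_BUCKET").ok()?;
    Some((token, bucket))
}

fn main() {
    match hf_credentials() {
        None => println!("skipping HuggingFace integration test: env vars unset"),
        Some((_token, bucket)) => {
            // A real test would build a FileIO against hf://buckets/... here.
            println!("running against bucket {bucket}");
        }
    }
}
```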