feat: HuggingFace Hub storage backend and CDC table properties by kszucs · Pull Request #2375 · apache/iceberg-rust

kszucs · 2026-04-27T08:01:44Z

Which issue does this PR close?

Closes #.

What changes are included in this PR?

Adds two opt-in capabilities for storing Iceberg tables on HuggingFace Hub with content-defined chunking for efficient deduplication.

HuggingFace Hub storage

New opendal-hf feature on iceberg-storage-opendal (off by default, included in opendal-all) that wires HuggingFace's OpenDAL service into FileIO. Paths use the form:

hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>

where repo_type must be one of models, datasets, spaces, or buckets (XET-backed object storage). The repo_type prefix is mandatory; there is no implicit default. Configuration is passed via FileIOBuilder properties:

hf.token — API token (required for private repos / writes)
hf.endpoint — Hub endpoint, defaults to https://huggingface.co
hf.revision — fallback revision when a path has no @<revision>
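As a rough illustration of the path form and the hf.revision fallback described above, here is a minimal sketch. `ParsedHfPath`, `parse_hf_path`, and `effective_revision` are hypothetical names, not the PR's `HfUri` or any opendal API, and the sketch assumes a revision contains no `/`.

```rust
use std::collections::HashMap;

/// Hypothetical parsed form of an `hf://` path (not the PR's actual `HfUri`).
#[derive(Debug, Clone, PartialEq, Eq)]
struct ParsedHfPath {
    repo_type: String,
    /// `<owner>/<repo>`
    repo_id: String,
    revision: Option<String>,
    path_in_repo: String,
}

const REPO_TYPES: [&str; 4] = ["models", "datasets", "spaces", "buckets"];

/// Parse `hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>`.
/// Assumption: the revision contains no `/`.
fn parse_hf_path(uri: &str) -> Result<ParsedHfPath, String> {
    let rest = uri
        .strip_prefix("hf://")
        .ok_or_else(|| format!("not an hf:// path: {uri}"))?;
    // At most four parts: repo type, owner, repo[@revision], remaining path.
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next().unwrap_or_default();
    if !REPO_TYPES.contains(&repo_type) {
        // The repo type prefix is mandatory, so an unknown first segment is an error.
        return Err(format!("missing or unknown repo type: {repo_type:?}"));
    }
    let owner = parts.next().filter(|s| !s.is_empty()).ok_or("missing owner")?;
    let repo_rev = parts.next().filter(|s| !s.is_empty()).ok_or("missing repo")?;
    let path_in_repo = parts.next().unwrap_or_default().to_string();
    let (repo, revision) = match repo_rev.split_once('@') {
        Some((repo, rev)) => (repo, Some(rev.to_string())),
        None => (repo_rev, None),
    };
    Ok(ParsedHfPath {
        repo_type: repo_type.to_string(),
        repo_id: format!("{owner}/{repo}"),
        revision,
        path_in_repo,
    })
}

/// A revision in the path wins; otherwise fall back to the `hf.revision` property.
fn effective_revision<'a>(
    parsed: &'a ParsedHfPath,
    props: &'a HashMap<String, String>,
) -> Option<&'a str> {
    parsed
        .revision
        .as_deref()
        .or_else(|| props.get("hf.revision").map(String::as_str))
}
```

With this sketch, a path without `@<revision>` resolves to whatever `hf.revision` holds, and a path without a recognised repo type is rejected rather than defaulted.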

OpenDalResolvingStorage recognises the hf scheme and lazily constructs a per-scheme storage instance. delete_stream groups paths by <repo_type>/<repo_id> so bucket and dataset paths to the same repo do not share an operator.
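The grouping step can be pictured roughly as follows. This is a sketch under assumptions, not the PR's `delete_stream`: `operator_key` and `group_by_operator` are hypothetical helpers, the key is taken to be `<repo_type>/<owner>/<repo>` with any `@<revision>` stripped, and malformed paths are silently skipped here purely for brevity.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: derive the `<repo_type>/<owner>/<repo>` key for a path.
/// Returns `None` for anything that is not a well-formed `hf://` path.
fn operator_key(uri: &str) -> Option<String> {
    let rest = uri.strip_prefix("hf://")?;
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next().filter(|s| !s.is_empty())?;
    let owner = parts.next().filter(|s| !s.is_empty())?;
    // Drop an optional `@<revision>` suffix so the key only identifies the repo.
    let repo = parts.next()?.split('@').next().filter(|s| !s.is_empty())?;
    Some(format!("{repo_type}/{owner}/{repo}"))
}

/// Group paths so that each distinct repo key gets its own batch.
/// Assumption of this sketch: malformed paths are skipped rather than reported.
fn group_by_operator(paths: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for &path in paths {
        if let Some(key) = operator_key(path) {
            groups.entry(key).or_default().push(path.to_string());
        }
    }
    groups
}
```

Because the repo type is part of the key, `hf://buckets/acme/t/...` and `hf://datasets/acme/t/...` land in different groups even though the owner and repo names match.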

CDC (content-defined chunking) table properties

New table properties under write.parquet.content-defined-chunking.* (matching PyIceberg convention):

write.parquet.content-defined-chunking.enabled (bool, default false)
write.parquet.content-defined-chunking.min-chunk-size (bytes, default 256 KiB)
write.parquet.content-defined-chunking.max-chunk-size (bytes, default 1 MiB)
write.parquet.content-defined-chunking.norm-level (i32, default 0)

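CDC is opt-in: it activates only when enabled = "true" is set explicitly. Size and level properties set without the enabled flag are parsed and stored but have no effect. Defaults match parquet's own CdcOptions defaults, so the Iceberg layer stays in sync. CDC options are applied directly in the DataFusion physical write plan.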
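The opt-in behaviour and defaults can be sketched as below. `CdcSettings`, `parse_or`, and `cdc_settings` are hypothetical names rather than the PR's code or parquet's `CdcOptions`. The sketch assumes the flag must be exactly the string "true", and it returns early when CDC is disabled, which simplifies the "parsed but ignored" behaviour.

```rust
use std::collections::HashMap;
use std::str::FromStr;

const PREFIX: &str = "write.parquet.content-defined-chunking.";

/// Hypothetical container for the resolved CDC options.
#[derive(Debug, Clone, PartialEq, Eq)]
struct CdcSettings {
    min_chunk_size: usize,
    max_chunk_size: usize,
    norm_level: i32,
}

/// Read `<PREFIX><key>` from the table properties, falling back to `default`.
fn parse_or<T: FromStr>(
    props: &HashMap<String, String>,
    key: &str,
    default: T,
) -> Result<T, String> {
    match props.get(&format!("{PREFIX}{key}")) {
        Some(raw) => raw
            .parse()
            .map_err(|_| format!("invalid value for {PREFIX}{key}: {raw}")),
        None => Ok(default),
    }
}

/// Returns `Ok(None)` unless `enabled` is exactly "true" (an assumption of this sketch).
fn cdc_settings(props: &HashMap<String, String>) -> Result<Option<CdcSettings>, String> {
    let enabled = props.get(&format!("{PREFIX}enabled")).map(String::as_str) == Some("true");
    if !enabled {
        // Size and level properties alone never switch CDC on.
        return Ok(None);
    }
    Ok(Some(CdcSettings {
        min_chunk_size: parse_or(props, "min-chunk-size", 256 * 1024)?, // 256 KiB
        max_chunk_size: parse_or(props, "max-chunk-size", 1024 * 1024)?, // 1 MiB
        norm_level: parse_or(props, "norm-level", 0)?,
    }))
}
```

Returning `Option<CdcSettings>` keeps "disabled" distinct from "enabled with defaults", so a writer can skip CDC entirely when the result is `None`.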

Are these changes tested?

Rust unit tests for HfUri parsing and CDC property parsing.
Rust integration tests in file_io_hf_test.rs guarded on HF_OPENDAL_TOKEN, HF_OPENDAL_BUCKET, HF_OPENDAL_DATASET; tests skip gracefully when env vars are unset.
Python tests in test_huggingface_and_cdc.py covering CDC property persistence, PyIceberg writes with CDC, DataFusion read-back, and HF credentials end-to-end (skipped without HF_OPENDAL_TOKEN / HF_OPENDAL_TABLE_METADATA).
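One way to get the "skip gracefully" behaviour for the Rust integration tests is sketched below. `HfTestEnv`, `hf_test_env`, and `example_test_body` are hypothetical and not taken from file_io_hf_test.rs. The lookup is passed in as a closure so the logic can be exercised without touching the real environment, and treating an empty value as unset is an assumption.

```rust
/// Hypothetical bundle of settings for the HF integration tests.
#[derive(Debug, PartialEq, Eq)]
struct HfTestEnv {
    token: String,
    bucket: String,
    dataset: String,
}

/// Returns `None` when any required variable is missing or empty.
/// `lookup` is injected so callers can pass `std::env::var` or a stub.
fn hf_test_env(lookup: impl Fn(&str) -> Option<String>) -> Option<HfTestEnv> {
    let get = |key: &str| lookup(key).filter(|v| !v.is_empty());
    Some(HfTestEnv {
        token: get("HF_OPENDAL_TOKEN")?,
        bucket: get("HF_OPENDAL_BUCKET")?,
        dataset: get("HF_OPENDAL_DATASET")?,
    })
}

/// Shape of a test body that skips instead of failing when unconfigured.
fn example_test_body() {
    let Some(env) = hf_test_env(|key| std::env::var(key).ok()) else {
        eprintln!("skipping: HF_OPENDAL_* variables are not set");
        return;
    };
    // A real test would build a FileIO from `env` here; omitted in this sketch.
    let _ = env;
}
```

A skipped test still reports as passed under this pattern, so CI logs are the only place the skip is visible.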

blackmwk

Thanks @kszucs, let's hold this PR for a while and wait for the opendal release.

kszucs · 2026-04-28T11:02:40Z

Sure, should be released soon enough. I also need to set up some additional testing.

kszucs · 2026-05-04T07:12:15Z

Pulled out the opendal 0.56 bump into a separate PR, #2401.

Adds two opt-in capabilities for storing Iceberg tables on HuggingFace Hub with content-defined chunking for efficient deduplication.

## HuggingFace Hub storage

New `opendal-hf` feature on `iceberg-storage-opendal` (off by default, included in `opendal-all`) that wires HuggingFace's OpenDAL service into `FileIO`. Paths use the form:

    hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>

where `repo_type` must be one of `models`, `datasets`, `spaces`, or `buckets` (XET-backed object storage). The prefix is mandatory; there is no implicit default. Configuration is passed via `FileIOBuilder` properties:

- `hf.token`: API token (required for private repos / writes)
- `hf.endpoint`: Hub endpoint, defaults to https://huggingface.co
- `hf.revision`: fallback revision when a path has no `@<revision>`

`OpenDalResolvingStorage` recognises the `hf` scheme and lazily constructs a per-scheme storage instance. `delete_stream` groups paths by `<repo_type>/<repo_id>` so that bucket and dataset paths to the same repo do not share an operator.

## CDC (content-defined chunking) table properties

New table properties under the `write.parquet.content-defined-chunking.*` namespace (matching PyIceberg convention):

- `write.parquet.content-defined-chunking.enabled` (bool, default false)
- `write.parquet.content-defined-chunking.min-chunk-size` (bytes, default 256 KiB)
- `write.parquet.content-defined-chunking.max-chunk-size` (bytes, default 1 MiB)
- `write.parquet.content-defined-chunking.norm-level` (i32, default 0)

CDC is opt-in: it activates only when `enabled = "true"` is set explicitly. Size/level properties without the enabled flag are parsed and stored but have no effect. Defaults match parquet's own `CdcOptions` defaults so the Iceberg layer stays in sync. CDC options are applied directly in the DataFusion physical write plan.

Two jobs gated on HF_TOKEN: Rust opendal-hf integration tests and Python CDC + HF tests. The Python HF test writes a table via PyIceberg and reads it back via IcebergDataFusionTable using the opendal-hf backend. Env vars: HF_TOKEN, HF_BUCKET, HF_DATASET.

kszucs · 2026-05-11T18:06:02Z

+    name: HuggingFace Hub integration tests
+    runs-on: ubuntu-latest
+    # Skip the job entirely when HF secrets are not available (e.g. PRs from forks).
+    if: ${{ secrets.HF_TOKEN != '' }}


HuggingFace doesn't have a MinIO-like local setup, so we should configure a free HuggingFace account for the CI.

kszucs · 2026-05-11T18:06:18Z

@Xuanwo could you please take a look?

kszucs · 2026-05-11T18:07:19Z

+/// Only the fields required by this crate are stored; revision is consumed
+/// during parsing but not retained.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub(crate) struct HfUri {


Almost identical to the HfUri in opendal, which just isn't exposed yet.

kszucs force-pushed the opendal-hf branch 3 times, most recently from 0f6b02e to ca61a05 · April 27, 2026 08:16

Comment thread on Cargo.toml (outdated)

blackmwk requested changes · Apr 28, 2026

kszucs force-pushed the opendal-hf branch 3 times, most recently from d7550d9 to c7d0b6f · May 8, 2026 07:52

kszucs force-pushed the opendal-hf branch from c7d0b6f to d5c56f1 · May 11, 2026 10:26

Comment thread on bindings/python/tests/test_hf_and_cdc.py (outdated)

kszucs force-pushed the opendal-hf branch 2 times, most recently from 22afdec to a82fbd8 · May 11, 2026 17:09

kszucs force-pushed the opendal-hf branch from a82fbd8 to ef16f03 · May 11, 2026 17:40

kszucs requested a review from blackmwk · May 11, 2026 18:05

kszucs wants to merge 2 commits into apache:main from kszucs:opendal-hf
