feat: HuggingFace Hub storage backend and CDC table properties#2375

Open
kszucs wants to merge 2 commits into apache:main from kszucs:opendal-hf

@kszucs (Member) commented Apr 27, 2026

Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

Adds two opt-in capabilities for storing Iceberg tables on HuggingFace Hub with content-defined chunking for efficient deduplication.

HuggingFace Hub storage

New opendal-hf feature on iceberg-storage-opendal (off by default, included in opendal-all) that wires HuggingFace's OpenDAL service into FileIO. Paths use the form:

hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>

where repo_type must be one of models, datasets, spaces, or buckets. The prefix is mandatory. Configuration via FileIOBuilder properties:

  • hf.token — API token (required for private repos / writes)
  • hf.endpoint — Hub endpoint, defaults to https://huggingface.co
  • hf.revision — fallback revision when a path has no @<revision>
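
For illustration, the path form above could be parsed roughly as follows. This is a minimal sketch, not the PR's actual `HfUri`: the names `ParsedHfPath` and `parse_hf_path` are hypothetical, and it assumes a revision never contains a `/`.

```rust
/// Hypothetical parsed form of an `hf://` path (not the PR's `HfUri`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParsedHfPath {
    pub repo_type: String,
    /// "<owner>/<repo>"
    pub repo_id: String,
    pub revision: Option<String>,
    pub path_in_repo: String,
}

/// Repo types listed in the PR description; the prefix is mandatory.
const REPO_TYPES: [&str; 4] = ["models", "datasets", "spaces", "buckets"];

pub fn parse_hf_path(input: &str) -> Result<ParsedHfPath, String> {
    let rest = input
        .strip_prefix("hf://")
        .ok_or_else(|| format!("not an hf:// path: {}", input))?;

    // At most four parts: repo_type, owner, repo[@revision], path_in_repo.
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next().unwrap_or_default();
    if !REPO_TYPES.contains(&repo_type) {
        return Err(format!("unknown or missing repo type: {:?}", repo_type));
    }
    let owner = parts.next().filter(|s| !s.is_empty()).ok_or("missing owner")?;
    let repo_and_rev = parts.next().filter(|s| !s.is_empty()).ok_or("missing repo")?;
    let path_in_repo = parts.next().unwrap_or_default();

    // Assumption: the revision is everything after the first '@' in this segment.
    let (repo, revision) = match repo_and_rev.split_once('@') {
        Some((repo, rev)) if !repo.is_empty() && !rev.is_empty() => {
            (repo, Some(rev.to_string()))
        }
        Some(_) => return Err("empty repo or revision".to_string()),
        None => (repo_and_rev, None),
    };

    Ok(ParsedHfPath {
        repo_type: repo_type.to_string(),
        repo_id: format!("{}/{}", owner, repo),
        revision,
        path_in_repo: path_in_repo.to_string(),
    })
}
```

In this sketch a path without `@<revision>` yields `revision: None`, which is where a caller would fall back to `hf.revision`.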

OpenDalResolvingStorage recognises the hf scheme and lazily constructs a per-scheme storage instance. delete_stream groups paths by <repo_type>/<repo_id> so bucket and dataset paths to the same repo do not share an operator.
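
The grouping described for `delete_stream` can be sketched as below. This is an assumption-laden illustration rather than the PR's code: `repo_key` and `group_paths_by_repo` are hypothetical names, the key drops any `@<revision>`, and paths that do not parse are skipped where a real implementation would presumably return an error.

```rust
use std::collections::BTreeMap;

/// Splits an `hf://` path into a "<repo_type>/<owner>/<repo>" key and the
/// path inside the repo. Hypothetical helper; returns None if it cannot parse.
fn repo_key(path: &str) -> Option<(String, String)> {
    let rest = path.strip_prefix("hf://")?;
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next()?;
    let owner = parts.next()?;
    // Drop an optional "@<revision>" suffix from the repo segment.
    let repo = parts.next()?.split('@').next()?;
    let path_in_repo = parts.next().unwrap_or_default();
    Some((
        format!("{}/{}/{}", repo_type, owner, repo),
        path_in_repo.to_string(),
    ))
}

/// Groups paths so each (repo_type, repo_id) pair gets its own batch,
/// which is what lets each batch use a separate operator.
pub fn group_paths_by_repo(paths: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut groups = BTreeMap::new();
    for path in paths {
        // Sketch only: unparseable paths are silently skipped here.
        if let Some((key, path_in_repo)) = repo_key(path) {
            groups.entry(key).or_insert_with(Vec::new).push(path_in_repo);
        }
    }
    groups
}
```

Because the repo type is part of the key, `hf://buckets/acme/t/...` and `hf://datasets/acme/t/...` land in different groups even though the owner and repo name match.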

CDC (content-defined chunking) table properties

New table properties under write.parquet.content-defined-chunking.* (matching PyIceberg convention):

  • write.parquet.content-defined-chunking.enabled (bool, default false)
  • write.parquet.content-defined-chunking.min-chunk-size (bytes, default 256 KiB)
  • write.parquet.content-defined-chunking.max-chunk-size (bytes, default 1 MiB)
  • write.parquet.content-defined-chunking.norm-level (i32, default 0)

CDC activates only when enabled = "true" is set explicitly. Defaults match parquet's own CdcOptions defaults. CDC options are applied in the DataFusion physical write plan.
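
A rough sketch of the opt-in behaviour, using only the property names and defaults listed above. `CdcSettings` and `cdc_from_properties` are hypothetical names, not the PR's types, and the sketch assumes sizes are given as plain byte counts and that only the exact string "true" enables CDC.

```rust
use std::collections::HashMap;
use std::str::FromStr;

const PREFIX: &str = "write.parquet.content-defined-chunking.";

/// Hypothetical holder for the resolved CDC options.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CdcSettings {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}

/// Reads `<PREFIX><key>` and parses it, falling back to `default` when unset.
fn parse_or<T: FromStr>(
    props: &HashMap<String, String>,
    key: &str,
    default: T,
) -> Result<T, String> {
    let full_key = format!("{}{}", PREFIX, key);
    match props.get(&full_key) {
        Some(raw) => raw
            .parse()
            .map_err(|_| format!("invalid value for {}: {}", full_key, raw)),
        None => Ok(default),
    }
}

/// Returns Ok(None) unless `enabled` is explicitly "true"; size and level
/// properties on their own do not turn CDC on.
pub fn cdc_from_properties(
    props: &HashMap<String, String>,
) -> Result<Option<CdcSettings>, String> {
    let enabled_key = format!("{}enabled", PREFIX);
    let enabled = props.get(&enabled_key).map(|v| v == "true").unwrap_or(false);
    if !enabled {
        return Ok(None);
    }
    Ok(Some(CdcSettings {
        min_chunk_size: parse_or(props, "min-chunk-size", 256 * 1024)?, // 256 KiB
        max_chunk_size: parse_or(props, "max-chunk-size", 1024 * 1024)?, // 1 MiB
        norm_level: parse_or(props, "norm-level", 0)?,
    }))
}
```

The resulting values would then be mapped onto parquet's `CdcOptions` by the writer; that mapping is left out here because it depends on the parquet version in use.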

Are these changes tested?

  • Rust unit tests for HfUri parsing and CDC property parsing.
  • Rust integration tests in file_io_hf_test.rs guarded on HF_OPENDAL_TOKEN, HF_OPENDAL_BUCKET, HF_OPENDAL_DATASET; tests skip gracefully when env vars are unset.
  • Python tests in test_huggingface_and_cdc.py covering CDC property persistence, PyIceberg writes with CDC, DataFusion read-back, and HF credentials end-to-end (skipped without HF_OPENDAL_TOKEN / HF_OPENDAL_TABLE_METADATA).
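
The "skip gracefully" pattern for the Rust integration tests might look like the sketch below. `HfTestEnv` and `hf_test_env` are hypothetical names; the lookup is passed in as a closure so the helper can be exercised without touching the real environment.

```rust
/// Hypothetical bundle of the settings the HF integration tests need.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HfTestEnv {
    pub token: String,
    pub bucket: String,
    pub dataset: String,
}

/// Returns None if any required variable is unset or empty, so that the
/// calling test can return early instead of failing.
pub fn hf_test_env(lookup: impl Fn(&str) -> Option<String>) -> Option<HfTestEnv> {
    let get = |name: &str| lookup(name).filter(|v| !v.is_empty());
    Some(HfTestEnv {
        token: get("HF_OPENDAL_TOKEN")?,
        bucket: get("HF_OPENDAL_BUCKET")?,
        dataset: get("HF_OPENDAL_DATASET")?,
    })
}
```

A test would call `hf_test_env(|name| std::env::var(name).ok())` and return early when it yields `None`.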

@kszucs force-pushed the opendal-hf branch 3 times, most recently from 0f6b02e to ca61a05 (April 27, 2026 08:16)
Comment thread Cargo.toml Outdated
@blackmwk (Contributor) left a comment

Thanks @kszucs, let's hold this PR for a while until the opendal release is out.

@kszucs (Member Author) commented Apr 28, 2026

Sure, should be released soon enough. I also need to set up some additional testing.

@kszucs (Member Author) commented May 4, 2026

Pulled the opendal 0.56 bump out into a separate PR: #2401.

@kszucs force-pushed the opendal-hf branch 3 times, most recently from d7550d9 to c7d0b6f (May 8, 2026 07:52)
Comment thread bindings/python/tests/test_hf_and_cdc.py Outdated
@kszucs force-pushed the opendal-hf branch 2 times, most recently from 22afdec to a82fbd8 (May 11, 2026 17:09)
Adds two opt-in capabilities for storing Iceberg tables on HuggingFace
Hub with content-defined chunking for efficient deduplication.

## HuggingFace Hub storage

New `opendal-hf` feature on `iceberg-storage-opendal` (off by default,
included in `opendal-all`) that wires HuggingFace's OpenDAL service into
`FileIO`. Paths use the form:

  hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>

where `repo_type` must be one of `models`, `datasets`, `spaces`, or
`buckets` (XET-backed object storage). The prefix is mandatory — there
is no implicit default. Configuration is passed via `FileIOBuilder`
properties:

  - `hf.token`     — API token (required for private repos / writes)
  - `hf.endpoint`  — Hub endpoint, defaults to https://huggingface.co
  - `hf.revision`  — fallback revision when a path has no `@<revision>`

`OpenDalResolvingStorage` recognises the `hf` scheme and lazily
constructs a per-scheme storage instance. `delete_stream` groups paths
by `<repo_type>/<repo_id>` so that bucket and dataset paths to the same
repo do not share an operator.

## CDC (content-defined chunking) table properties

New table properties under the `write.parquet.content-defined-chunking.*`
namespace (matching PyIceberg convention):

  - `write.parquet.content-defined-chunking.enabled`        (bool, default false)
  - `write.parquet.content-defined-chunking.min-chunk-size` (bytes, default 256 KiB)
  - `write.parquet.content-defined-chunking.max-chunk-size` (bytes, default 1 MiB)
  - `write.parquet.content-defined-chunking.norm-level`     (i32, default 0)

CDC is opt-in: it activates only when `enabled = "true"` is set
explicitly. Size/level properties without the enabled flag are parsed
and stored but have no effect. Defaults match parquet's own
`CdcOptions` defaults so the Iceberg layer stays in sync. CDC options
are applied directly in the DataFusion physical write plan.
Two jobs gated on HF_TOKEN: Rust opendal-hf integration tests and
Python CDC + HF tests. The Python HF test writes a table via PyIceberg
and reads it back via IcebergDataFusionTable using the opendal-hf backend.

Env vars: HF_TOKEN, HF_BUCKET, HF_DATASET.
@kszucs requested a review from blackmwk May 11, 2026 18:05
name: HuggingFace Hub integration tests
runs-on: ubuntu-latest
# Skip the job entirely when HF secrets are not available (e.g. PRs from forks).
if: ${{ secrets.HF_TOKEN != '' }}
Member Author

HF doesn't have a MinIO-like local setup, so we should configure a free HuggingFace account for CI.

@kszucs (Member Author) commented May 11, 2026

@Xuanwo could you please take a look?

/// Only the fields required by this crate are stored; revision is consumed
/// during parsing but not retained.
#[derive(Debug, Clone, PartialEq, Eq)]
pub(crate) struct HfUri {
Member Author

Almost identical to the HfUri in opendal, which just isn't exposed yet.
