
Native tokenization engine, drop dask runtime dep #208

Merged
timkpaine merged 1 commit into main from pit/native-tokenize on May 13, 2026

Conversation

@ptomecek
Collaborator

Replaces the dask.base.normalize_token / tokenize re-exports with a native singledispatch engine, and drops dask from runtime dependencies (adds cloudpickle).

Builds on #196: compute_data_token, compute_cache_token, and compute_behavior_token keep their existing semantics; the underlying data-tokenization backend changes from dask to the native engine.

What's covered

Structural handlers for:

  • stdlib primitives, datetime / date / time / timedelta, Decimal, UUID, pathlib, Enum, complex, Ellipsis, slice
  • functools.partial, MappingProxyType, OrderedDict (order-preserving), types.MethodType / MethodWrapperType
  • types.CodeType and types.FunctionType (defaults / kwdefaults / closure cells participate), type
  • pydantic.BaseModel

Lazy-registered handlers for numpy.ndarray (object-dtype arrays recurse element-wise) and numpy.generic. Unknown types fall back to cloudpickle, raising TypeError if pickling fails.
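The dispatch-plus-fallback shape can be sketched as below. Handler names and details are illustrative, not the actual ccflow implementation, and stdlib pickle stands in for cloudpickle so the sketch is self-contained:

```python
import functools
import hashlib
import pickle  # stands in for cloudpickle in this sketch


@functools.singledispatch
def normalize_token(obj):
    # Unknown types fall back to a pickle digest; a failed pickle
    # surfaces as TypeError rather than a silent bad token.
    try:
        return ("__pickle__", hashlib.sha256(pickle.dumps(obj)).hexdigest())
    except Exception as exc:
        raise TypeError(f"cannot tokenize {type(obj).__name__}") from exc


@normalize_token.register(str)
@normalize_token.register(int)
@normalize_token.register(float)
@normalize_token.register(bytes)
@normalize_token.register(type(None))
def _(obj):
    # Primitives tokenize structurally: type name plus value.
    return (type(obj).__name__, obj)


@normalize_token.register(dict)
def _(obj):
    # Plain dicts sort items so insertion order never affects the token.
    return ("dict", tuple(sorted((k, normalize_token(v)) for k, v in obj.items())))


def tokenize(*args, **kwargs):
    # Variadic signature for drop-in compatibility with the dask API.
    payload = repr((
        tuple(normalize_token(a) for a in args),
        tuple(sorted((k, normalize_token(v)) for k, v in kwargs.items())),
    ))
    return hashlib.sha256(payload.encode()).hexdigest()
```

New handlers register with `@normalize_token.register(SomeType)`, which is how the numpy handlers can be attached lazily only once numpy is imported.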

tokenize() keeps its variadic (*args, **kwargs) signature for drop-in compatibility with the dask API.

Cycle detection via a module-level ContextVar returns a stable `("__cycle__", type_name)` marker instead of raising RecursionError.
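A minimal sketch of that cycle guard, with hypothetical names: the ContextVar tracks ids of containers currently on the normalization stack, so a self-reference yields the stable marker while a shared (acyclic) sub-object is re-tokenized normally:

```python
import contextvars

# Ids of containers currently being normalized on this stack.
_seen = contextvars.ContextVar("tokenize_seen", default=None)


def normalize(obj):
    seen = _seen.get()
    if seen is None:
        seen = set()
        _seen.set(seen)
    oid = id(obj)
    if oid in seen:
        # Back-edge: emit a stable marker instead of recursing forever.
        return ("__cycle__", type(obj).__name__)
    if isinstance(obj, dict):
        seen.add(oid)
        try:
            return ("dict", tuple((k, normalize(v)) for k, v in obj.items()))
        finally:
            # Discard on exit so shared siblings are not false-flagged.
            seen.discard(oid)
    if isinstance(obj, list):
        seen.add(oid)
        try:
            return ("list", tuple(normalize(v) for v in obj))
        finally:
            seen.discard(oid)
    return ("atom", repr(obj))
```

Discarding the id on the way out is what distinguishes a true cycle (the object is its own ancestor) from mere sharing (the same object appears under two different keys).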

Performance vs dask

Rough one-off comparisons (μs/op):

| Workload                          | dask | native | ratio                 |
|-----------------------------------|------|--------|-----------------------|
| Small primitive dict              | 8.9  | 5.8    | 0.65x                 |
| Pydantic BaseModel (6 fields)     | 1910 | 8.7    | 0.005x (~220x faster) |
| Nested BaseModel (100 sub-models) | 3778 | 704    | 0.19x (~5x faster)    |
| Function with comprehension       | 38.6 | 7.8    | 0.20x (~5x faster)    |
| ndarray int64, 1k                 | 5.6  | 9.1    | 1.6x                  |
| ndarray float64, 1M               | 441  | 4539   | ~10x slower           |
| ndarray object, 1k strings        | 13.4 | 236    | ~18x slower           |
| pd.Series int64, 1k               | 11.2 | 31.6   | ~3x slower            |

Wins are concentrated where ccflow actually computes cache keys: pydantic models and their compositions (~220x), small dicts, and functions. Losses are on bulk numpy / pandas data — dask streams a contiguous view directly into the hasher and has fast paths for string-dtype arrays; this implementation does `tobytes() → repr() → encode() → sha256` and recurses element-wise for object arrays. Worth a follow-up if multi-MB arrays become typical cache inputs.
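The bulk-data gap is mostly intermediate copies: the streaming approach feeds the contiguous buffer straight into the hasher, while the repr-based path round-trips the payload through Python strings. A minimal sketch of the two strategies, using stdlib `array.array` in place of a numpy ndarray so it stays dependency-free:

```python
import array
import hashlib


def hash_via_repr(buf: array.array) -> str:
    # The slow path described above: tobytes() -> repr() -> encode() -> sha256.
    # Each step materializes another full-size copy of the data.
    return hashlib.sha256(repr(buf.tobytes()).encode()).hexdigest()


def hash_streaming(buf: array.array) -> str:
    # Dask-style fast path: mix in dtype metadata, then stream the raw
    # buffer into the hasher with no intermediate string copies.
    h = hashlib.sha256()
    h.update(buf.typecode.encode())
    h.update(memoryview(buf).cast("B"))
    return h.hexdigest()
```

A streaming variant for numpy would similarly hash `dtype`/`shape` metadata plus a contiguous view of the array data, which is the shape of the perf follow-up mentioned below.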

Tests

818 passing (+128 new), including:

  • determinism / structural-portability regression tests for closures, comprehensions, nested defs, lambdas, object-dtype ndarrays
  • collision tests for OrderedDict order-sensitivity, distinct method bindings
  • cycle-detection coverage (self-referential dicts, lists, pydantic models, shared sub-objects not false-flagged as cycles)
  • variadic tokenize signature semantics
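The OrderedDict order-sensitivity collision test hinges on one asymmetry, sketched here with hypothetical handlers (the real ccflow handlers may differ): an OrderedDict token preserves insertion order, while a plain dict token is order-insensitive.

```python
import hashlib
from collections import OrderedDict


def _token(obj):
    # OrderedDict must be checked first: it is a dict subclass.
    if isinstance(obj, OrderedDict):
        # Insertion order participates in the token.
        return ("OrderedDict", tuple(obj.items()))
    if isinstance(obj, dict):
        # Plain dicts sort items, so key order never matters.
        return ("dict", tuple(sorted(obj.items())))
    return ("atom", repr(obj))


def tokenize(obj):
    return hashlib.sha256(repr(_token(obj)).encode()).hexdigest()
```

So two OrderedDicts with the same items in different order get distinct tokens, while the equivalent plain dicts collide by design.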

Out of scope

  • BaseModel.model_token property, frozen-model Merkle caching, __ccflow_tokenize__ extension hook — slated for a follow-up PR
  • Streaming hash for large numpy arrays — perf follow-up

@github-actions
Contributor

github-actions Bot commented May 12, 2026

Test Results

836 tests  +107   834 ✅ +107   1m 45s ⏱️ +5s
  1 suites ±  0     2 💤 ±  0 
  1 files   ±  0     0 ❌ ±  0 

Results for commit 6977e5b. ± Comparison against base commit 43eb0c5.

♻️ This comment has been updated with latest results.

@codecov

codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 95.79968% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.34%. Comparing base (43eb0c5) to head (6977e5b).

Files with missing lines Patch % Lines
ccflow/tests/utils/test_tokenize.py 94.40% 25 Missing ⚠️
ccflow/utils/tokenize.py 99.41% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #208      +/-   ##
==========================================
+ Coverage   95.33%   95.34%   +0.01%     
==========================================
  Files         142      142              
  Lines       10714    11327     +613     
  Branches      613      617       +4     
==========================================
+ Hits        10214    10800     +586     
- Misses        374      399      +25     
- Partials      126      128       +2     


@ptomecek ptomecek force-pushed the pit/native-tokenize branch 2 times, most recently from 3a9b80e to c312f6f Compare May 12, 2026 21:28
@ptomecek ptomecek force-pushed the pit/native-tokenize branch 2 times, most recently from 4dfb495 to 3c4d2ea Compare May 12, 2026 21:42
Replaces the dask-based normalize_token / tokenize re-exports with a
native singledispatch engine. Adds structural handlers for common types
(stdlib, datetime, Decimal, UUID, pathlib, Enum, partial, MappingProxy,
methods, OrderedDict, code objects, functions, types, pydantic BaseModel)
and lazy-registered handlers for numpy. Unknown types fall back to a
cloudpickle digest, raising TypeError when pickling fails.

The public tokenize() retains its variadic (*args, **kwargs) signature
for drop-in compatibility. compute_data_token, compute_cache_token, and
compute_behavior_token from PR #196 are unchanged.

Cycle detection via a module-level ContextVar prevents RecursionError on
self-referential structures and emits a stable __cycle__ marker.

Drops dask from runtime dependencies; adds cloudpickle.

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@ptomecek ptomecek force-pushed the pit/native-tokenize branch from 3c4d2ea to 6977e5b Compare May 12, 2026 21:56
@ptomecek ptomecek marked this pull request as ready for review May 12, 2026 21:59
@ptomecek ptomecek requested review from feussy and hintse as code owners May 12, 2026 21:59
@timkpaine timkpaine merged commit 997827f into main May 13, 2026
12 checks passed
@timkpaine timkpaine deleted the pit/native-tokenize branch May 13, 2026 18:36
