
Native tokenization engine, drop dask runtime dep #208

Merged
timkpaine merged 1 commit into main from pit/native-tokenize on May 13, 2026

Conversation

@ptomecek
Collaborator

Replaces the dask.base.normalize_token / tokenize re-exports with a native singledispatch engine, and drops dask from runtime dependencies (adds cloudpickle).

Builds on #196: compute_data_token, compute_cache_token, and compute_behavior_token keep their existing semantics; the underlying data-tokenization backend changes from dask to the native engine.

What's covered

Structural handlers for:

  • stdlib primitives, datetime / date / time / timedelta, Decimal, UUID, pathlib, Enum, complex, Ellipsis, slice
  • functools.partial, MappingProxyType, OrderedDict (order-preserving), types.MethodType / MethodWrapperType
  • types.CodeType and types.FunctionType (defaults / kwdefaults / closure cells participate), type
  • pydantic.BaseModel

Lazy-registered handlers for numpy.ndarray (object-dtype arrays recurse element-wise) and numpy.generic. Unknown types fall back to cloudpickle, raising TypeError if pickling fails.
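The dispatch-plus-fallback shape can be sketched as below. Handler names and details are illustrative, not the actual ccflow implementation, and stdlib pickle stands in for cloudpickle so the sketch is self-contained:

```python
import functools
import hashlib
import pickle  # stands in for cloudpickle in this sketch


@functools.singledispatch
def normalize_token(obj):
    # Unknown types fall back to a pickle digest; a failed pickle
    # surfaces as TypeError rather than a silent bad token.
    try:
        return ("__pickle__", hashlib.sha256(pickle.dumps(obj)).hexdigest())
    except Exception as exc:
        raise TypeError(f"cannot tokenize {type(obj).__name__}") from exc


@normalize_token.register(str)
@normalize_token.register(int)
@normalize_token.register(float)
@normalize_token.register(bytes)
@normalize_token.register(type(None))
def _(obj):
    # Primitives tokenize structurally: type name plus value.
    return (type(obj).__name__, obj)


@normalize_token.register(dict)
def _(obj):
    # Plain dicts sort items so insertion order never affects the token.
    return ("dict", tuple(sorted((k, normalize_token(v)) for k, v in obj.items())))


def tokenize(*args, **kwargs):
    # Variadic signature for drop-in compatibility with the dask API.
    payload = repr((
        tuple(normalize_token(a) for a in args),
        tuple(sorted((k, normalize_token(v)) for k, v in kwargs.items())),
    ))
    return hashlib.sha256(payload.encode()).hexdigest()
```

New handlers register with `@normalize_token.register(SomeType)`, which is how the numpy handlers can be attached lazily only once numpy is imported.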

tokenize() keeps its variadic (*args, **kwargs) signature for drop-in compatibility with the dask API.

Cycle detection via a module-level ContextVar returns a stable `("__cycle__", type_name)` marker instead of raising RecursionError.
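A minimal sketch of that cycle guard, with hypothetical names: the ContextVar tracks ids of containers currently on the normalization stack, so a self-reference yields the stable marker while a shared (acyclic) sub-object is re-tokenized normally:

```python
import contextvars

# Ids of containers currently being normalized on this stack.
_seen = contextvars.ContextVar("tokenize_seen", default=None)


def normalize(obj):
    seen = _seen.get()
    if seen is None:
        seen = set()
        _seen.set(seen)
    oid = id(obj)
    if oid in seen:
        # Back-edge: emit a stable marker instead of recursing forever.
        return ("__cycle__", type(obj).__name__)
    if isinstance(obj, dict):
        seen.add(oid)
        try:
            return ("dict", tuple((k, normalize(v)) for k, v in obj.items()))
        finally:
            # Discard on exit so shared siblings are not false-flagged.
            seen.discard(oid)
    if isinstance(obj, list):
        seen.add(oid)
        try:
            return ("list", tuple(normalize(v) for v in obj))
        finally:
            seen.discard(oid)
    return ("atom", repr(obj))
```

Discarding the id on the way out is what distinguishes a true cycle (the object is its own ancestor) from mere sharing (the same object appears under two different keys).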

Performance vs dask

Rough one-off comparisons (μs/op):

| Workload                          | dask | native | ratio                 |
|-----------------------------------|------|--------|-----------------------|
| Small primitive dict              | 8.9  | 5.8    | 0.65x                 |
| Pydantic BaseModel (6 fields)     | 1910 | 8.7    | 0.005x (~220x faster) |
| Nested BaseModel (100 sub-models) | 3778 | 704    | 0.19x (~5x faster)    |
| Function with comprehension       | 38.6 | 7.8    | 0.20x (~5x faster)    |
| ndarray int64, 1k                 | 5.6  | 9.1    | 1.6x                  |
| ndarray float64, 1M               | 441  | 4539   | ~10x slower           |
| ndarray object, 1k strings        | 13.4 | 236    | ~18x slower           |
| pd.Series int64, 1k               | 11.2 | 31.6   | ~3x slower            |

Wins are concentrated where ccflow actually computes cache keys: pydantic models and their compositions (~220x), small dicts, and functions. Losses are on bulk numpy / pandas data — dask streams a contiguous view directly into the hasher and has fast paths for string-dtype arrays; this implementation does `tobytes() → repr() → encode() → sha256` and recurses element-wise for object arrays. Worth a follow-up if multi-MB arrays become typical cache inputs.
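The bulk-data gap is mostly intermediate copies: the streaming approach feeds the contiguous buffer straight into the hasher, while the repr-based path round-trips the payload through Python strings. A minimal sketch of the two strategies, using stdlib `array.array` in place of a numpy ndarray so it stays dependency-free:

```python
import array
import hashlib


def hash_via_repr(buf: array.array) -> str:
    # The slow path described above: tobytes() -> repr() -> encode() -> sha256.
    # Each step materializes another full-size copy of the data.
    return hashlib.sha256(repr(buf.tobytes()).encode()).hexdigest()


def hash_streaming(buf: array.array) -> str:
    # Dask-style fast path: mix in dtype metadata, then stream the raw
    # buffer into the hasher with no intermediate string copies.
    h = hashlib.sha256()
    h.update(buf.typecode.encode())
    h.update(memoryview(buf).cast("B"))
    return h.hexdigest()
```

A streaming variant for numpy would similarly hash `dtype`/`shape` metadata plus a contiguous view of the array data, which is the shape of the perf follow-up mentioned below.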

Tests

818 passing (+128 new), including:

  • determinism / structural-portability regression tests for closures, comprehensions, nested defs, lambdas, object-dtype ndarrays
  • collision tests for OrderedDict order-sensitivity, distinct method bindings
  • cycle-detection coverage (self-referential dicts, lists, pydantic models, shared sub-objects not false-flagged as cycles)
  • variadic tokenize signature semantics
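The OrderedDict order-sensitivity collision test hinges on one asymmetry, sketched here with hypothetical handlers (the real ccflow handlers may differ): an OrderedDict token preserves insertion order, while a plain dict token is order-insensitive.

```python
import hashlib
from collections import OrderedDict


def _token(obj):
    # OrderedDict must be checked first: it is a dict subclass.
    if isinstance(obj, OrderedDict):
        # Insertion order participates in the token.
        return ("OrderedDict", tuple(obj.items()))
    if isinstance(obj, dict):
        # Plain dicts sort items, so key order never matters.
        return ("dict", tuple(sorted(obj.items())))
    return ("atom", repr(obj))


def tokenize(obj):
    return hashlib.sha256(repr(_token(obj)).encode()).hexdigest()
```

So two OrderedDicts with the same items in different order get distinct tokens, while the equivalent plain dicts collide by design.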

Out of scope

  • BaseModel.model_token property, frozen-model Merkle caching, __ccflow_tokenize__ extension hook — slated for a follow-up PR
  • Streaming hash for large numpy arrays — perf follow-up

@github-actions
Contributor

github-actions Bot commented May 12, 2026

Test Results

836 tests  +107   834 ✅ +107   1m 45s ⏱️ +5s
  1 suites ±  0     2 💤 ±  0 
  1 files   ±  0     0 ❌ ±  0 

Results for commit 6977e5b. ± Comparison against base commit 43eb0c5.

♻️ This comment has been updated with latest results.

@codecov

codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 95.79968% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.34%. Comparing base (43eb0c5) to head (6977e5b).

Files with missing lines Patch % Lines
ccflow/tests/utils/test_tokenize.py 94.40% 25 Missing ⚠️
ccflow/utils/tokenize.py 99.41% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #208      +/-   ##
==========================================
+ Coverage   95.33%   95.34%   +0.01%     
==========================================
  Files         142      142              
  Lines       10714    11327     +613     
  Branches      613      617       +4     
==========================================
+ Hits        10214    10800     +586     
- Misses        374      399      +25     
- Partials      126      128       +2     


@ptomecek ptomecek force-pushed the pit/native-tokenize branch 2 times, most recently from 3a9b80e to c312f6f Compare May 12, 2026 21:28
@ptomecek ptomecek force-pushed the pit/native-tokenize branch 2 times, most recently from 4dfb495 to 3c4d2ea Compare May 12, 2026 21:42
Replaces the dask-based normalize_token / tokenize re-exports with a
native singledispatch engine. Adds structural handlers for common types
(stdlib, datetime, Decimal, UUID, pathlib, Enum, partial, MappingProxy,
methods, OrderedDict, code objects, functions, types, pydantic BaseModel)
and lazy-registered handlers for numpy. Unknown types fall back to a
cloudpickle digest, raising TypeError when pickling fails.

The public tokenize() retains its variadic (*args, **kwargs) signature
for drop-in compatibility. compute_data_token, compute_cache_token, and
compute_behavior_token from PR #196 are unchanged.

Cycle detection via a module-level ContextVar prevents RecursionError on
self-referential structures and emits a stable __cycle__ marker.

Drops dask from runtime dependencies; adds cloudpickle.

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@ptomecek ptomecek force-pushed the pit/native-tokenize branch from 3c4d2ea to 6977e5b Compare May 12, 2026 21:56
@ptomecek ptomecek marked this pull request as ready for review May 12, 2026 21:59
@ptomecek ptomecek requested review from feussy and hintse as code owners May 12, 2026 21:59
@timkpaine timkpaine merged commit 997827f into main May 13, 2026
12 checks passed
@timkpaine timkpaine deleted the pit/native-tokenize branch May 13, 2026 18:36
