Native tokenization engine, drop dask runtime dep (#208)
Merged
Codecov Report
❌ Patch coverage is

```
@@           Coverage Diff            @@
##             main     #208    +/-   ##
=========================================
+ Coverage   95.33%   95.34%    +0.01%
=========================================
  Files         142      142
  Lines       10714    11327     +613
  Branches      613      617       +4
=========================================
+ Hits        10214    10800     +586
- Misses        374      399      +25
- Partials      126      128       +2
```
timkpaine reviewed May 12, 2026
Replaces the dask-based `normalize_token` / `tokenize` re-exports with a native singledispatch engine. Adds structural handlers for common types (stdlib, datetime, Decimal, UUID, pathlib, Enum, partial, MappingProxy, methods, OrderedDict, code objects, functions, types, pydantic BaseModel) and lazy-registered handlers for numpy. Unknown types fall back to a cloudpickle digest, raising `TypeError` when pickling fails.

The public `tokenize()` retains its variadic `(*args, **kwargs)` signature for drop-in compatibility. `compute_data_token`, `compute_cache_token`, and `compute_behavior_token` from PR #196 are unchanged.

Cycle detection via a module-level `ContextVar` prevents `RecursionError` on self-referential structures and emits a stable `__cycle__` marker.

Drops dask from runtime dependencies; adds cloudpickle.

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
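As a side note on the "lazy-registered handlers for numpy" part: one common way to do this with `functools.singledispatch` is to attempt registration from the default handler when the third-party module is already imported. The sketch below assumes hypothetical names (`normalize`, `_register_numpy`) and is not the actual ccflow code.

```python
import sys
from functools import singledispatch


@singledispatch
def normalize(obj):
    # Default path: before falling back, try attaching handlers for
    # third-party types whose module is already imported.
    if "numpy" in sys.modules and not _numpy_registered[0]:
        _register_numpy()
        handler = normalize.dispatch(type(obj))
        # registry[object] is the undecorated default; anything else
        # means a handler for this type was just registered.
        if handler is not normalize.registry[object]:
            return handler(obj)
    return ("fallback", repr(obj))


_numpy_registered = [False]


def _register_numpy():
    _numpy_registered[0] = True
    import numpy as np

    @normalize.register(np.ndarray)
    def _(arr):
        # Hash-stable structural description of the array.
        return ("ndarray", str(arr.dtype), arr.shape, arr.tobytes())
```

This keeps numpy out of the import path entirely: the handler only exists once an ndarray (or other numpy type) actually reaches the engine.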
Replaces the `dask.base.normalize_token` / `tokenize` re-exports with a native singledispatch engine, and drops `dask` from runtime dependencies (adds `cloudpickle`).

Builds on #196: `compute_data_token`, `compute_cache_token`, and `compute_behavior_token` keep their existing semantics; the underlying data-tokenization backend changes from dask to the native engine.

What's covered

Structural handlers for: stdlib primitives, `datetime`/`date`/`time`/`timedelta`, `Decimal`, `UUID`, `pathlib`, `Enum`, `complex`, `Ellipsis`, `slice`, `functools.partial`, `MappingProxyType`, `OrderedDict` (order-preserving), `types.MethodType`/`MethodWrapperType`, `types.CodeType`, `types.FunctionType` (defaults / kwdefaults / closure cells participate), `type`, and `pydantic.BaseModel`.

Lazy-registered handlers for `numpy.ndarray` (object-dtype arrays recurse element-wise) and `numpy.generic`. Unknown types fall back to `cloudpickle`, raising `TypeError` if pickling fails.

`tokenize()` keeps its variadic `(*args, **kwargs)` signature for drop-in compatibility with the dask API.

Cycle detection via a module-level `ContextVar` returns a stable `("__cycle__", type_name)` marker instead of raising `RecursionError`.

Performance vs dask
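For reviewers skimming the description, the overall shape of such an engine (singledispatch handlers, cloudpickle fallback, `ContextVar`-based cycle detection, variadic `tokenize()`) can be sketched roughly as follows. All names here (`normalize`, `_guarded`, `_seen`) are illustrative, not the actual ccflow implementation.

```python
import hashlib
from contextvars import ContextVar
from functools import singledispatch

# Tracks ids of containers on the current normalization path.
_seen: ContextVar[frozenset] = ContextVar("_seen", default=frozenset())


@singledispatch
def normalize(obj):
    # Fallback for unregistered types: digest the cloudpickle byte
    # stream, surfacing unpicklable objects as TypeError.
    import cloudpickle  # runtime dependency added by this PR

    try:
        return ("__pickled__", hashlib.sha256(cloudpickle.dumps(obj)).hexdigest())
    except Exception as exc:
        raise TypeError(f"cannot tokenize object of type {type(obj)!r}") from exc


@normalize.register(str)
@normalize.register(bytes)
@normalize.register(int)
@normalize.register(float)
@normalize.register(bool)
@normalize.register(type(None))
def _normalize_primitive(obj):
    return (type(obj).__name__, obj)


@normalize.register(dict)
def _normalize_dict(obj):
    return ("dict", tuple(sorted((normalize(k), _guarded(v)) for k, v in obj.items())))


@normalize.register(list)
@normalize.register(tuple)
def _normalize_sequence(obj):
    return (type(obj).__name__, tuple(_guarded(item) for item in obj))


def _guarded(obj):
    # Recurse, emitting a stable marker instead of RecursionError when
    # the same container reappears on the current path.
    seen = _seen.get()
    if id(obj) in seen:
        return ("__cycle__", type(obj).__name__)
    token = _seen.set(seen | {id(obj)})
    try:
        return normalize(obj)
    finally:
        _seen.reset(token)


def tokenize(*args, **kwargs):
    # Variadic signature kept for drop-in compatibility with dask's API.
    normalized = _guarded((args, tuple(sorted(kwargs.items()))))
    return hashlib.sha256(repr(normalized).encode()).hexdigest()
```

The `ContextVar` (rather than a plain module global) keeps the seen-set isolated per async task or thread, so concurrent tokenizations do not see each other's paths.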
Rough one-off comparisons (μs/op):
Wins are concentrated where ccflow actually computes cache keys: pydantic models and their compositions (~220x), small dicts, and functions. Losses are on bulk numpy / pandas data — dask streams a contiguous view directly into the hasher and has fast paths for string-dtype arrays; this implementation does `tobytes() → repr() → encode() → sha256` and recurses element-wise for object arrays. Worth a follow-up if multi-MB arrays become typical cache inputs.
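To make the bulk-data gap concrete, here is a sketch of the two hashing routes being contrasted, under assumed helper names (`token_via_repr`, `token_via_buffer`); neither function is the exact ccflow or dask code.

```python
import hashlib

import numpy as np


def token_via_repr(arr: np.ndarray) -> str:
    # Route described above: tobytes() -> repr() -> encode() -> sha256,
    # which materializes an intermediate repr string of the raw bytes.
    return hashlib.sha256(repr(arr.tobytes()).encode()).hexdigest()


def token_via_buffer(arr: np.ndarray) -> str:
    # dask-style route: stream the contiguous buffer straight into the
    # hasher with no intermediate string; dtype and shape are mixed in
    # so arrays with identical bytes but different layout differ.
    h = hashlib.sha256()
    h.update(str(arr.dtype).encode())
    h.update(repr(arr.shape).encode())
    h.update(memoryview(np.ascontiguousarray(arr)).cast("B"))
    return h.hexdigest()
```

The repr route roughly quadruples the bytes fed to the hasher (each byte becomes a multi-character escape in the repr), which is why the losses show up only on multi-MB arrays.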
Tests
818 passing (+128 new), including:
`OrderedDict` order-sensitivity, distinct method bindings, `tokenize` signature semantics

Out of scope

`BaseModel.model_token` property, frozen-model Merkle caching, and the `__ccflow_tokenize__` extension hook are slated for a follow-up PR.