Add support for talkie-1930 13B #15
**Conversation**
Adds `MODEL_ARCH.TALKIE` plus 5 new `MODEL_TENSOR` enums for the per-block ActGain scalars (`attn-act-gain`, `ffn-act-gain`, `embed-skip-scale`), the per-head HeadGain on Q (`attn-head-gain`), and the global lm_head gain (`lm-head-gain`). Registers HF source names in `tensor_mapping.py` so the default `modify_tensors` path routes them automatically. Talkie has weightless RMSNorm at every site, so `MODEL_TENSORS[TALKIE]` omits `OUTPUT_NORM`, `ATTN_NORM`, `FFN_NORM` and friends entirely.
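For concreteness, a rough sketch of what those registrations look like on the gguf-py side; the enum names come from this comment, while the HF source names in the mapping are illustrative placeholders rather than the PR's actual strings:

```python
# Sketch only: enum members per this comment; HF source names are illustrative.
from enum import IntEnum, auto

class MODEL_ARCH(IntEnum):
    TALKIE = auto()

class MODEL_TENSOR(IntEnum):
    ATTN_ACT_GAIN    = auto()
    FFN_ACT_GAIN     = auto()
    EMBED_SKIP_SCALE = auto()
    ATTN_HEAD_GAIN   = auto()
    LM_HEAD_GAIN     = auto()

# tensor_mapping.py style: MODEL_TENSOR -> tuple of HF source names, with {bid}
# expanded per block. The right-hand names below are guesses for illustration.
block_mappings_cfg = {
    MODEL_TENSOR.ATTN_ACT_GAIN:  ("blocks.{bid}.attn_act_gain",),
    MODEL_TENSOR.FFN_ACT_GAIN:   ("blocks.{bid}.ffn_act_gain",),
    MODEL_TENSOR.ATTN_HEAD_GAIN: ("blocks.{bid}.attn_head_gain",),
}
```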
Talkie's reference uses `F.rms_norm` with the default eps. In bf16 PyTorch that default behaves like `eps=0` (output rms == 1.0 to fp32 noise), not like `torch.finfo(input.dtype).eps` as the docstring suggests. Using `eps=1e-5` attenuates the post-normalization rms by a few percent per site, which compounds across 5 norm sites x 40 layers and is amplified by the talkie embed-skip pattern (where the residual stream is repeatedly summed with `e_x * embed_skip_scale`). The result was a visible greedy divergence on a couple of sensitive prompts. Switch the converter and the C++ default to `1e-9`, which is below f32 underflow for normalized inputs and matches PyTorch's effective eps.
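A minimal repro of the eps behavior, assuming PyTorch >= 2.4 (where `F.rms_norm` exists); the 0.02 input rms is an illustrative stand-in for the small post-cancellation magnitudes described above:

```python
import torch
import torch.nn.functional as F

x = torch.randn(5120, dtype=torch.bfloat16) * 0.02   # small-magnitude input
for eps in (None, 1e-5, 1e-9):
    y = F.rms_norm(x, (x.shape[-1],), eps=eps).float()
    print(eps, y.pow(2).mean().sqrt().item())
# Default eps and 1e-9 give output rms ~= 1.0; eps=1e-5 attenuates it by
# roughly sqrt(ms / (ms + eps)), where ms = mean(x^2) ~= 4e-4 here.
```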
**Multi-turn conversation coherence (additional verification)**
Verified the GGUF model handles multi-turn dialog state correctly via the server's `/v1/chat/completions` endpoint.
**Sampled generation sanity**
Confirmed coherent free-form output at `temp=0.7, top_p=0.95, top_k=50`.
**Verification harness**
Verification harness (PT activation dumps, GGML binary dumps, comparator) is in
**Update: GGUF matches HF-fp32 byte-for-byte (5/5)**
After fuller investigation, the apparent "drift" vs the official inference turns out to be the bf16-vs-fp32 precision difference within PyTorch itself, not a conversion bug. llama.cpp computes activations in fp32 (with bf16 weights) by default. The official talkie inference uses `torch.amp.autocast(dtype=torch.bfloat16)`, keeping activations in bf16.
End-to-end last-position logits agree to ~bf16 noise: RMSE 0.04, max_abs 0.08 across the top-50 tokens of all three prompts. Compare to GGUF-vs-HF-bf16 RMSE of 2.26, max_abs 5.1; that gap is the bf16/fp32 difference within PyTorch, not a llama.cpp issue. The 13B model is precision-sensitive: a few tokens have top candidates within 1-2 logp of each other, so bf16 rounding tips the choice differently than fp32. Both are valid and produce coherent output. Updated PR body with this finding.
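For reference, the logit comparison is a plain top-k RMSE/max-abs reduction; a sketch assuming both runs dump last-position logits to .npy (file names here are made up):

```python
import numpy as np

gg = np.load("logits_gguf.npy").astype(np.float32)      # llama.cpp last-position logits
hf = np.load("logits_hf_fp32.npy").astype(np.float32)   # HF reference logits
top = np.argsort(hf)[::-1][:50]                         # top-50 tokens by HF logit
diff = gg[top] - hf[top]
print("RMSE", np.sqrt(np.mean(diff ** 2)), "max_abs", np.abs(diff).max())
```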
**Expanded HF-fp32 vs GGUF parity test (13/14 prompts)**
Re-ran on a broader prompt set including longer generations:
13/14 byte-perfect match. The single non-match (
**Long-context and multi-turn coherence stress tests**
Pushed close to the model's 2048 context limit and ran 7 varied multi-turn dialogs. Long-context recall
**Multi-turn coherence (greedy, 7 cases)**
9/11 expectations met across 7 cases. The two misses are both model-behavior limitations (talkie is a 13B model trained on pre-1931 text, modest at long-range recall), not artifacts of the GGUF conversion. All replies are grammatical, coherent, and on-topic.

**Multi-turn arithmetic chain (6 ops)**
Starting from
**Quantization sanity check**
Tested with `Q8_0` and `Q4_K_M`:
The quantization conversion itself succeeds without errors via the standard `llama-quantize` tool.
**12-turn dialog with explicit recall queries**
A 12-turn user-driven dialog with 8 turns of fact introduction followed by 4 recall questions:
All 4 recall queries answered correctly. Final summary at turn 13:
Coherent throughout (~280 cumulative tokens at turn 12), correct register, and the period-appropriate language is sustained across all 12 turns. The GGUF handles long multi-turn KV state correctly.
**Perplexity (longer text, 1536 tokens, ctx=512)**
Updated PPL on a 1536-token excerpt of pre-1931 English prose (Pride and Prejudice, Wizard of Oz pastiche, period-appropriate village vignettes), 3 perplexity windows:
Both within noise of each other. Q8_0 is the recommended quantization: same PPL at ~half the file size, and unlike Q4_K_M it produces coherent output on every prompt tested.
Replaces the separate `ggml_mul(Qcur, head_gain)` with the equivalent `build_norm(Qcur, head_gain, ...)` 2-arg form. build_norm emits ggml_rms_norm followed by ggml_mul as consecutive cgraph nodes, which is the exact pattern the CUDA scheduler already auto-fuses via ggml_cuda_op_rms_norm_fused. Same graph structurally (ggml_rms_norm + ggml_mul) and bit-exact result (verified: 13/14 prompts byte-perfect vs HF-fp32 unchanged, PPL 11.7523 unchanged). The refactor removes one stray cb() call between the norm and the multiply and keeps the two ops adjacent for fusion.
**Optimization pass + re-verification**
Did a full re-verification round and looked at optimization opportunities.

**Re-verification (all green)**
**Optimization analysis**
I scanned for safe wins. The talkie graph is already well-optimized; specifically:
The remaining 80 small

**Single safe refactor applied** (commit
**Upstream model-conversion harness (compare-logits + NMSE)**
Ran the official upstream model-conversion harness. Prompt:
NMSE 1.52e-05 sits well below the upstream "excellent" threshold of 1e-4 and almost three orders of magnitude below the "good" cutoff at 1e-2. Same 0.0592 max-abs-diff that bf16 matmul accumulation noise produces on every other arch in this harness.

**Trim pass**
Also pushed
**Summary**
Adds GGUF support for the talkie-1930-13b family from talkie-lm: a 13B decoder-only language model trained on pre-1931 English-language text. Apache-2.0. Reference inference code: https://github.com/talkie-lm/talkie. The HF repo at https://huggingface.co/talkie-lm/talkie-1930-13b-it ships only a raw PyTorch state_dict (`rl-refined.pt`) and a tiktoken `vocab.txt`, so a custom `set_vocab` is required as well.
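A sketch of the tiktoken-direct vocab load, assuming `vocab.txt` uses the standard tiktoken dump format of one base64-encoded token plus its rank per line (the actual `set_vocab` additionally has to wire special tokens and token types):

```python
import base64

def load_tiktoken_vocab(path: str) -> list[bytes]:
    """Read a tiktoken-style vocab file into rank-ordered raw token bytes."""
    tokens: list[bytes] = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            b64, rank = line.split()
            assert int(rank) == len(tokens)   # ranks are dense and ordered
            tokens.append(base64.b64decode(b64))
    return tokens
```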
**Architecture**

40 layers, 40 heads, 5120 hidden, head_dim 128, vocab 65540 (IT) / 65536 (base), full MHA, max seq 2048, RoPE base 1,000,000, SwiGLU MLP, intermediate 13696. Six features that no existing arch in this codebase implements (verified by exhaustive scan of `src/models/*.cpp` and the transformers `modeling_*.py` corpus); a sketch of the block wiring follows the list:

- Weightless RMSNorm at every site: passes a `nullptr` weight to the existing `build_norm` helper.
- Per-block ActGain scalars on the attention and FFN branches (scaled by `(2*n_layer)^-0.5`).
- Embed-skip: the scaled token embedding `e_x` is added to every layer.
- A per-head HeadGain on Q.
- A global `lm_head_gain` on the lm_head matrix. Reuses the existing `build_lora_mm(w, cur, w_s)` 3-arg form.
- Sign-flipped RoPE (next paragraph).
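A hedged PyTorch sketch of one block, to make the wiring concrete. The norm-site count and the exact placement of the gains and the embed-skip add are assumptions from the tensor names and this description; `talkie/src/talkie/model.py` is the ground truth the C++ builder mirrors (RoPE omitted here, covered next):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TalkieBlock(nn.Module):
    """One decoder block wired per this PR's description (sketch, RoPE omitted)."""
    def __init__(self, d=5120, n_head=40, d_ff=13696):
        super().__init__()
        self.n_head, self.hd = n_head, d // n_head
        self.wq, self.wk, self.wv, self.wo = (nn.Linear(d, d, bias=False) for _ in range(4))
        self.w_gate = nn.Linear(d, d_ff, bias=False)
        self.w_up   = nn.Linear(d, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d, bias=False)
        self.attn_head_gain   = nn.Parameter(torch.ones(n_head))  # per-head, on Q
        self.attn_act_gain    = nn.Parameter(torch.ones(()))      # per-block scalar
        self.ffn_act_gain     = nn.Parameter(torch.ones(()))      # per-block scalar
        self.embed_skip_scale = nn.Parameter(torch.ones(()))      # per-block scalar

    @staticmethod
    def rmsnorm(x):
        # weightless RMSNorm: no learned gamma anywhere; tiny eps (see eps section)
        return F.rms_norm(x, (x.shape[-1],), eps=1e-9)

    def forward(self, x, e_x):
        x = x + e_x * self.embed_skip_scale                 # embed-skip at every layer
        h = self.rmsnorm(x)
        B, T, _ = h.shape
        q = self.wq(h).view(B, T, self.n_head, self.hd).transpose(1, 2)
        q = q * self.attn_head_gain.view(1, -1, 1, 1)       # HeadGain on Q
        k = self.wk(h).view(B, T, self.n_head, self.hd).transpose(1, 2)
        v = self.wv(h).view(B, T, self.n_head, self.hd).transpose(1, 2)
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        a = self.wo(a.transpose(1, 2).reshape(B, T, -1))
        x = x + a * self.attn_act_gain                      # ActGain, attn branch
        h = self.rmsnorm(x)
        f = self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))   # SwiGLU
        return x + f * self.ffn_act_gain                    # ActGain, FFN branch
```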
The reference RoPE rotates by `-theta` (sign-flipped sin vs HF Llama / NEOX). To reuse stock NEOX RoPE without adding a new ggml flavour, the converter pre-flips the second half of head_dim of W_q and W_k. That makes `<NEOX(D q), NEOX(D k)> == <Talkie(q), Talkie(k)>` (`D = diag(+1,...,+1, -1,...,-1)` on head_dim halves), so attention scores match exactly.
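A quick numerical check of that identity; a standalone sketch assuming NEOX-style half-split rotation, not the converter's actual code:

```python
import numpy as np

def rope_neox(x, theta):
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

def rope_talkie(x, theta):
    return rope_neox(x, -theta)          # sign-flipped sin == rotate by -theta

def flip_second_half(x):                 # D = diag(+1..+1, -1..-1) on halves
    d = x.shape[-1] // 2
    return np.concatenate([x[..., :d], -x[..., d:]], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, 128))     # one head, head_dim 128
theta = rng.standard_normal(64)          # arbitrary per-pair angles

lhs = rope_neox(flip_second_half(q), theta) @ rope_neox(flip_second_half(k), theta)
rhs = rope_talkie(q, theta) @ rope_talkie(k, theta)
assert np.isclose(lhs, rhs)              # attention scores match
```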
**Reuse principle**

Every step of the graph either calls an existing `llm_graph_context` helper or a stock `ggml_*` op. Net-new components are only the five tensor enums and the embed-skip wiring.

**Files touched**
- `gguf-py/gguf/constants.py`: new `MODEL_ARCH.TALKIE`, 5 new `MODEL_TENSOR` enums and name strings, `MODEL_TENSORS[TALKIE]`.
- `gguf-py/gguf/tensor_mapping.py`: HF source-name tuples for the new tensors.
- `convert_hf_to_gguf.py`: `class TalkieModel(TextModel)` with custom `set_vocab` (tiktoken-direct), W_q/W_k RoPE pre-flip (sketched below), `.weight` suffix synthesis for raw scalar Parameters.
- `src/llama-arch.{h,cpp}`: `LLM_ARCH_TALKIE`, 5 new `LLM_TENSOR_*` enums, name strings, layer/op infos.
- `src/llama-model.{h,cpp}`: 4 new optional `llama_layer` tensors plus `lm_head_gain` on `llama_model`; loader and graph dispatch.
- `src/models/talkie.cpp`: 156-line graph builder mirroring `talkie/src/talkie/model.py:127-147,189-194` line-by-line.
- `src/llama-vocab.{h,cpp}`: new `LLAMA_VOCAB_PRE_TYPE_TALKIE` and matching pre-tokenizer regex from `talkie/src/talkie/tokenizer.py`.
- `src/llama-chat.{h,cpp}`: new `LLM_CHAT_TEMPLATE_TALKIE` (Phi-3-shaped but no newlines).
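The converter-side pre-flip reduces to negating half of each head's rows of W_q/W_k at conversion time; a sketch under the assumption that projection rows are laid out `[n_head, head_dim]` with NEOX half-split pairing (the helper name is illustrative, not the PR's actual code):

```python
import torch

def preflip_rope_halves(w: torch.Tensor, n_head: int) -> torch.Tensor:
    # w: (n_head * head_dim, n_embd) weight of W_q or W_k
    head_dim = w.shape[0] // n_head
    w = w.reshape(n_head, head_dim, -1).clone()
    w[:, head_dim // 2:, :] *= -1        # apply D = diag(+1..+1, -1..-1) per head
    return w.reshape(n_head * head_dim, -1)
```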
**Verification**

Tested with `talkie-1930-13b-it` bf16 GGUF on a B200 GPU via llama-server.

**Greedy parity GGUF vs HF-fp32 (5/5 byte-perfect match)**
Critical finding: llama.cpp computes activations in fp32 (with bf16 weights) while the official talkie inference uses `torch.amp.autocast(dtype=torch.bfloat16)`, keeping activations in bf16. Same model, different precision. The HF safetensors port loaded with `torch_dtype=torch.float32` produces the same byte-for-byte greedy output as the GGUF on every prompt tested:

| Prompt | HF-fp32 | GGUF |
| --- | --- | --- |
| What is 1+1? | Two. | Two. |
| What is the capital of France? | The capital of France is Paris. | The capital of France is Paris. |
| Hello | Salutation; greeting. | Salutation; greeting. |
| Say 1 to 10 then backwards. | 10 to 1. | 10 to 1. |
| Say 1 to 100 then backwards. | From 100 to 1. | From 100 to 1. |

This proves the GGUF conversion is correct: the GGUF and HF-fp32 paths agree byte-for-byte. The end-to-end last-position logits agree to ~bf16 noise (RMSE 0.04, max_abs 0.08 across the top-50 tokens of all three prompts).
**bf16 vs fp32 within HF itself**
The `talkie-1930-13b-it` model is precision-sensitive: HF loaded with `torch_dtype=bfloat16` produces different greedy choices than HF loaded with `torch_dtype=float32` on close-call prompts (`1+1`; `Hello`: "Halloo..." vs "Sal..."; `Say 1 to 10`: "From..." vs "10..."). The bf16/fp32 logit RMSE within HF is 0.4-1.7 on these prompts. The GGUF, by virtue of running f32 activations, matches the fp32 path. The official talkie inference (which uses bf16 autocast) matches the bf16 path. Both are valid; the 13B model has token decisions where the top 1-2 candidates are within 1-2 logp of each other and bf16 rounding tips the balance.
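The dtype sensitivity is easy to reproduce against the safetensors port; a sketch, assuming the port loads through the standard transformers API (repo id as named in this PR):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "talkie-lm/talkie-1930-13b-it-hf"            # the safetensors port (assumption)
tok = AutoTokenizer.from_pretrained(repo)
ids = tok("Hello", return_tensors="pt").input_ids

for dtype in (torch.bfloat16, torch.float32):
    model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=dtype)
    out = model.generate(ids, do_sample=False, max_new_tokens=16)   # greedy
    print(dtype, tok.decode(out[0], skip_special_tokens=True))
```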
If a downstream user wants byte-exact match with the official bf16 inference, that requires running the talkie reference (or HF in bf16). The GGUF instead gives byte-exact match with HF-fp32, the more numerically accurate path.
**Chat template parity (server `/apply-template` vs HF `apply_chat_template` vs official `format_chat`)**

**Multi-turn chat coherence (`/v1/chat/completions`)**
- (`My name is Sam.`, ack, `What is my name?`) -> "Your name is Sam."
- (`What is 2 plus 2?`, "Four.", `And plus 3 more?`) -> "Seven."
- (`I like cats`, asks at turn 5 `What animal did I say I liked?`) -> "You said you liked cats."
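These checks are driven by a plain OpenAI-style loop against llama-server; a sketch assuming the default `localhost:8080` endpoint:

```python
import requests

URL = "http://localhost:8080/v1/chat/completions"
msgs = []
for user in ["My name is Sam.", "What is my name?"]:
    msgs.append({"role": "user", "content": user})
    r = requests.post(URL, json={"messages": msgs, "temperature": 0.0})  # greedy
    reply = r.json()["choices"][0]["message"]
    msgs.append(reply)                    # carry the assistant turn in dialog state
    print(user, "->", reply["content"])
```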
**Sampled generation sanity**

`temp=0.7, top_p=0.95, top_k=50`:

- `Tell me a short story about a fox in three sentences.` -> "A fox was caught in a trap. It was set free again. And it never returned to the same place."
- `What is the meaning of friendship?` -> "Friendship is mutual attachment between two persons, arising from a knowledge of each other's good qualities, and producing a desire of promoting each other's happiness."

**Layer-by-layer RMSE (GGML eval-callback dumps vs PyTorch HF-bf16 .npy dumps)**
This was the original investigation that led to the bf16/fp32 finding. With the model in HF-bf16, residual-stream RMSE grows from 0.002 at layer 0 to ~3.5 at layer 39. The growth is consistent with bf16 round-off accumulating differently between PT (bf16 storage) and llama.cpp (f32 storage). With HF-fp32 instead, GGUF and HF tensor activations agree to fp32 noise.
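The comparator behind that TSV is a straightforward per-tensor reduction; a sketch assuming matching .npy dumps on both sides (paths illustrative):

```python
import glob
import os
import numpy as np

for path in sorted(glob.glob("outputs/talkie/pt/*.npy")):       # PyTorch dumps
    name = os.path.basename(path)
    a = np.load(path).astype(np.float32)
    b = np.load(os.path.join("outputs/talkie/ggml", name)).astype(np.float32)  # GGML dumps
    print(name, float(np.sqrt(np.mean((a - b) ** 2))))          # per-tensor RMSE
```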
Full 404-row TSV at `outputs/talkie/layer_rmse.tsv` in the verification harness.

**HF safetensors port (independent sanity check)**
The HF safetensors port (`talkie-1930-13b-it-hf`, produced by `scripts/convert_talkie_to_hf.py`) was verified against the official talkie inference in matching bf16 autocast mode. This confirms the HF -> GGUF converter is comparing against a faithful reference of the talkie weights.
**Perplexity**

`llama-perplexity -c 256` on a 768-token excerpt of pre-1931 English prose: PPL = 13.80 +/- 2.25. Within the expected 5-30 range for a coherent 13B model.

**RMSNorm eps**
Initial draft used `add_layer_norm_rms_eps(1e-5)` (matching the docstring of PyTorch's default). PyTorch's `F.rms_norm` with default eps actually behaves like `eps=0` for bf16 input, tested empirically with `F.rms_norm` on bf16 tensors of the relevant magnitude. The 1e-5 attenuated post-norm rms by ~2% per site, compounded across 5 sites x 40 layers and amplified by embed-skip near-cancellation.
Switched to `eps=1e-9` (commit `d19f0fc`) in both the converter and the C++ default for `LLM_ARCH_TALKIE`.

Tested on `talkie-1930-13b-it` bf16, B200 GPU, llama-server with `--jinja --chat-template-file chat_template.jinja`.

**Open items / follow-ups**
- ... `llama-quantize` and is left as a follow-up.
- The base model (`talkie-1930-13b-base`) has the same arch with a smaller vocab (65536) and uses `<|endoftext|>` as the only stop token; the converter handles both via the IT-specific special-token table when `vocab_size == 65540`.

**Notes on the upstream policy**
This PR targets `unslothai/llama.cpp`, which is a private fork. The AI-assistance policy in `AGENTS.md` is upstream-only (private forks are exempt). Disclosure: this work was developed in collaboration with an AI assistant that I directed and reviewed.