SAM Tokenizer: AMFORGE/sam_tokenizer
Official tokenizer for SAM (Structured Action Model) by AMFORGE. Built on NexusBPE, AMFORGE's in-house tokenization architecture designed for structured action generation across heterogeneous domains.
What it does
A single tokenizer that handles 10 production domains with uniform quality: robotics, HTTP / REST APIs, MQTT / IoT messaging, databases, workflow orchestration, e-commerce, autonomous vehicles, smart home, calendar / email, and filesystem operations.
Why it matters
Generic LLM tokenizers shred coordinates and identifiers into fragments:
0.5 → ['0', '.', '5'] (3 tokens)
-1.2 → ['-', '1', '.', '2'] (4 tokens)
8080 → ['8', '0', '80'] (3 tokens)
This destroys numeric precision, balloons sequence length, and forces the model to learn arithmetic from character soup. NexusBPE keeps these atomic by construction, while still compressing prose efficiently.
| Input | Generic tokenizer | NexusBPE |
|---|---|---|
| move to x=0.5 y=-1.2 z=0.8 | ~16 tokens | ~6 tokens |
| POST /api/v1/orders | ~8 tokens | ~3 tokens |
| GET /users → 404 | ~6 tokens | ~3 tokens |
Lower sequence length means lower latency, lower memory use, and sharper attention on the parts that matter.
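To make the contrast concrete, here is a minimal sketch of number-preserving pre-tokenization. It is not NexusBPE's actual algorithm, which this card does not describe. The `ATOMIC` pattern and both helper functions are hypothetical and only illustrate the idea of keeping numeric literals whole before any subword merging happens.

```python
import re

# Hypothetical pattern (an assumption, not NexusBPE's rule set):
# a signed decimal number, or a run of letters, or any single non-space character.
ATOMIC = re.compile(r"[-+]?\d+(?:\.\d+)?|[A-Za-z_]+|\S")


def char_split(text: str) -> list[str]:
    """Worst case: every non-space character becomes its own piece."""
    return [c for c in text if not c.isspace()]


def atomic_split(text: str) -> list[str]:
    """Keep numeric literals whole; split everything else coarsely."""
    return ATOMIC.findall(text)


sample = "move to x=0.5 y=-1.2 z=0.8"
print(atomic_split(sample))
print(len(char_split(sample)), "pieces vs", len(atomic_split(sample)), "pieces")
```

A real tokenizer would still map each piece to vocabulary IDs afterwards; the sketch only shows why treating `0.5` or `-1.2` as one unit shortens the sequence.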
Highlights
- Vocab size: 12000
- Atomic guarantees: every coordinate, status code, port, frequency, and angle in the supported ranges encodes to a single token
- Domain coverage: 10 first-class domains via dedicated marker tokens
- Schema-conditioned: native support for JSON Schema in-context conditioning
- Reversible: bit-perfect roundtrip on all structured payloads
- Deterministic: identical input → identical token IDs across runs
- Compact: ~3× shorter sequences than generic LLM tokenizers on agentic tasks
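The reversibility and determinism claims can be spot-checked on your own payloads. The helper below is a sketch, not part of the SAM tooling: `check_tokenizer` is a hypothetical name, and the byte-level `toy_encode` / `toy_decode` pair is a stand-in so the example runs without downloading the model file.

```python
from typing import Callable, Iterable


def check_tokenizer(
    encode: Callable[[str], list[int]],
    decode: Callable[[list[int]], str],
    samples: Iterable[str],
    runs: int = 3,
) -> list[str]:
    """Return the samples that fail the determinism or roundtrip check."""
    failures = []
    for text in samples:
        first = encode(text)
        # Determinism: repeated encodes must yield the same IDs.
        deterministic = all(encode(text) == first for _ in range(runs - 1))
        # Reversibility: decoding must reproduce the input exactly.
        reversible = decode(first) == text
        if not (deterministic and reversible):
            failures.append(text)
    return failures


# Stand-in codec (UTF-8 bytes as IDs), used only to make the sketch runnable.
def toy_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))


def toy_decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")


print(check_tokenizer(toy_encode, toy_decode, ["POST /api/v1/orders", "x=0.5"]))
```

With the real tokenizer, pass `tok.encode` and `tok.decode` from the Loading section instead of the toy codec. Exact string equality is a strict test: depending on how the model's text normalization is configured, inputs with unusual whitespace may come back slightly altered, so it is worth including such cases in `samples`.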
Loading
The tokenizer ships as a binary model file. Load it via the lightweight NexusBPE wrapper:
import sentencepiece as spm
from huggingface_hub import hf_hub_download


class NexusBPE:
    """Minimal loader for SAM / NexusBPE tokenizers."""

    def __init__(self, model_path: str):
        # SentencePiece is the underlying runtime (an implementation detail).
        self._sp = spm.SentencePieceProcessor()
        self._sp.Load(model_path)
        self.vocab_size = self._sp.GetPieceSize()
        self.pad_id = self._sp.pad_id()
        self.eos_id = self._sp.eos_id()

    def encode(self, text: str) -> list[int]:
        """Convert text to a list of token IDs."""
        return self._sp.EncodeAsIds(text)

    def decode(self, ids) -> str:
        """Convert token IDs back to text."""
        return self._sp.DecodeIds(list(ids))


# Download the model file from the Hub and load it.
path = hf_hub_download(repo_id="AMFORGE/sam_tokenizer", filename="sam_tokenizer.model")
tok = NexusBPE(path)

ids = tok.encode('<ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK>')
print(f"Tokens: {len(ids)}")
print(f"Roundtrip: {tok.decode(ids)}")
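For batched inference, the encoded sequences usually need to be padded to a common length. The sketch below shows one way to do that with the `pad_id` exposed by the loader. `pad_batch` is a hypothetical helper, not part of the tokenizer. One caveat: SentencePiece reports `pad_id()` as -1 when a model was trained without a padding piece, and this card does not say whether the SAM model defines one, so the helper rejects negative IDs rather than guessing.

```python
def pad_batch(
    seqs: list[list[int]], pad_id: int
) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad sequences to equal length; return (padded_ids, attention_mask)."""
    if pad_id < 0:
        raise ValueError("tokenizer has no padding token; choose a pad ID explicitly")
    width = max((len(s) for s in seqs), default=0)
    padded = [s + [pad_id] * (width - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding.
    mask = [[1] * len(s) + [0] * (width - len(s)) for s in seqs]
    return padded, mask


padded, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(padded)  # [[5, 6, 7], [8, 0, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0]]
```

With the loaded tokenizer, the call would look like `pad_batch([tok.encode(t) for t in texts], tok.pad_id)`, where `texts` is your own list of prompts.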
Domain markers
The tokenizer reserves marker tokens for each supported domain so the model can condition its output on the active domain:
| Marker | Purpose |
|---|---|
| <ROS> | Robotics (ROS / ROS2) |
| <HTTP> | HTTP / REST APIs |
| <MQTT> | MQTT / IoT messaging |
| <DB> | Databases (SQL / NoSQL / Redis) |
| <WORKFLOW> | Workflow orchestration |
| <ECOMMERCE> | E-commerce |
| <VEHICLE> | Autonomous vehicles |
| <HOME> | Smart home |
| <CAL> | Calendar / email |
| <FILE> | Filesystem |
Plus structural markers (<SCHEMA>, <TASK>, <JSON>, <ACTION>, <META>) for schema-conditioned prompting.
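As an illustration of how the markers might be combined, here is a small prompt-building sketch. The only layout confirmed by this card is `<ROS><TASK>...</TASK>` from the Loading example. The `</SCHEMA>` closing tag, the position of the schema before the task, and the `build_prompt` helper itself are assumptions made for illustration; check the SAM model card for the exact prompt format.

```python
import json
from typing import Optional

# Domain markers as listed in the table above.
DOMAIN_MARKERS = {
    "<ROS>", "<HTTP>", "<MQTT>", "<DB>", "<WORKFLOW>",
    "<ECOMMERCE>", "<VEHICLE>", "<HOME>", "<CAL>", "<FILE>",
}


def build_prompt(domain: str, task: str, schema: Optional[dict] = None) -> str:
    """Assemble a domain-tagged prompt, optionally with a JSON Schema."""
    if domain not in DOMAIN_MARKERS:
        raise ValueError(f"unknown domain marker: {domain}")
    parts = [domain]
    if schema is not None:
        # Assumption: <SCHEMA> closes with </SCHEMA>, mirroring <TASK>...</TASK>.
        compact = json.dumps(schema, separators=(",", ":"), sort_keys=True)
        parts.append(f"<SCHEMA>{compact}</SCHEMA>")
    parts.append(f"<TASK>{task}</TASK>")
    return "".join(parts)


print(build_prompt("<ROS>", "move to x=0.5 y=-1.2 z=0.8"))
print(build_prompt("<HTTP>", "POST /api/v1/orders", {"type": "object"}))
```

Serializing the schema compactly and with sorted keys keeps the prompt short and makes the same schema always produce the same string, which fits the tokenizer's determinism property.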
Used by
AMFORGE/sam-v1: the SAM model
License
Apache-2.0. Free for research and commercial use. Attribution appreciated.
Citation
@misc{sam_tokenizer_2026,
title = {SAM Tokenizer: NexusBPE for Multi-Domain Structured Action Generation},
author = {AMFORGE},
year = {2026},
url = {https://huggingface.co/AMFORGE/sam_tokenizer}
}
Built with NexusBPE by AMFORGE: https://huggingface.co/AMFORGE