Qianfan-OCR MLX 4-bit
Optimized for Apple Silicon (M1/M2/M3/M4)
π€ Original Model | π Technical Report | π» GitHub | π MLX-VLM
Introduction
This is a 4-bit quantized version of Qianfan-OCR optimized for Apple Silicon using the MLX framework. It delivers 2x faster generation speed with half the memory footprint while maintaining full OCR accuracy.
Qianfan-OCR is a 4B-parameter end-to-end document intelligence model developed by Baidu Qianfan Team, achieving #1 ranking on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8) among end-to-end models.
Why MLX 4-bit?
| Metric | Original (bfloat16) | MLX 4-bit | Improvement |
|---|---|---|---|
| Model Size | 9.5GB | 2.9GB | -69% π |
| Prefill Speed | ~1,250 tok/s | ~1,252 tok/s | Maintained |
| Generation Speed | ~65-69 tok/s | 145 tok/s | +111% π |
| Peak Memory | ~10.6GB | 4.7GB | -56% πΎ |
| OCR Accuracy | Perfect | Perfect | No Loss β |
Benchmarked on Apple Silicon Mac with mlx-vlm
Key Features
- β Zero Code Changes Required - Works directly with existing mlx-vlm implementation
- β Production-Ready Performance - 145 tokens/sec generation on Apple Silicon
- β Memory Efficient - Runs comfortably on 8GB unified memory
- β Full Feature Support - All Qianfan-OCR capabilities including Layout-as-Thought
- β 192 Languages - Complete multilingual OCR support
Supported Tasks
All tasks from the original Qianfan-OCR model are fully supported:
- Document Parsing - Image-to-Markdown conversion, multi-page parsing
- Layout Analysis - Bounding box detection, element classification (25 categories)
- Table Recognition - Complex tables with merged cells, HTML output
- Formula Recognition - LaTeX output for inline and display math
- Chart Understanding - Chart QA, trend analysis, data extraction
- Key Information Extraction - Receipts, invoices, certificates, medical records
- Handwriting Recognition - Chinese and English handwritten text
- Scene Text Recognition - Street signs, product labels
- Multilingual OCR - 192 languages including CJK, Arabic, Cyrillic, etc.
Installation
Prerequisites
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- mlx-vlm
Install MLX-VLM
pip install mlx-vlm
Quick Start
Basic Document Parsing
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load 4-bit quantized model
model, processor = load("jason1966/Qianfan-OCR-MLX-4bit", trust_remote_code=True)
config = load_config("jason1966/Qianfan-OCR-MLX-4bit")
# Process image
image = ["your_document.png"]
prompt = "Parse this document to Markdown."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)
# Generate
output = generate(model, processor, formatted_prompt, image, max_tokens=2000)
print(output)
Command Line Usage
python -m mlx_vlm.generate \
--model jason1966/Qianfan-OCR-MLX-4bit \
--max-tokens 2000 \
--prompt "Parse this document to Markdown." \
--image your_document.png \
--trust-remote-code
Layout-as-Thought (Thinking Mode)
Enable structured layout analysis by adding <think> to your prompt:
prompt = "Parse this document to Markdown.<think>"
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)
output = generate(model, processor, formatted_prompt, ["complex_doc.jpg"], max_tokens=2000)
The model will first generate structured layout analysis (bounding boxes, element types, reading order), then produce the final Markdown output.
Key Information Extraction
prompt = "Extract the following fields from the image: Name, Date, Total Amount. Output in standard JSON format."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)
output = generate(model, processor, formatted_prompt, ["invoice.jpg"], max_tokens=2000)
Performance Benchmarks
Speed Comparison (Apple Silicon)
| Operation | Original Model | MLX 4-bit | Speedup |
|---|---|---|---|
| Prefill (prompt processing) | 1,250 tok/s | 1,252 tok/s | 1.00x |
| Generation (output) | 65-69 tok/s | 145 tok/s | 2.11x |
| End-to-End (real-world) | - | - | ~2x faster |
Memory Usage
| Model Variant | Disk Size | Peak Memory | Min. Unified Memory |
|---|---|---|---|
| Original (bfloat16) | 9.5GB | 10.6GB | 16GB recommended |
| MLX 4-bit | 2.9GB | 4.7GB | 8GB sufficient |
Accuracy Verification
We tested the 4-bit model on diverse documents:
| Test Case | Result |
|---|---|
| English technical document | β Perfect - All text, formulas, and tables correctly parsed |
| Chinese invoice | β Perfect - All fields, amounts, and dates extracted accurately |
| Complex multi-column layout | β Perfect - Reading order and structure preserved |
| Handwritten notes | β Perfect - Same quality as original model |
Conclusion: 4-bit quantization achieves lossless OCR accuracy while delivering 2x performance improvement.
Model Architecture
This model inherits the architecture from Qianfan-OCR:
| Component | Details |
|---|---|
| Vision Encoder | InternViT-6B (24 layers, 1024 hidden dim, 448Γ448 patches) |
| Language Model | Qwen3-4B (36 layers, 2560 hidden dim, GQA 32/8 heads) |
| Cross-Modal Adapter | 2-layer MLP with GELU (1024β2560 dim) |
| Total Parameters | ~4.3B |
| Quantization | 4-bit with 5.239 bits per weight (group size optimization) |
| Vocabulary | 153,678 tokens (includes 1000 coordinate tokens <COORD_000>-<COORD_999>) |
Dynamic Resolution
- Base tile size: 448Γ448
- Dynamic patches: 1-12 tiles per image
- Thumbnail support for multi-tile images
- 256 visual tokens per tile (after pixel shuffle downsampling)
Technical Details
Quantization Method
- Technique: MLX 4-bit weight quantization
- Actual Precision: 5.239 bits per weight (better than pure 4-bit)
- Quantization Tool:
mlx_vlm.convert --quantize - Size Reduction: 9.5GB β 2.9GB (69% compression)
MLX Framework Benefits
- Unified Memory: Leverages Apple Silicon's shared GPU/CPU memory architecture
- Metal Acceleration: Native GPU acceleration via Metal API
- Zero-Copy Operations: Efficient memory usage without CPUβGPU transfers
- Lazy Evaluation: Optimized computation graphs
- Native Integration: First-class support for Apple hardware features
Why No Code Changes?
Qianfan-OCR uses the internvl_chat architecture, which mlx-vlm already fully supports:
- β
model_type: "internvl_chat"- Auto-detected by mlx-vlm - β Weight keys match exactly - Direct safetensors loading
- β
Qwen3 support - QK normalization via
attention_bias: false - β
Image processor - Compatible
<img>,</img>,<IMG_CONTEXT>tokens - β
Chat template - Automatically loaded from
chat_template.jinja
Benchmark Results (Original Model)
The base Qianfan-OCR model achieved state-of-the-art results:
OmniDocBench v1.5
- Overall Score: 93.12 (#1 among end-to-end models)
- Beats DeepSeek-OCR-v2 (91.09), Gemini-3 Pro (90.33)
OCR Benchmarks
- OCRBench: 880
- OlmOCR Bench: 79.8 (#1 among end-to-end models)
- CCOCR Overall: 79.3
Key Information Extraction
- Overall Mean: 87.9 (across 5 benchmarks)
- Surpasses Gemini-3.1-Pro, Qwen3-VL-235B-A22B
See original model page for full benchmark details.
Use Cases
1. Document Digitization
- Scan physical documents to editable Markdown
- Preserve complex layouts, tables, and formulas
- 145 tok/s = ~2900 words/min (assuming 20 tokens/word)
2. Invoice Processing
prompt = """Extract all fields from this invoice:
- Invoice number
- Date
- Vendor name
- Line items (description, quantity, price)
- Subtotal, tax, total
Output as JSON."""
3. Research Paper Analysis
prompt = """Parse this academic paper and:
1. Extract title, authors, abstract
2. Convert all formulas to LaTeX
3. Preserve table structures
4. Generate outline from section headings
Output in Markdown."""
4. Multi-language OCR
# Automatically detects and transcribes 192 languages
prompt = "Transcribe all text from this multilingual document."
Limitations
- Apple Silicon Only: Requires M1/M2/M3/M4 Macs with Metal support
- Python 3.10+: Older Python versions not supported by MLX
- MLX Framework: Different ecosystem from PyTorch/Transformers
- Single Image Focus: Multi-page PDF processing requires splitting into images
Citation
@misc{dong2026qianfanocrunifiedendtoendmodel,
title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
year={2026},
eprint={2603.13398},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.13398},
}
Acknowledgments
- Baidu Qianfan Team - For developing the original Qianfan-OCR model
- MLX Team (Apple) - For the efficient MLX framework
- mlx-vlm Contributors - For the excellent VLM inference library
- InternVL Team - For the foundational architecture
License
This model inherits the Apache License 2.0 from the original Qianfan-OCR model.
- Original model: baidu/Qianfan-OCR
- License: Apache-2.0
- Quantization: Performed using open-source mlx-vlm tools
Related Resources
- π¦ Original Qianfan-OCR Model
- π MLX Framework
- π§ MLX-VLM Library
- π Technical Report
- π¬ Demo
- Downloads last month
- 706
4-bit
Model tree for jason1966/Qianfan-OCR-MLX-4bit
Base model
baidu/Qianfan-OCR