Qianfan-OCR MLX 4-bit

Optimized for Apple Silicon (M1/M2/M3/M4)

🤗 Original Model | 📄 Technical Report | 💻 GitHub | 🍎 MLX-VLM

Introduction

This is a 4-bit quantized version of Qianfan-OCR optimized for Apple Silicon using the MLX framework. It delivers 2x faster generation speed with half the memory footprint while maintaining full OCR accuracy.

Qianfan-OCR is a 4B-parameter end-to-end document intelligence model developed by Baidu Qianfan Team, achieving #1 ranking on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8) among end-to-end models.

Why MLX 4-bit?

Metric	Original (bfloat16)	MLX 4-bit	Improvement
Model Size	9.5GB	2.9GB	-69% 🎉
Prefill Speed	~1,250 tok/s	~1,252 tok/s	Maintained
Generation Speed	~65-69 tok/s	145 tok/s	+111% 🚀
Peak Memory	~10.6GB	4.7GB	-56% 💾
OCR Accuracy	Perfect	Perfect	No Loss ✅

Benchmarked on Apple Silicon Mac with mlx-vlm

Key Features

✅ Zero Code Changes Required - Works directly with existing mlx-vlm implementation
✅ Production-Ready Performance - 145 tokens/sec generation on Apple Silicon
✅ Memory Efficient - Runs comfortably on 8GB unified memory
✅ Full Feature Support - All Qianfan-OCR capabilities including Layout-as-Thought
✅ 192 Languages - Complete multilingual OCR support

Supported Tasks

All tasks from the original Qianfan-OCR model are fully supported:

Document Parsing - Image-to-Markdown conversion, multi-page parsing
Layout Analysis - Bounding box detection, element classification (25 categories)
Table Recognition - Complex tables with merged cells, HTML output
Formula Recognition - LaTeX output for inline and display math
Chart Understanding - Chart QA, trend analysis, data extraction
Key Information Extraction - Receipts, invoices, certificates, medical records
Handwriting Recognition - Chinese and English handwritten text
Scene Text Recognition - Street signs, product labels
Multilingual OCR - 192 languages including CJK, Arabic, Cyrillic, etc.

Installation

Prerequisites

macOS with Apple Silicon (M1/M2/M3/M4)
Python 3.10+
mlx-vlm

Install MLX-VLM

pip install mlx-vlm

Quick Start

Basic Document Parsing

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load 4-bit quantized model
model, processor = load("jason1966/Qianfan-OCR-MLX-4bit", trust_remote_code=True)
config = load_config("jason1966/Qianfan-OCR-MLX-4bit")

# Process image
image = ["your_document.png"]
prompt = "Parse this document to Markdown."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)

# Generate
output = generate(model, processor, formatted_prompt, image, max_tokens=2000)
print(output)

Command Line Usage

python -m mlx_vlm.generate \
  --model jason1966/Qianfan-OCR-MLX-4bit \
  --max-tokens 2000 \
  --prompt "Parse this document to Markdown." \
  --image your_document.png \
  --trust-remote-code

Layout-as-Thought (Thinking Mode)

Enable structured layout analysis by adding <think> to your prompt:

prompt = "Parse this document to Markdown.<think>"
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)
output = generate(model, processor, formatted_prompt, ["complex_doc.jpg"], max_tokens=2000)

The model will first generate structured layout analysis (bounding boxes, element types, reading order), then produce the final Markdown output.

Key Information Extraction

prompt = "Extract the following fields from the image: Name, Date, Total Amount. Output in standard JSON format."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)
output = generate(model, processor, formatted_prompt, ["invoice.jpg"], max_tokens=2000)

Performance Benchmarks

Speed Comparison (Apple Silicon)

Operation	Original Model	MLX 4-bit	Speedup
Prefill (prompt processing)	1,250 tok/s	1,252 tok/s	1.00x
Generation (output)	65-69 tok/s	145 tok/s	2.11x
End-to-End (real-world)	-	-	~2x faster

Memory Usage

Model Variant	Disk Size	Peak Memory	Min. Unified Memory
Original (bfloat16)	9.5GB	10.6GB	16GB recommended
MLX 4-bit	2.9GB	4.7GB	8GB sufficient

Accuracy Verification

We tested the 4-bit model on diverse documents:

Test Case	Result
English technical document	✅ Perfect - All text, formulas, and tables correctly parsed
Chinese invoice	✅ Perfect - All fields, amounts, and dates extracted accurately
Complex multi-column layout	✅ Perfect - Reading order and structure preserved
Handwritten notes	✅ Perfect - Same quality as original model

Conclusion: 4-bit quantization achieves lossless OCR accuracy while delivering 2x performance improvement.

Model Architecture

This model inherits the architecture from Qianfan-OCR:

Component	Details
Vision Encoder	InternViT-6B (24 layers, 1024 hidden dim, 448×448 patches)
Language Model	Qwen3-4B (36 layers, 2560 hidden dim, GQA 32/8 heads)
Cross-Modal Adapter	2-layer MLP with GELU (1024→2560 dim)
Total Parameters	~4.3B
Quantization	4-bit with 5.239 bits per weight (group size optimization)
Vocabulary	153,678 tokens (includes 1000 coordinate tokens `<COORD_000>`-`<COORD_999>`)

Dynamic Resolution

Base tile size: 448×448
Dynamic patches: 1-12 tiles per image
Thumbnail support for multi-tile images
256 visual tokens per tile (after pixel shuffle downsampling)

Technical Details

Quantization Method

Technique: MLX 4-bit weight quantization
Actual Precision: 5.239 bits per weight (better than pure 4-bit)
Quantization Tool: mlx_vlm.convert --quantize
Size Reduction: 9.5GB → 2.9GB (69% compression)

MLX Framework Benefits

Unified Memory: Leverages Apple Silicon's shared GPU/CPU memory architecture
Metal Acceleration: Native GPU acceleration via Metal API
Zero-Copy Operations: Efficient memory usage without CPU↔GPU transfers
Lazy Evaluation: Optimized computation graphs
Native Integration: First-class support for Apple hardware features

Why No Code Changes?

Qianfan-OCR uses the internvl_chat architecture, which mlx-vlm already fully supports:

✅ model_type: "internvl_chat" - Auto-detected by mlx-vlm
✅ Weight keys match exactly - Direct safetensors loading
✅ Qwen3 support - QK normalization via attention_bias: false
✅ Image processor - Compatible <img>, </img>, <IMG_CONTEXT> tokens
✅ Chat template - Automatically loaded from chat_template.jinja

Benchmark Results (Original Model)

The base Qianfan-OCR model achieved state-of-the-art results:

OmniDocBench v1.5

Overall Score: 93.12 (#1 among end-to-end models)
Beats DeepSeek-OCR-v2 (91.09), Gemini-3 Pro (90.33)

OCR Benchmarks

OCRBench: 880
OlmOCR Bench: 79.8 (#1 among end-to-end models)
CCOCR Overall: 79.3

Key Information Extraction

Overall Mean: 87.9 (across 5 benchmarks)
Surpasses Gemini-3.1-Pro, Qwen3-VL-235B-A22B

See original model page for full benchmark details.

Use Cases

1. Document Digitization

Scan physical documents to editable Markdown
Preserve complex layouts, tables, and formulas
145 tok/s = ~2900 words/min (assuming 20 tokens/word)

2. Invoice Processing

prompt = """Extract all fields from this invoice:
- Invoice number
- Date
- Vendor name
- Line items (description, quantity, price)
- Subtotal, tax, total
Output as JSON."""

3. Research Paper Analysis

prompt = """Parse this academic paper and:
1. Extract title, authors, abstract
2. Convert all formulas to LaTeX
3. Preserve table structures
4. Generate outline from section headings
Output in Markdown."""

4. Multi-language OCR

# Automatically detects and transcribes 192 languages
prompt = "Transcribe all text from this multilingual document."

Limitations

Apple Silicon Only: Requires M1/M2/M3/M4 Macs with Metal support
Python 3.10+: Older Python versions not supported by MLX
MLX Framework: Different ecosystem from PyTorch/Transformers
Single Image Focus: Multi-page PDF processing requires splitting into images

Citation

@misc{dong2026qianfanocrunifiedendtoendmodel,
  title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
  author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
  year={2026},
  eprint={2603.13398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.13398},
}

Acknowledgments

Baidu Qianfan Team - For developing the original Qianfan-OCR model
MLX Team (Apple) - For the efficient MLX framework
mlx-vlm Contributors - For the excellent VLM inference library
InternVL Team - For the foundational architecture

License

This model inherits the Apache License 2.0 from the original Qianfan-OCR model.

Original model: baidu/Qianfan-OCR
License: Apache-2.0
Quantization: Performed using open-source mlx-vlm tools

Related Resources

Downloads last month: 706

Safetensors

Model size

1.0B params

Tensor type

BF16

U32

MLX

Hardware compatibility

4-bit

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for jason1966/Qianfan-OCR-MLX-4bit

Base model

baidu/Qianfan-OCR

Quantized

(5)

this model

Space using jason1966/Qianfan-OCR-MLX-4bit 1

Paper for jason1966/Qianfan-OCR-MLX-4bit

Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Paper • 2603.13398 • Published about 1 month ago • 153