Qianfan-OCR MLX 4-bit

Optimized for Apple Silicon (M1/M2/M3/M4)

πŸ€— Original Model | πŸ“„ Technical Report | πŸ’» GitHub | 🍎 MLX-VLM

Introduction

This is a 4-bit quantized version of Qianfan-OCR optimized for Apple Silicon using the MLX framework. It delivers roughly 2x faster generation with about half the memory footprint, with no OCR accuracy loss observed in our tests.

Qianfan-OCR is a 4B-parameter end-to-end document intelligence model developed by Baidu Qianfan Team, achieving #1 ranking on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8) among end-to-end models.

Why MLX 4-bit?

| Metric | Original (bfloat16) | MLX 4-bit | Improvement |
|---|---|---|---|
| Model Size | 9.5GB | 2.9GB | -69% πŸŽ‰ |
| Prefill Speed | ~1,250 tok/s | ~1,252 tok/s | Maintained |
| Generation Speed | ~65-69 tok/s | 145 tok/s | +111% πŸš€ |
| Peak Memory | ~10.6GB | 4.7GB | -56% πŸ’Ύ |
| OCR Accuracy | Perfect | Perfect | No Loss βœ… |

Benchmarked on Apple Silicon Mac with mlx-vlm

Key Features

  • βœ… Zero Code Changes Required - Works directly with existing mlx-vlm implementation
  • βœ… Production-Ready Performance - 145 tokens/sec generation on Apple Silicon
  • βœ… Memory Efficient - Runs comfortably on 8GB unified memory
  • βœ… Full Feature Support - All Qianfan-OCR capabilities including Layout-as-Thought
  • βœ… 192 Languages - Complete multilingual OCR support

Supported Tasks

All tasks from the original Qianfan-OCR model are fully supported:

  • Document Parsing - Image-to-Markdown conversion, multi-page parsing
  • Layout Analysis - Bounding box detection, element classification (25 categories)
  • Table Recognition - Complex tables with merged cells, HTML output
  • Formula Recognition - LaTeX output for inline and display math
  • Chart Understanding - Chart QA, trend analysis, data extraction
  • Key Information Extraction - Receipts, invoices, certificates, medical records
  • Handwriting Recognition - Chinese and English handwritten text
  • Scene Text Recognition - Street signs, product labels
  • Multilingual OCR - 192 languages including CJK, Arabic, Cyrillic, etc.

Installation

Prerequisites

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • mlx-vlm

Install MLX-VLM

```shell
pip install mlx-vlm
```

Quick Start

Basic Document Parsing

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the 4-bit quantized model
model, processor = load("jason1966/Qianfan-OCR-MLX-4bit", trust_remote_code=True)
config = load_config("jason1966/Qianfan-OCR-MLX-4bit")

# Process image
image = ["your_document.png"]
prompt = "Parse this document to Markdown."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)

# Generate
output = generate(model, processor, formatted_prompt, image, max_tokens=2000)
print(output)
```

Command Line Usage

```shell
python -m mlx_vlm.generate \
  --model jason1966/Qianfan-OCR-MLX-4bit \
  --max-tokens 2000 \
  --prompt "Parse this document to Markdown." \
  --image your_document.png \
  --trust-remote-code
```

Layout-as-Thought (Thinking Mode)

Enable structured layout analysis by adding <think> to your prompt:

```python
prompt = "Parse this document to Markdown.<think>"
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)
output = generate(model, processor, formatted_prompt, ["complex_doc.jpg"], max_tokens=2000)
```

The model will first generate structured layout analysis (bounding boxes, element types, reading order), then produce the final Markdown output.
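If you want to handle the layout analysis and the final Markdown separately, you can split the response at the closing tag. A minimal sketch, assuming the analysis is wrapped in `<think>...</think>` (mirroring the `<think>` trigger above; the exact output format may differ):

```python
def split_thinking(output: str) -> tuple[str, str]:
    """Split a model response into (layout_analysis, markdown).

    Assumes the layout analysis is wrapped in <think>...</think>;
    if no closing tag is found, the whole output is treated as the
    final answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    if close_tag in output:
        head, _, tail = output.partition(close_tag)
        analysis = head.replace(open_tag, "", 1).strip()
        return analysis, tail.strip()
    return "", output.strip()

# Example with a synthetic response:
analysis, markdown = split_thinking(
    "<think>[region 0: title @ (12, 8, 420, 40)]</think># Quarterly Report"
)
print(markdown)  # "# Quarterly Report"
```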

Key Information Extraction

```python
prompt = "Extract the following fields from the image: Name, Date, Total Amount. Output in standard JSON format."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)
output = generate(model, processor, formatted_prompt, ["invoice.jpg"], max_tokens=2000)
```
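The model returns plain text, so the JSON may arrive wrapped in a Markdown code fence. A small helper (hypothetical, not part of mlx-vlm) can normalize that before parsing:

```python
import json
import re

# Triple-backtick fence marker, built indirectly so it does not
# clash with this document's own code fences.
FENCE = "`" * 3

def parse_json_output(output: str) -> dict:
    """Parse JSON from model output, stripping an optional Markdown code fence."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, output, re.DOTALL)
    text = match.group(1) if match else output
    return json.loads(text)

fenced = FENCE + 'json\n{"Name": "ACME", "Total Amount": "42.00"}\n' + FENCE
print(parse_json_output(fenced)["Name"])  # ACME
```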

Performance Benchmarks

Speed Comparison (Apple Silicon)

| Operation | Original Model | MLX 4-bit | Speedup |
|---|---|---|---|
| Prefill (prompt processing) | 1,250 tok/s | 1,252 tok/s | 1.00x |
| Generation (output) | 65-69 tok/s | 145 tok/s | 2.11x |
| End-to-End (real-world) | - | - | ~2x faster |

Memory Usage

| Model Variant | Disk Size | Peak Memory | Min. Unified Memory |
|---|---|---|---|
| Original (bfloat16) | 9.5GB | 10.6GB | 16GB recommended |
| MLX 4-bit | 2.9GB | 4.7GB | 8GB sufficient |

Accuracy Verification

We tested the 4-bit model on diverse documents:

| Test Case | Result |
|---|---|
| English technical document | βœ… All text, formulas, and tables correctly parsed |
| Chinese invoice | βœ… All fields, amounts, and dates extracted accurately |
| Complex multi-column layout | βœ… Reading order and structure preserved |
| Handwritten notes | βœ… Same quality as original model |

Conclusion: On these test cases, 4-bit quantization showed no measurable loss in OCR accuracy while delivering roughly 2x faster generation.

Model Architecture

This model inherits the architecture from Qianfan-OCR:

| Component | Details |
|---|---|
| Vision Encoder | InternViT-6B (24 layers, 1024 hidden dim, 448Γ—448 patches) |
| Language Model | Qwen3-4B (36 layers, 2560 hidden dim, GQA 32/8 heads) |
| Cross-Modal Adapter | 2-layer MLP with GELU (1024β†’2560 dim) |
| Total Parameters | ~4.3B |
| Quantization | 4-bit with 5.239 bits per weight (group size optimization) |
| Vocabulary | 153,678 tokens (includes 1000 coordinate tokens `<COORD_000>`-`<COORD_999>`) |
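The coordinate tokens suggest that bounding boxes are emitted as quantized positions. A hypothetical decoder, assuming (this convention is not documented here) that each `<COORD_n>` encodes a position normalized to a 0-999 grid and that boxes arrive as four consecutive tokens:

```python
import re

def decode_coords(text: str, width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Decode <COORD_n> quadruples into pixel boxes (x1, y1, x2, y2).

    Assumes boxes are emitted as four consecutive coordinate tokens,
    each normalized to a 0-999 grid; this is an assumption, not
    documented behavior.
    """
    values = [int(v) for v in re.findall(r"<COORD_(\d{3})>", text)]
    boxes = []
    for x1, y1, x2, y2 in zip(*[iter(values)] * 4):
        boxes.append((x1 * width // 999, y1 * height // 999,
                      x2 * width // 999, y2 * height // 999))
    return boxes

print(decode_coords("<COORD_000><COORD_000><COORD_999><COORD_999>", 448, 448))
# [(0, 0, 448, 448)]
```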

Dynamic Resolution

  • Base tile size: 448Γ—448
  • Dynamic patches: 1-12 tiles per image
  • Thumbnail support for multi-tile images
  • 256 visual tokens per tile (after pixel shuffle downsampling)

Technical Details

Quantization Method

  • Technique: MLX 4-bit weight quantization
  • Actual Precision: 5.239 bits per weight (better than pure 4-bit)
  • Quantization Tool: mlx_vlm.convert --quantize
  • Size Reduction: 9.5GB β†’ 2.9GB (69% compression)

MLX Framework Benefits

  • Unified Memory: Leverages Apple Silicon's shared GPU/CPU memory architecture
  • Metal Acceleration: Native GPU acceleration via Metal API
  • Zero-Copy Operations: Efficient memory usage without CPU↔GPU transfers
  • Lazy Evaluation: Optimized computation graphs
  • Native Integration: First-class support for Apple hardware features

Why No Code Changes?

Qianfan-OCR uses the internvl_chat architecture, which mlx-vlm already fully supports:

  1. βœ… model_type: "internvl_chat" - Auto-detected by mlx-vlm
  2. βœ… Weight keys match exactly - Direct safetensors loading
  3. βœ… Qwen3 support - QK normalization via attention_bias: false
  4. βœ… Image processor - Compatible <img>, </img>, <IMG_CONTEXT> tokens
  5. βœ… Chat template - Automatically loaded from chat_template.jinja

Benchmark Results (Original Model)

The base Qianfan-OCR model achieved state-of-the-art results:

OmniDocBench v1.5

  • Overall Score: 93.12 (#1 among end-to-end models)
  • Beats DeepSeek-OCR-v2 (91.09), Gemini-3 Pro (90.33)

OCR Benchmarks

  • OCRBench: 880
  • OlmOCR Bench: 79.8 (#1 among end-to-end models)
  • CCOCR Overall: 79.3

Key Information Extraction

  • Overall Mean: 87.9 (across 5 benchmarks)
  • Surpasses Gemini-3.1-Pro, Qwen3-VL-235B-A22B

See original model page for full benchmark details.

Use Cases

1. Document Digitization

  • Scan physical documents to editable Markdown
  • Preserve complex layouts, tables, and formulas
  • 145 tok/s = ~2900 words/min (assuming 20 tokens/word)

2. Invoice Processing

```python
prompt = """Extract all fields from this invoice:
- Invoice number
- Date
- Vendor name
- Line items (description, quantity, price)
- Subtotal, tax, total
Output as JSON."""
```

3. Research Paper Analysis

```python
prompt = """Parse this academic paper and:
1. Extract title, authors, abstract
2. Convert all formulas to LaTeX
3. Preserve table structures
4. Generate outline from section headings
Output in Markdown."""
```

4. Multi-language OCR

```python
# Automatically detects and transcribes 192 languages
prompt = "Transcribe all text from this multilingual document."
```

Limitations

  • Apple Silicon Only: Requires M1/M2/M3/M4 Macs with Metal support
  • Python 3.10+: Older Python versions not supported by MLX
  • MLX Framework: Different ecosystem from PyTorch/Transformers
  • Single Image Focus: Multi-page PDF processing requires splitting into images

Citation

@misc{dong2026qianfanocrunifiedendtoendmodel,
  title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
  author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
  year={2026},
  eprint={2603.13398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.13398},
}

Acknowledgments

  • Baidu Qianfan Team - For developing the original Qianfan-OCR model
  • MLX Team (Apple) - For the efficient MLX framework
  • mlx-vlm Contributors - For the excellent VLM inference library
  • InternVL Team - For the foundational architecture

License

This model inherits the Apache License 2.0 from the original Qianfan-OCR model.

  • Original model: baidu/Qianfan-OCR
  • License: Apache-2.0
  • Quantization: Performed using open-source mlx-vlm tools
