
LFM2-700M

LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in terms of quality, speed, and memory efficiency.

We're releasing the weights of three post-trained checkpoints with 350M, 700M, and 1.2B parameters. They provide the following key features to create AI-powered edge applications:

  • Fast training & inference – LFM2 achieves 3x faster training compared to its previous generation. It also benefits from 2x faster decode and prefill speed on CPU compared to Qwen3.
  • Best performance – LFM2 outperforms similarly-sized models across multiple benchmark categories, including knowledge, mathematics, instruction following, and multilingual capabilities.
  • New architecture – LFM2 is a new hybrid Liquid model with multiplicative gates and short convolutions.
  • Flexible deployment – LFM2 runs efficiently on CPU, GPU, and NPU hardware for flexible deployment on smartphones, laptops, or vehicles.

Find more information about LFM2 in our blog post.

📄 Model details

Due to their small size, we recommend fine-tuning LFM2 models on narrow use cases to maximize performance. They are particularly suited for agentic tasks, data extraction, RAG, creative writing, and multi-turn conversations. However, we do not recommend using them for tasks that are knowledge-intensive or require programming skills.
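Since narrow-domain fine-tuning is the recommended path, here is a minimal sketch of what such a run could look like with the TRL library. The base checkpoint id (LiquidAI/LFM2-700M) and the placeholder dataset below are assumptions for illustration, not part of this card:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder chat-format dataset; substitute your own narrow-domain data.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="LiquidAI/LFM2-700M",  # assumed PyTorch base checkpoint (not the ONNX export)
    train_dataset=dataset,
    args=SFTConfig(output_dir="lfm2-700m-sft"),
)
trainer.train()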

Property          Value
Parameters        742,489,344
Layers            16 (10 conv + 6 attn)
Context length    32,768 tokens
Vocabulary size   65,536
Precision         bfloat16
Training budget   10 trillion tokens
License           LFM Open License v1.0

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

Generation parameters: We recommend the following sampling settings:

  • temperature=0.3
  • min_p=0.15
  • repetition_penalty=1.05
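As a concrete reference, these settings map directly onto Hugging Face transformers generation arguments. The snippet below is a sketch that assumes the PyTorch base checkpoint LiquidAI/LFM2-700M rather than the ONNX export in this repo:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-700M"  # assumed PyTorch base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    add_generation_prompt=True,
    return_tensors="pt",
)
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.3,          # recommended
    min_p=0.15,               # recommended
    repetition_penalty=1.05,  # recommended
    max_new_tokens=512,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))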

Architecture: Hybrid model with multiplicative gates and short convolutions: 10 double-gated short-range LIV convolution blocks and 6 grouped query attention (GQA) blocks.
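For intuition, the double-gated convolution block can be pictured with a rough PyTorch sketch. The fused projection layout, kernel size, and causal trimming below are illustrative assumptions, not the exact LFM2 operator:

import torch
import torch.nn as nn

class DoubleGatedShortConv(nn.Module):
    """Hypothetical double-gated short-convolution block (illustrative only)."""

    def __init__(self, hidden_size: int, kernel_size: int = 3):
        super().__init__()
        # Fused projection producing two gates and the conv input (assumed layout).
        self.in_proj = nn.Linear(hidden_size, 3 * hidden_size)
        # Depthwise short convolution over the sequence dimension.
        self.conv = nn.Conv1d(hidden_size, hidden_size, kernel_size,
                              groups=hidden_size, padding=kernel_size - 1)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        b_gate, c_gate, h = self.in_proj(x).chunk(3, dim=-1)
        h = b_gate * h                            # first multiplicative gate
        h = self.conv(h.transpose(1, 2))          # (batch, hidden, seq + k - 1)
        h = h[..., : x.shape[1]].transpose(1, 2)  # trim right padding -> causal
        h = c_gate * h                            # second multiplicative gate
        return self.out_proj(h)

The GQA blocks are standard grouped-query attention, so only the convolution block is sketched here.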

Pre-training mixture: Approximately 75% English, 20% multilingual, and 5% code data sourced from the web and licensed materials.

Training approach:

  • Knowledge distillation with LFM1-7B as the teacher model
  • Very large-scale SFT on a 50/50 mix of downstream tasks and general domains
  • Custom DPO with length normalization and semi-online datasets
  • Iterative model merging

πŸƒ How to run LFM2

Transformers.js

If you haven't already, you can install the Transformers.js JavaScript library from NPM using:

npm i @huggingface/transformers

Example: Basic usage

import { pipeline, TextStreamer } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/LFM2-700M-ONNX",
  { dtype: "q4", device: "webgpu" },
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is the capital of France?" },
];

// Generate a response
const output = await generator(messages, {
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: true }),
});
console.log(output[0].generated_text.at(-1).content);
// The capital of France is Paris.

Example: Tool calling

import { pipeline, TextStreamer } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/LFM2-700M-ONNX",
  { dtype: "q4", device: "webgpu" },
);

// Define the tools available to the model
const tools = [
  {
    name: "get_weather",
    description: "Get current weather information for a location",
    parameters: {
      type: "object",
      properties: {
        location: {
          type: "string",
          description: "The city and state, e.g. San Francisco, CA",
        },
        unit: {
          type: "string",
          enum: ["celsius", "fahrenheit"],
          description: "The unit of temperature to use",
        },
      },
      required: ["location"],
    },
  },
];

// Define the list of messages
const messages = [
  { role: "user", content: "What's the weather like in New York?" },
];

// Generate a response
const output = await generator(messages, {
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: true }),
  tokenizer_encode_kwargs: { tools },
});
console.log(output[0].generated_text.at(-1).content);
// [get_weather(location="New York", unit="fahrenheit")]

ONNX Runtime

from transformers import AutoConfig, AutoTokenizer
import onnxruntime
import numpy as np
from huggingface_hub import snapshot_download

# 1. Load config and tokenizer, then create the inference session
model_id = "onnx-community/LFM2-700M-ONNX"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
eos_token_id = config.eos_token_id

filename = "model.onnx" # Options: "model.onnx", "model_fp16.onnx", "model_q4.onnx", "model_q4f16.onnx"
model_path = snapshot_download(repo_id=model_id, allow_patterns=f"onnx/{filename}*") # Download the graph + weights
session = onnxruntime.InferenceSession(f"{model_path}/onnx/{filename}")

# 2. Prepare inputs
prompt = "What is C. elegans?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np")
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
batch_size = input_ids.shape[0]
num_logits_to_keep = np.array(1, dtype=np.int64)

past_cache_values = {}
for inp in session.get_inputs():
    name = inp.name
    shape = inp.shape
    dtype = np.float32 if inp.type == "tensor(float)" else np.float16
    if name.startswith("past_key_values"):
        # Attention KV cache: shape [batch_size, num_kv_heads, 0, head_dim]
        past_cache_values[name] = np.zeros([batch_size, shape[1], 0, shape[3]], dtype=dtype)
    elif name.startswith("past_conv"):
        # Conv cache: shape [batch_size, hidden_size, conv_L_cache]
        past_cache_values[name] = np.zeros([batch_size, shape[1], shape[2]], dtype=dtype)

# 3. Generation loop
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
    logits, *present_cache_values = session.run(None, dict(
        input_ids=input_ids,
        attention_mask=attention_mask,
        num_logits_to_keep=num_logits_to_keep,
        **past_cache_values,
    ))

    # Update values for the next iteration
    input_ids = logits[:, -1].argmax(-1, keepdims=True)
    attention_mask = np.concatenate([attention_mask, np.ones_like(input_ids, dtype=np.int64)], axis=-1)
    for j, key in enumerate(past_cache_values):
        past_cache_values[key] = present_cache_values[j]
    generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
    if np.isin(input_ids, eos_token_id).any():
        break

    # (Optional) Streaming
    print(tokenizer.decode(input_ids[0]), end='', flush=True)
print()

# 4. Output result
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])