About
Mistral on Edge presents les Ministraux — Ministral 3B and Ministral 8B — two state-of-the-art small language models purpose-built for edge and on-device deployments. Released on the first anniversary of Mistral 7B, these models set a new benchmark in the sub-10B category across knowledge, commonsense reasoning, function-calling, and inference efficiency. Both support up to 128k context length and are designed for latency-sensitive, privacy-first applications; Ministral 8B additionally features a special interleaved sliding-window attention pattern for faster, memory-efficient inference. Both models outperform peers such as Gemma 2 and Llama 3.2 in their respective size classes.

Key use cases include on-device translation, internet-less smart assistants, local analytics, and autonomous robotics. The models are also designed to serve as efficient intermediaries in multi-step agentic workflows alongside larger models like Mistral Large — handling input parsing, task routing, and API calls at extremely low latency and cost.

Both models are available via Mistral's API (la Plateforme) at $0.04/M and $0.1/M tokens respectively, and Ministral 8B weights are available for research use. They are ideal for developers, enterprises, and researchers seeking compute-efficient AI at the edge.
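As a rough sketch of what calling these models via la Plateforme might look like, the snippet below builds the JSON body for a single-turn chat completion. The endpoint URL and model alias reflect Mistral's published API conventions, but field names and aliases should be verified against the official API reference before use.

```python
import json

# Hypothetical sketch of a chat-completion request body for Mistral's
# la Plateforme API (an OpenAI-style JSON payload). Verify the endpoint,
# model alias, and field names against Mistral's API documentation.
API_URL = "https://api.mistral.ai/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Return the JSON body for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = build_chat_request("ministral-8b-latest", "Translate 'bonjour' to English.")
print(json.dumps(body, indent=2))
```

Sending this body as a POST with an `Authorization: Bearer <API key>` header is all that is needed; the same payload shape works for both Ministral 3B and 8B by swapping the model alias.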
Key Features
- Sub-10B State-of-the-Art Performance: Ministral 3B and 8B consistently outperform competitors like Gemma 2 and Llama 3.2 across knowledge, reasoning, and function-calling benchmarks in their size class.
- 128k Context Length: Both models support up to 128k tokens of context, enabling long document understanding and complex multi-turn conversations on edge devices.
- Memory-Efficient Inference: Ministral 8B features an interleaved sliding-window attention pattern for faster and more memory-efficient inference, ideal for resource-constrained environments.
- Agentic Workflow Integration: Les Ministraux serve as efficient intermediaries in multi-step agentic pipelines, handling input parsing, task routing, and API calls alongside larger models.
- Privacy-First On-Device Deployment: Designed for local, offline-capable inference supporting critical applications like on-device translation, internet-less assistants, and autonomous robotics.
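The "efficient intermediary" pattern above can be sketched as a simple router: short, structured tasks (parsing, classification, API-call dispatch) go to the small edge model, while open-ended requests escalate to a larger model. The model aliases and the keyword heuristic below are illustrative assumptions, not Mistral's actual routing logic.

```python
# Illustrative sketch of routing in a multi-step agentic pipeline:
# structured, low-latency work goes to a Ministral-class model, and
# open-ended generation escalates to a larger model. Aliases and the
# heuristic are assumptions for illustration only.

EDGE_MODEL = "ministral-3b-latest"    # assumed alias for the edge tier
LARGE_MODEL = "mistral-large-latest"  # assumed alias for escalation

STRUCTURED_KEYWORDS = ("extract", "parse", "route", "call", "classify")

def choose_model(task: str, max_edge_words: int = 40) -> str:
    """Pick the edge model for short, structured tasks; escalate otherwise."""
    words = task.lower().split()
    structured = any(k in words for k in STRUCTURED_KEYWORDS)
    return EDGE_MODEL if structured and len(words) <= max_edge_words else LARGE_MODEL

print(choose_model("parse this invoice into JSON fields"))    # edge model
print(choose_model("write a detailed market analysis of EV adoption in Europe"))  # large model
```

In practice the router itself could be a Ministral function-calling step, with only the escalation path incurring large-model latency and cost.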
Use Cases
- On-device translation for offline or privacy-sensitive mobile and desktop applications
- Internet-less smart assistants running locally on edge hardware without cloud dependency
- Local data analytics and business intelligence on-premise without sending data to the cloud
- Autonomous robotics requiring low-latency, real-time decision-making at the edge
- Efficient task routing and API orchestration in multi-step agentic AI workflows alongside larger models
Pros
- Best-in-Class Edge Performance: Outperforms competing models such as Llama 3.2 3B and Gemma 2 across multiple benchmarks while staying within the sub-10B parameter budget.
- Ultra-Low API Pricing: At $0.04/M tokens for Ministral 3B and $0.1/M for Ministral 8B, these models offer exceptional cost efficiency for high-volume or latency-sensitive workloads.
- Research Access to Model Weights: Ministral 8B Instruct weights are available for research use, enabling fine-tuning and experimentation without full commercial licensing.
- Long Context Support: 128k context length is rare in the sub-10B space and enables sophisticated use cases like document summarization and multi-turn agentic tasks.
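The pricing quoted above is easy to reason about concretely. The sketch below applies the listed rates, assuming a single flat rate covering input and output tokens as the FAQ states.

```python
# Cost estimate using the per-token prices quoted above:
# $0.04 per million tokens for Ministral 3B, $0.10 per million for 8B.
# Assumes one flat rate for input and output tokens combined.

PRICE_PER_M = {"ministral-3b": 0.04, "ministral-8b": 0.10}

def estimate_cost(model: str, total_tokens: int) -> float:
    """Cost in USD for the given number of tokens (input + output)."""
    return PRICE_PER_M[model] * total_tokens / 1_000_000

# e.g. pushing 10M tokens through each model:
print(f"3B: ${estimate_cost('ministral-3b', 10_000_000):.2f}")  # $0.40
print(f"8B: ${estimate_cost('ministral-8b', 10_000_000):.2f}")  # $1.00
```

At these rates even high-volume agentic workloads — millions of routing and parsing calls — stay in the range of cents to a few dollars.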
Cons
- Commercial License Required for Self-Deployment: Using the models in production outside of la Plateforme requires a commercial license, which must be negotiated directly with Mistral AI.
- Limited vLLM Context Window: On vLLM, the context length is currently capped at 32k rather than the full 128k, which may be limiting for open-source self-hosted deployments.
- Ministral 3B Weights Not Publicly Available: Unlike Ministral 8B, the Ministral 3B model weights are not released for research use, restricting fine-tuning and local experimentation.
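For self-hosted experimentation with the research weights, a vLLM launch might look like the fragment below. This is a configuration sketch: the Hugging Face repo id matches the published research release, but the flags should be checked against the vLLM CLI documentation, and the context cap reflects the 32k vLLM limitation noted above.

```shell
# Sketch: serving the Ministral 8B research weights with vLLM,
# capping context at 32k per the current vLLM limitation.
# Verify flag names against the vLLM CLI docs for your version.
vllm serve mistralai/Ministral-8B-Instruct-2410 \
  --tokenizer-mode mistral \
  --max-model-len 32768
```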
Frequently Asked Questions
What are Ministral 3B and Ministral 8B?
They are two state-of-the-art small language models from Mistral AI, optimized for on-device and edge computing. They excel in reasoning, knowledge, function-calling, and efficiency within the sub-10B parameter class.
How much do les Ministraux cost?
Ministral 3B is priced at $0.04 per million tokens (input and output), and Ministral 8B at $0.1 per million tokens, both available on Mistral's la Plateforme.
Are the model weights available?
Ministral 8B Instruct weights are available for research use. For commercial self-deployment of either model, you need to contact Mistral AI for a commercial license.
What context length do the models support?
Both models support up to 128k context length. Note that on vLLM, context is currently limited to 32k tokens.
What are the main use cases?
They are designed for on-device translation, internet-less smart assistants, local analytics, autonomous robotics, and as efficient function-calling intermediaries in multi-step agentic workflows.