Cohere cracks lossless quantization and native citations with first full Apache 2.0 licensed open model Command A+
AI Summary
Cohere, a Canadian AI lab, has released Command A+, a highly efficient, 218-billion-parameter sparse mixture-of-experts language model under an open-source Apache 2.0 license. This model offers significant compute and energy efficiency gains and supports sovereign AI by allowing secure enterprise deployment.
Canadian AI lab Cohere made waves recently by announcing a merger with German AI startup Aleph Alpha, but now it has even more in store for enterprise builders around the globe. Today, the firm co-founded by former Googler and "Attention Is All You Need" co-author Aidan Gomez unveiled Command A+, a highly optimized, 218-billion-parameter language model engineered specifically for complex reasoning, multimodal document processing, and agentic workflows.
The most significant aspect of the release is not just the model’s capabilities; it is its accessibility. Cohere has released the model weights for free on the popular AI code sharing repository Hugging Face under a highly permissive Apache 2.0 open-source license, a first for the company according to a post on X by Gomez, now Cohere's CEO. In doing so, Cohere is making a calculated bet on "sovereign AI": the thesis that enterprises, governments, and developers should have the ability to run, control, and adapt frontier-grade AI entirely within their own secure environments, without sacrificing performance.
Sparse architecture with extreme quantization
At the architectural level, Command A+ represents a major evolution from Cohere’s previous dense models. It is a decoder-only Sparse Mixture-of-Experts (MoE) Transformer. While the model houses a relatively modest 218 billion total parameters, even fewer, only 25 billion, are active during any given generation step. That makes for a much lighter footprint and requires far fewer compute resources at inference (serving the model in production environments to end users or via agents) than proprietary models from U.S. giants such as OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7, which third-party observers estimate to be in the trillions of parameters.
This sparse architecture is the key to the model’s efficiency. In plain terms, an MoE model routes each incoming token only to the specific "expert" neural networks best suited to handle it, leaving the rest of the model dormant.
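To make the routing idea concrete, here is a minimal toy sketch of top-k expert routing in plain Python. It is an illustration only, not Cohere's implementation: the expert count (8), the top-k value (2), the scalar "experts," and the hand-written `toy_router` are all hypothetical stand-ins, and in a real MoE layer the router is a learned network operating on token vectors.

```python
import math


def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and normalize their weights."""
    ranked = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )
    chosen = ranked[:k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, router, k):
    """Run only the selected experts; all others are never called."""
    routes = route_top_k(router(x), k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


# Hypothetical toy setup: 8 scalar "experts", 2 active per input.
NUM_EXPERTS = 8
TOP_K = 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]


def toy_router(x):
    """Stand-in for a learned router: prefers experts whose index is near x."""
    return [-abs(x - i) for i in range(NUM_EXPERTS)]


output, active = moe_forward(3.0, experts, toy_router, TOP_K)
print(f"active experts: {active} ({len(active)} of {NUM_EXPERTS} ran)")
```

The same principle, at vastly larger scale, is what lets a model activate only a fraction of its parameters per step. Using the article's figures, 25 billion active out of 218 billion total works out to roughly 11.5% of the parameters per generation step.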
This routing approach is a familiar formulation, followed by most leading LLMs these days. It allows models to retain the vast knowledge base and nuanced reasoning capabilities of a giant, but at the faster speeds and reduced compute and energy requirements of a much smaller model, since only a fraction of the parameters are activated at any time.
Where Cohere has gone a step beyond most with Command A+ is its heavy focus on hardware efficiency through quantization, a process that compresses the model's memory footprint by reducing the precision of its parameters. Command A+ is available in 16-bit (BF16), 8-bit (FP8), and a highly compressed 4-bit format (W4A4, shorthand for 4-bit weights and 4-bit activations).
The W4A4 quantization is the technical centerpiece of this release. Typically, reasoning models suffer an outsized "quantization tax," where compressing the model leads to visible regressions in complex problem-solving. Cohere mitigated this by quantizing only the MoE experts to 4-bit while keeping the critical attention pathways at full precision, supplemented by a technique called Quantization-Aware Distillation. The result is a nearly lossless compression that allows this massive model to run on a single NVIDIA Blackwell B200 GPU or just two NVIDIA H100 GPUs.
The speed gains are equally notable. According to performance data released by the company, the W4A4 quantization at low concurrency achieves 375 tokens per second (TOPS) with a Time-to-First-Token (TTFT) latency of just 113 milliseconds, representing up to a 63% increase in output speed and a 17% reduction in latency compared to the previous Command A Reasoning model.
Furthermore, Cohere has overhauled the model's tokenizer. Tokenizers break text down into the fragments that AI models process. The new tokenizer is highly optimized for global enterprise use, featuring native support for 48 languages.
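As a rough sanity check on the quantization figures discussed above, the weight memory at each precision can be estimated with simple arithmetic: parameters times bits per parameter, divided by eight. The sketch below is a back-of-the-envelope estimate only. The 218 billion parameter count comes from the article; the 95% expert share, the use of 16 bits for "full precision," and the 80 GB H100 capacity are assumptions made for illustration, and the estimate ignores the KV cache, activations, and quantization scale overhead.

```python
def weight_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def mixed_weight_gb(total_params, expert_fraction, expert_bits, other_bits):
    """Weights when experts and the rest of the model use different precisions."""
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    return weight_gb(expert_params, expert_bits) + weight_gb(other_params, other_bits)


TOTAL_PARAMS = 218e9    # total parameter count reported in the article
EXPERT_FRACTION = 0.95  # hypothetical: share of parameters inside MoE experts
H100_GB = 80            # assumption: the 80 GB H100 variant

for name, bits in [("BF16", 16), ("FP8", 8), ("uniform 4-bit", 4)]:
    print(f"{name}: {weight_gb(TOTAL_PARAMS, bits):.1f} GB")

# Experts at 4 bits, everything else kept at 16 bits (assumed "full precision").
mixed = mixed_weight_gb(TOTAL_PARAMS, EXPERT_FRACTION, 4, 16)
fits_on_two_h100 = mixed < 2 * H100_GB
print(f"mixed 4-bit experts: {mixed:.2f} GB, fits on two H100s: {fits_on_two_h100}")
```

Under these assumptions the weights come to about 436 GB at BF16, 218 GB at FP8, and 109 GB at a uniform 4 bits. With a hypothetical 95% of parameters in 4-bit experts and the remainder at 16 bits, the estimate is about 125 GB, which sits below the 160 GB combined capacity of two 80 GB H100s and is consistent with the deployment claim in the article.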
More importantly, the new tokenizer dramatically improves tokenization efficiency for non-European languages, reducing the number of tokens required to generate responses in Arabic by 20%, Japanese by 18%, and Korean by 16%. Because inference costs are calculated per token, this translates directly to lower operational costs for global, multilingual, or non-English deployments.
Agentic workflows and high benchmarks on math, specialized fields
While raw speed and size dictate deployment, a model’s utility is defined by its product capabilities. Command A+ was built specifically for "agentic" tasks: workflows where the AI operates autonomously or semi-autonomously, uses external tools, queries databases, and synthesizes information across multiple steps.
The benchmark leaps over the previous generation are stark. On 𝜏²-Bench Telecom, which tests complex reasoning, the model jumped from a 37% score to 85%. On Terminal-Bench Hard, which measures agentic coding performance, it climbed from 3% to 25%. In complex mathematics, it scored 90% on AIME 25, up from 57%. Command A+ punches above its weight class (25B active parameters) in pure reasoning and mat