
Trends in AI — February '25: Reasoning Models

Dinos Papakostas

The world has been buzzing with the performance of DeepSeek R1, the first open-source reasoning model to compete with OpenAI's o1 series, and its implications for the dominance of Big Tech in the AGI race. But what exactly are reasoning models, and did anyone see them coming? (Spoiler: Frequent readers of our Trends in AI series certainly did!) Join us for an overview of one of the most exciting recent developments in AI R&D, as we navigate the key breakthroughs that led to this moment, and take a closer look at how they can create value in an enterprise setting.


Catch us live for the next edition of our webinar, covering all the latest news and developments in AI:

 

Our first prediction for AI in 2025, made in early January, was that inference-time compute would be the driving factor of AI progress this year. Given the global turmoil caused by DeepSeek R1, we are confident this prediction is on the right track. To cut through the hype and shed light on what this paradigm shift means, we created this special edition of the Trends in AI series, taking a closer look at the current level of AI capabilities, the applications of this line of research in knowledge-intensive enterprises, and the areas where reasoning models shine (as well as their current limitations).


Let's think step by step.

Reasoning with Large Language Models (LLMs) has been an active research field for over three years now, with this cornerstone paper by Google Brain (first published in January 2022) having amassed over 10,000 citations. In an era when LLMs had become good at following user-provided instructions, the most straightforward way to elicit reasoning behavior was to prompt the models to express their thought process "step by step" before arriving at their final answer. Over the following two years, this technique went mainstream and became a standard prompt-engineering approach for improving the performance and consistency of model outputs.
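For readers who want to try it, here is a minimal sketch of zero-shot chain-of-thought prompting with the OpenAI Python client; the model name and the question are placeholders, and the only essential ingredient is the trailing trigger phrase.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Zero-shot chain-of-thought: the trigger phrase nudges the model to write out
# its intermediate steps before committing to a final answer.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": f"{question}\nLet's think step by step."}],
)
print(response.choices[0].message.content)
```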


Standard vs. Chain-of-Thought Prompting diagrams. Standard gives incorrect answers; Chain-of-Thought shows step-by-step reasoning for correct answers.
Figure from Wei et al. (2022)
Training models with Reinforcement Learning to think before they respond.

The first model artifact that successfully showed how scaling inference-time compute can be just as effective as scaling the other two variables from the scaling laws (i.e. model and dataset size) was OpenAI's o1, released as a preview in September 2024. The model was unique in that it was trained with a Reinforcement Learning (RL) objective to "think productively with its chain of thought" before responding, which consistently improved its performance on benchmarks requiring reasoning, such as math problems and competitive programming questions.
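OpenAI has not published the o1 training recipe, but a simple way to get a feel for trading extra inference compute for accuracy is self-consistency: sample several chains of thought and take a majority vote over the final answers. The sketch below is only a stand-in for that idea, not OpenAI's method; the model name and prompt format are assumptions.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistency_answer(question: str, n_samples: int = 8) -> str:
    """Sample several chains of thought and majority-vote on the final answer.

    A simple stand-in for inference-time scaling: more samples means more
    compute and, typically, higher accuracy on reasoning-heavy tasks.
    """
    answers = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            temperature=0.8,      # diversity between the sampled chains
            messages=[{
                "role": "user",
                "content": f"{question}\nThink step by step, then put the final "
                           "answer on its own line, prefixed with 'ANSWER:'.",
            }],
        )
        for line in resp.choices[0].message.content.splitlines():
            if line.startswith("ANSWER:"):
                answers.append(line.removeprefix("ANSWER:").strip())
                break
    return Counter(answers).most_common(1)[0][0] if answers else ""
```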


Graphs with o1's AIME accuracy during training and at test time. Y-axis: pass@1 accuracy, X-axis: compute (log scale).
Figure by OpenAI.

The open-source community quickly rushed to reproduce this behavior, with Chinese labs such as Qwen and DeepSeek coming up with QwQ-32B-Preview and DeepSeek-R1-Lite-Preview to match the performance of o1-mini and o1-preview respectively, in late November 2024. With both labs having a strong track record of releasing the model weights and detailed recipes of how they got there, it was only a matter of time before the groundbreaking DeepSeek-R1 paper would have the research community (and soon after, the world) buzzing with how an MIT-licensed model could match the benchmark numbers of OpenAI's o1.


Bar chart comparing benchmark performances of AI models like DeepSeek-R1 and OpenAI o1.
Figure from DeepSeek-R1 technical report.
The magic sauce behind DeepSeek-R1.

From its technical report (and its official open-source implementation on Hugging Face), we know that DeepSeek-R1 is a Mixture-of-Experts (MoE) model with 671 billion total parameters, of which 37 billion are activated for each token. It uses the DeepSeek-V3-base model as its backbone, which required 2,664,000 H800 GPU hours (equivalent to ~$5,328,000 of compute cost) for pre-training, according to the DeepSeek-V3 technical report. The latter figure caused great panic in the US stock market, wiping away roughly $600 billion of NVIDIA's market value (although the stock mostly recovered in the following weeks), as many speculated whether the need for its high-end GPUs had been exaggerated among Big Tech companies that have been stockpiling them ever since the GPT-3 paper showed that scaling training compute leads to large and consistent performance gains.
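For reference, the headline figure follows from straightforward arithmetic, assuming the ~$2 per H800 GPU-hour rental rate used in the DeepSeek-V3 report:

```python
# Back-of-the-envelope pre-training cost for DeepSeek-V3, the backbone of R1,
# assuming the ~$2 per H800 GPU-hour rental rate used in its technical report.
gpu_hours = 2_664_000      # reported pre-training H800 GPU hours
usd_per_gpu_hour = 2.0     # assumed rental rate
print(f"~${gpu_hours * usd_per_gpu_hour:,.0f}")  # ~$5,328,000
```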


Table displaying training costs of DeepSeek-V3.
Table from DeepSeek-V3 technical report.

Notably, the R1 paper describes two artifacts: DeepSeek-R1-Zero, which was trained with RL alone on top of the base LLM and shows remarkable performance on reasoning tasks, and DeepSeek-R1, which incorporates an intermediate round of Supervised Fine-Tuning (SFT) through rejection sampling from R1-Zero and synthetic data distilled from DeepSeek-V3, and adds human preference data to the RL regime to create a polished, all-purpose language model.


The fact that R1-Zero competes head-to-head with o1 despite the lack of supervised data on how to develop a chain of thought in its response is a breakthrough in itself. A large part of the paper discusses how the RL objective (dubbed Group Relative Policy Optimization, or GRPO, introduced back in February 2024 (!) with DeepSeekMath) rewarded the model for naturally producing longer outputs with signs of reflection and self-criticism, which in turn improved its performance on downstream tasks, much as OpenAI showed with its o1 models. An integral part of the recipe, which has also been gaining traction in the academic community (and which we covered in the December edition of the Trends in AI series with Tülu 3), is RL with Verifiable Rewards: instead of relying on a critic model to predict the value of the model's outputs, the model is trained on domains where outputs can be precisely verified, such as with a numerical solver for math or unit tests for programming.
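To make the verifiable-rewards idea concrete, here is a minimal, illustrative sketch of a rule-based math reward plus GRPO-style group-normalized advantages. The actual DeepSeek-R1 reward also scores output formatting (e.g. proper think/answer tags), so treat the details below as assumptions rather than their exact recipe.

```python
import re
import statistics

def verifiable_math_reward(model_output: str, ground_truth: float) -> float:
    """Rule-based reward: check the final answer against a known ground truth,
    instead of asking a learned critic model to score the output."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)  # answer in \boxed{...}
    if match is None:
        return 0.0  # no parseable answer, no reward
    try:
        answer = float(match.group(1))
    except ValueError:
        return 0.0
    return 1.0 if abs(answer - ground_truth) < 1e-6 else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: each sampled output in a group is scored relative
    to the group's mean and standard deviation, removing the need for a value model."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```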


Accuracy table and line graph comparing AI models on benchmarks.
Table and figure from DeepSeek-R1 technical report.

Some of the equally important innovations from the research papers of both the current and previous-gen DeepSeek LLMs that contribute to their overall efficiency (and the pivotal $5M figure reported above) are: (i) a technique called Multi-head Latent Attention, which compresses the KV cache used in the attention mechanism by up to 93% (leading to significant and much-needed memory gains), (ii) a stable FP8 training process, allowing them to train their models using 8 bits per parameter instead of full 32-bit floating-point precision (though reduced-precision training is rather common in the industry), (iii) a multi-token prediction training objective, improving both accuracy and training efficiency, and (iv) an auxiliary-loss-free load-balancing strategy for the MoE routing, which, in combination with their high sparsity ratio (37B active out of 671B total parameters) and their sub-CUDA implementation for GPU communication and scheduling, further optimized the stability and efficiency of their training runs.
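A rough, parameters-only calculation gives a feel for what these numbers mean in practice (optimizer state, activations, and the exact MLA baseline are left out, so the figures are indicative only):

```python
# Rough, parameters-only arithmetic behind some of the efficiency claims.
total_params = 671e9
active_params = 37e9

# (iv) MoE sparsity: the fraction of the network that actually runs per token.
print(f"Active fraction per token: {active_params / total_params:.1%}")  # ~5.5%

# (ii) FP8 vs. FP16 weight storage (optimizer state and activations ignored).
print(f"Weights: {total_params * 2 / 1e9:,.0f} GB in FP16 "
      f"vs. {total_params * 1 / 1e9:,.0f} GB in FP8")

# (i) Multi-head Latent Attention: the reported up-to-93% KV-cache compression.
baseline_kv_gb = 100.0  # hypothetical baseline KV-cache size, for illustration only
print(f"KV cache: {baseline_kv_gb:.0f} GB -> {baseline_kv_gb * (1 - 0.93):.0f} GB with MLA")
```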


The current state of the frontier for reasoning models.

Table comparing models on Humanity's Last Exam
Table from Humanity's Last Exam website.

As everyone rushes to get their reasoning model out, the competition is heating up by the minute, and deciding which one is best overall becomes increasingly difficult. Leaderboard rankings often shift slightly depending on the domain and the particular benchmark, with the crowdsourced Chatbot Arena ranking Google DeepMind's Gemini 2.0 Flash Thinking on top, while the meticulously curated (and perhaps pompously named) Humanity's Last Exam shows o3-mini surpassing both Gemini and DeepSeek-R1.


While DeepSeek is struggling to meet the demand for R1 due to the US-enforced export controls that limit its available compute (with its unbeatable pricing of $0.55/$2.19 per 1M input/output tokens only heightening that demand), the model's permissive license has fired up a dozen other inference providers to host it. US-based Fireworks AI currently offers R1 for $3/$8 per 1M input/output tokens at a tenth of DeepSeek's latency (but also nearly half the throughput), which paints a good picture of how well optimized DeepSeek's inference stack is around their MoE architecture.
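As a back-of-the-envelope illustration of the price gap, with a made-up monthly token volume and the per-1M-token prices quoted above (both subject to change by the providers):

```python
# Hypothetical monthly workload; the token volumes are assumptions for illustration.
input_tokens, output_tokens = 500e6, 100e6  # 500M input, 100M output tokens

def monthly_cost(price_in: float, price_out: float) -> float:
    """Cost in USD given per-1M-token input and output prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

print(f"DeepSeek API:  ${monthly_cost(0.55, 2.19):,.0f}")  # ~$494
print(f"Fireworks AI:  ${monthly_cost(3.00, 8.00):,.0f}")  # ~$2,300
```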


For those looking to self-host DeepSeek-R1, although technically and commercially feasible, it is going to be quite a challenge given the system requirements needed to serve the model at a reasonable throughput. Its 671 billion parameters roughly translate to 1,342 GB of VRAM for FP16 inference, or roughly three nodes of 8x A100 with 80 GB VRAM each once caches and other parallelism overheads are included in the calculation. It's worth mentioning that along with R1, DeepSeek also released dense variants trained with direct output distillation on synthetic data from R1, using Qwen and Llama backbones. Though these models should be much easier to host, as they fall in the 1.5-70 billion parameter range, they are not "the real deal" and might underperform in non-benchmark scenarios (especially towards the lower end of the size spectrum).
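The sizing estimate above can be reproduced with a quick calculation; the overhead factor for KV caches and parallelism is an assumption and varies heavily with workload:

```python
import math

# Rough serving footprint for DeepSeek-R1 at FP16 (weights plus a crude headroom
# factor for KV caches and tensor/pipeline-parallelism overheads).
params = 671e9
weights_gb = params * 2 / 1e9   # 2 bytes per parameter -> ~1,342 GB
overhead_factor = 1.3           # assumed headroom, workload-dependent
needed_gb = weights_gb * overhead_factor

node_gb = 8 * 80                # one node: 8x A100 with 80 GB each
print(f"Weights: ~{weights_gb:,.0f} GB; with overhead: ~{needed_gb:,.0f} GB")
print(f"Nodes of 8x A100-80GB needed: {math.ceil(needed_gb / node_gb)}")  # ~3
```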


Table comparing baselines and R1-Distill models on AIME 2024, MATH-500, GPQA Diamond, and Codeforces.
Table from DeepSeek-R1 technical report.
The future of reasoning models and their applications.

We are already witnessing the initial next steps in the research literature regarding the practical application of reasoning models in large-scale, complex applications. A recurring theme that has emerged is their integration with multi-hop retrieval, which often involves reflecting on the results and dynamically iterating on the query set to guide the exploration of the search space.


Some examples in this field so far in 2025 have been:


At Zeta Alpha, we recognized the potential of inference-time scaling to improve both search depth and breadth in Retrieval-Augmented Generation (RAG) systems early on. Our Research Assistant, released during KM World 2024 in November, marked a significant breakthrough in the overall accuracy and completeness of the generated outputs while providing direct references to the sources supporting each fact.


Diagram illustrating "Scaling Reasoning for RAG" using LLMs for research.
A simplified illustration of the mechanics behind Zeta Alpha's Research Assistant.

In our view, the combination of a self-refining agentic behavior with powerful reasoning models as planners and orchestrators can deliver tangible results for challenging questions where naive RAG, with its top-k result templating, falls short. Unlike other commercial solutions that adopt a rigid and pre-defined approach to AI-powered assistants, our solution is rooted in customization and domain adaptation, offering a highly modular system that can access diverse information sources and produce comprehensive reports on both proprietary and public data.
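To illustrate the shape of such a loop (and only the shape: the retrieval backend, the reasoning-model call, and the stopping heuristic below are all placeholders, not Zeta Alpha's implementation), here is a highly simplified sketch:

```python
from typing import Callable

def reasoning_rag(
    question: str,
    retrieve: Callable[[str], list[str]],  # search backend: query -> text snippets
    generate: Callable[[str], str],        # reasoning LLM: prompt -> completion
    max_hops: int = 3,
) -> str:
    """Simplified multi-hop RAG loop with a reasoning model as planner/orchestrator."""
    queries = [question]
    evidence: list[str] = []

    for _ in range(max_hops):
        # 1. Retrieve documents for the current set of queries.
        for q in queries:
            evidence.extend(retrieve(q))

        # 2. Reflect: is the evidence sufficient, or are follow-up queries needed?
        reflection = generate(
            f"Question: {question}\nEvidence so far:\n" + "\n".join(evidence) +
            "\nIf the evidence is sufficient, reply DONE; otherwise list "
            "follow-up search queries, one per line."
        )
        if reflection.strip() == "DONE":
            break
        queries = [line.strip() for line in reflection.splitlines() if line.strip()]

    # 3. Final synthesis, citing the retrieved sources.
    return generate(
        "Answer the question using only the evidence below, citing your sources.\n"
        f"Question: {question}\nEvidence:\n" + "\n".join(evidence)
    )
```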


Do you want to learn more about reasoning models? Watch the full episode of our Trends in AI webinar, where we covered all of these topics in greater detail, or reach out to our experts to find out how you can capitalize on this development with Zeta Alpha, converting the existing knowledge within your organization into actionable insights and turning weeks of research into minutes.


Until next time, enjoy discovery!

