As the year winds down, we have a moment to pause and look back at the remarkable progress in AI over the past months. RAG and agents have transitioned from research concepts to standard enterprise lingo, while inference-time scaling is emerging as the latest frontier of innovation. Join us for our year-end special covering the most exciting developments in AI R&D, model releases, and the month's most trending research papers.
🔮✨🎇 Catch us live for the next edition of our webinar on January 17, kicking off 2025 with a set of hot takes and predictions on where the ball will be exactly one year from now:
News
Models
DeepSeek: DeepSeek-R1-Lite-Preview
Qwen: QwQ, Qwen-2.5-Turbo
MIT: Boltz-1
Mistral: Pixtral Large
Hugging Face: SmolVLM
Google DeepMind: AlphaQubit, GenCast
Trending AI papers for December 2024
[1] OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs - A. Asai et al. (U. Washington, Allen AI) - 21 November 2024
→ OpenScholar: a retrieval-augmented system that synthesizes academic literature to answer scientific queries through iterative refinement.
🤔 Why? Complex queries require identifying multiple relevant documents and generating long-form outputs with accurate citations.
💡 Key Results
OpenScholar-8B outperforms both GPT-4o (+5%) and PaperQA2 (+7%).
OpenScholar-GPT4o improves answer correctness over GPT4o by 12%.
GPT-4o hallucinates citations 78-90% of the time, while OpenScholar-GPT4o has citation accuracy on par with humans.
Humans prefer OpenScholar answers over expert-written answers more than 50% of the time, while GPT-4o answers are preferred only about 1 in 3 times.
[2] The Surprising Effectiveness of Test-Time Training for Abstract Reasoning - E. Akyürek et al. (MIT) - 11 November 2024
→ Test-time training (TTT): model weights are updated via gradient steps based on test-time inputs. Works well for the ARC task.
🤔 Why? LLMs perform poorly on ARC. Is TTT a solution?
💡 Key Results
TTT significantly improves LM performance on ARC, by up to 6x.
Ensembled with BARC program synthesis, achieves SOTA on ARC among published, purely neural models (61.9%, above the average human score).
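The TTT loop can be sketched in a few lines. This is a toy illustration with a linear model and a hypothetical `ttt_predict` helper, not the authors' implementation: copy the trained weights, take gradient steps on the test task's demonstration pairs (as in ARC's input-output demos), then predict on the query.

```python
import numpy as np

def predict(W, x):
    # Base model: a plain linear map.
    return W @ x

def ttt_predict(W, demos, query, lr=0.1, steps=50):
    # Test-time training: adapt a *copy* of the trained weights on the
    # task's demonstration pairs, then predict on the query input.
    W = W.copy()
    for _ in range(steps):
        for x, y in demos:
            err = W @ x - y              # residual on one demo pair
            W -= lr * np.outer(err, x)   # gradient step on squared error
    return W @ query
```

The base weights are untouched, so each test task gets its own freshly adapted copy.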
[3] Tülu 3: Pushing Frontiers in Open Language Model Post-Training - N. Lambert et al. (Allen AI, U. Washington) - 22 November 2024
→ Proposes a post-training recipe for LMs that combines instruction & preference tuning with "RL with Verifiable Rewards" to improve math and instruction following skills.
🤔 Why? Careful data curation, targeting the LM's deficiencies during training, ensures balanced performance across the core skills.
💡 Key Results
Tülu 3 8B outperforms other open-weight models of the same size (Llama 3.1, Qwen 2.5, Mistral).
Tülu 3 70B matches closed models such as GPT-4o-mini and approaches the performance of Claude 3.5 Haiku.
[4] DeMo: Decoupled Momentum Optimization - B. Peng et al. (Nous Research) - 29 November 2024
→ DeMo is a new DL training algorithm that decouples momentum updates, reducing inter-accelerator communication.
🤔 Why? Not everyone has an Infiniband HPC setup for training large models. Think e.g. of distributed training across networks of volunteer computers.
💡 Key Results
Models trained with DeMo match or exceed AdamW-trained baselines.
Communication requirements cut by several orders of magnitude, enabling training of large neural networks with limited network bandwidth.
Negligible compute & memory overhead.
[5] Drowning in Documents: Consequences of Scaling Reranker Inference - M. Jacob et al. (Databricks) - 18 November 2024
→ Investigation into the performance of rerankers when scoring a large number of documents.
🤔 Why? Rerankers are usually evaluated only as a second-stage re-scoring method; this work evaluates them in more depth, at scale.
💡 Key Results
Rerankers initially improve recall (K<100) but then degrade significantly, performing worse than dense retrievers.
For large K, rerankers assign high scores to irrelevant documents (despite minimal lexical/semantic overlap with the query).
For current-gen rerankers, K needs to be tuned per dataset.
Listwise reranking with LLMs seems more promising & robust.
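A generic two-stage pipeline makes the role of K concrete. This is a minimal sketch, not the paper's setup; `rerank_score` stands in for any cross-encoder reranker:

```python
import numpy as np

def two_stage_search(query_vec, doc_vecs, rerank_score, k=100, top_n=10):
    # Stage 1: cheap dense retrieval -- keep the K highest-scoring docs.
    first_stage = np.argsort(doc_vecs @ query_vec)[::-1][:k]
    # Stage 2: the expensive reranker rescores only those K candidates.
    # Per [5], K must be tuned per dataset: too large a K lets the
    # reranker surface irrelevant documents.
    reranked = sorted(first_stage, key=lambda i: rerank_score(i), reverse=True)
    return reranked[:top_n]
```

Growing K hands the reranker more candidates to mis-score, which is exactly the degradation the paper measures.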
[6] Learning high-accuracy error decoding for quantum processors - J. Bausch et al. (Google DeepMind) - 20 November 2024
→ AlphaQubit: a new transformer-based neural net designed to decode a leading quantum error-correction code.
🤔 Why? Solves a major issue in scaling up quantum computing systems.
💡 Key Results
Outperforms state-of-the-art decoders on data from Google's Sycamore quantum processor for distance-3 and distance-5 surface codes.
Works well on data with realistic noise, e.g. cross-talk and leakage, by utilizing soft readouts and leakage information.
Unfortunately, it is still too slow to be applied in practice: current decoding is 1-2 orders of magnitude slower than the target rate of 1 μs per error-correction round.
[7] M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding - J. Cho et al. (Bloomberg) - 07 November 2024
→ M3Doc: Multi-modal Multi-page Multi-document. A multimodal RAG framework for DocVQA that handles multiple document contexts, question hops, and evidence modalities.
🤔 Why? Emphasis on getting answers from multi-page docs & visual information.
💡 Key Results
ColPali + Qwen2-VL is the best combo in 3 DocVQA benchmarks.
Successful even when relevant information spans multiple pages, or is only present in images.
A FAISS IVF index reduces search latency with minimal accuracy loss.
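The paper uses FAISS for this; the IVF idea itself can be sketched in plain NumPy (a toy k-means coarse quantizer, not the FAISS implementation): partition the vectors into cells, then at query time scan only the few cells nearest the query instead of the whole collection.

```python
import numpy as np

def build_ivf(docs, nlist=8, iters=5, seed=0):
    # Toy IVF: a k-means coarse quantizer partitions docs into nlist cells.
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), nlist, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((docs[:, None] - centroids) ** 2).sum(-1), axis=1)
        for c in range(nlist):
            if (assign == c).any():
                centroids[c] = docs[assign == c].mean(axis=0)
    # Final assignment against the final centroids.
    assign = np.argmin(((docs[:, None] - centroids) ** 2).sum(-1), axis=1)
    return centroids, assign

def ivf_search(query, docs, centroids, assign, nprobe=2, top_n=3):
    # Probe only the nprobe nearest cells instead of scanning all docs,
    # trading a little recall for much lower search latency.
    cells = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.where(np.isin(assign, cells))[0]
    order = np.argsort(((docs[cand] - query) ** 2).sum(-1))
    return cand[order][:top_n]
```

`nprobe` is the accuracy/latency knob: probing more cells recovers recall at the cost of scanning more candidates.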
[8] Generative Agent Simulations of 1,000 People - J. Park et al. (Stanford) - 15 November 2024
→ Based on interviews, an ‘agent bank’ of 1000 generative agents was created to perform simulations with digital twins of real (US) people.
🤔 Why? Access to an agent bank modeled on a stratified sample of people can help social science studies using AI-based tools.
💡 Key Results
Agents replicated participants' responses on the General Social Survey (GSS) and Big Five personality tests well.
Good predictive power for individual attitudes, traits, and behaviors.
Interview-based agents outperform demographic- or persona-based ones.
[9] Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat - R. Daynauth et al. (U. Michigan) - 19 November 2024
→ Guidelines for LLM selection through pairwise comparisons using human-defined criteria.
🤔 Why? Pairwise ranking is more reliable than traditional benchmarks as it aligns with human judgment, but can be sensitive to parameters.
💡 Key Results
Elo: highest prediction accuracy on uneven datasets, but very sensitive to the K-factor; the authors do not recommend using it.
Bradley-Terry: best in preserving transitivity. Recommendation: small (balanced) datasets.
Glicko: effective in managing uncertainty and preventing low matchup models from being ranked too high, consistent across different datasets. Recommendation: large (and especially uneven) datasets.
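For reference, the Elo update the paper analyzes is a one-liner, and the K-factor the authors warn about directly scales every rating change. This is the standard textbook form, not the paper's code:

```python
def elo_update(r_a, r_b, score_a, k=32):
    # score_a: 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)   # K scales every rating change
    return r_a + delta, r_b - delta

# With equal ratings, a win moves each rating by k/2:
elo_update(1000.0, 1000.0, 1.0)        # → (1016.0, 984.0)
elo_update(1000.0, 1000.0, 1.0, k=64)  # → (1032.0, 968.0)
```

Doubling K doubles every update, which is the sensitivity that makes Elo rankings unstable across datasets.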
[10] The Super Weight in Large Language Models - M. Yu et al. (Apple) - 11 November 2024
→ "Super weights" play a critical role in LLMs. Pruning even a single superweight can drastically impact model performance, leading to gibberish text generation.
🤔 Why? This counterintuitive effect has a large impact on model quality under weight quantization.
💡 Key Results
Identified super weights that significantly influence model quality.
Developed a data-free method to detect super weights with a single forward pass.
Demonstrated that preserving super weights during quantization improves model accuracy noticeably.
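A toy analogue of the pruning experiment (not the paper's detection method, which inspects activations during a forward pass): zero the single largest-magnitude entry of a weight matrix and observe how much the layer's output shifts.

```python
import numpy as np

def prune_largest_weight(W):
    # Zero out the single largest-magnitude entry -- a stand-in for
    # pruning one "super weight" in a real LLM layer.
    Wp = W.copy()
    i, j = np.unravel_index(np.abs(Wp).argmax(), Wp.shape)
    Wp[i, j] = 0.0
    return Wp, (i, j)
```

When the matrix contains one extreme outlier, removing just that entry dominates the output change, mirroring the paper's observation at toy scale.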
And a few runner-ups:
PaliGemma 2: A Family of Versatile VLMs for Transfer - A. Steiner et al. (Google DeepMind) - 04 December 2024
Scaling Laws for Precision - T. Kumar et al. (Harvard, MIT) - 06 November 2024
DMQR-RAG: Diverse Multi-Query Rewriting for RAG - Z. Li et al. (Kuaishou Technology) - 20 November 2024
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step - G. Xu et al. (Tsinghua, PKU) - 15 November 2024
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions - Y. Zhao et al. (Alibaba) - 21 November 2024
Generative World Explorer - T. Lu et al. (JHU) - 18 November 2024
You can find an annotated collection of these papers (+ more that didn't make the cut) in Zeta Alpha, allowing you to easily discover relevant literature and dive deeper into any topic that interests you.
Here is a 3-minute overview of the papers in our top-10 list:
As always, the full recording of our latest Trends in AI episode is available on our YouTube, covering all of the news, model releases, and papers in depth.
Have a great New Year, and until next time, enjoy discovery!