As the year winds down, we have a moment to pause and look back at the remarkable progress in AI over the past months. RAG and agents have transitioned from research concepts to standard enterprise lingo, while inference-time scaling is emerging as the latest frontier of innovation. Join us for our year-end special covering the most exciting developments in AI R&D, model releases, and the month's most trending research papers.
🔮✨🎇 Catch us live for the next edition of our webinar on January 17, kicking off 2025 with a set of hot takes and predictions on where the ball will be exactly one year from now:
News
Models
DeepSeek: DeepSeek-R1-Lite-Preview
Qwen: QwQ, Qwen-2.5-Turbo
MIT: Boltz-1
Mistral: Pixtral Large
Hugging Face: SmolVLM
Google DeepMind: AlphaQubit, GenCast
Trending AI papers for December 2024
[1] OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs - A. Asai et al. (U. Washington, Allen AI) - 21 November 2024
→ OpenScholar: a retrieval-augmented system that synthesizes academic literature to answer scientific queries through iterative refinement.
🤔 Why? Complex queries require identifying multiple relevant documents and generating long-form outputs with accurate citations.
💡 Key Results
OpenScholar-8B outperforms both GPT-4o (+5%) and PaperQA2 (+7%).
OpenScholar-GPT4o improves answer correctness over GPT4o by 12%.
GPT-4o hallucinates citations 78-90% of the time, while OpenScholar-GPT4o has citation accuracy on par with humans.
Humans prefer OpenScholar answers over expert-written answers more than 50% of the time, while GPT-4o answers are preferred only about 1 in 3 times.
[2] The Surprising Effectiveness of Test-Time Training for Abstract Reasoning - E. Akyürek et al. (MIT) - 11 November 2024
→ Test-time training (TTT): model weights are updated via gradient steps based on test-time inputs. Works well for the ARC task.
🤔 Why? LLMs perform poorly on ARC. Is TTT a solution?
💡 Key Results
TTT significantly improves LM performance on ARC, by up to 6x.
Ensembled with BARC program synthesis, achieves SOTA on ARC among published, purely neural models (61.9%, above the average human score).
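The TTT loop can be sketched in a few lines. This is a toy illustration with a linear model and a hypothetical `ttt_predict` helper, not the authors' implementation: copy the trained weights, take gradient steps on the test task's demonstration pairs (as in ARC's input-output demos), then predict on the query.

```python
import numpy as np

def predict(W, x):
    # Base model: a plain linear map.
    return W @ x

def ttt_predict(W, demos, query, lr=0.1, steps=50):
    # Test-time training: adapt a *copy* of the trained weights on the
    # task's demonstration pairs, then predict on the query input.
    W = W.copy()
    for _ in range(steps):
        for x, y in demos:
            err = W @ x - y              # residual on one demo pair
            W -= lr * np.outer(err, x)   # gradient step on squared error
    return W @ query
```

The base weights are untouched, so each test task gets its own freshly adapted copy.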
[3] Tülu 3: Pushing Frontiers in Open Language Model Post-Training - N. Lambert et al. (Allen AI, U. Washington) - 22 November 2024
→ Proposes a post-training recipe for LMs that combines instruction & preference tuning with "RL with Verifiable Rewards" to improve math and instruction following skills.
🤔 Why? Careful data curation, targeting the LM's deficiencies during training, ensures balanced performance across the core skills.
💡 Key Results
Tülu 3 8B outperforms other open-weight models of the same size (Llama 3.1, Qwen 2.5, Mistral).
Tülu 3 70B matches closed models such as GPT-4o-mini and approaches the performance of Claude 3.5 Haiku.
[4] DeMo: Decoupled Momentum Optimization - B. Peng et al. (Nous Research) - 29 November 2024
→ DeMo is a new DL training algorithm that decouples momentum updates, reducing inter-accelerator communication.
🤔 Why? Not everyone has an Infiniband HPC setup for training large models. Think e.g. of distributed training across networks of volunteer computers.
💡 Key Results
Models trained with DeMo match or exceed AdamW-trained baselines.
Communication requirements cut by several orders of magnitude, enabling training of large neural networks with limited network bandwidth.
Negligible compute & memory overhead.
[5] Drowning in Documents: Consequences of Scaling Reranker Inference - M. Jacob et al. (Databricks) - 18 November 2024
→ Investigation into the performance of rerankers when scoring a large number of documents.
🤔 Why? Rerankers are usually evaluated only as a second-stage re-scoring method; this work evaluates them in more depth, at scale.
💡 Key Results
Rerankers initially improve recall (K<100) but then degrade significantly, performing worse than dense retrievers.
For large K, rerankers assign high scores to irrelevant documents (despite minimal lexical/semantic overlap with the query).
For current-gen rerankers, K needs to be tuned per dataset.
Listwise reranking with LLMs seems more promising & robust.
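A generic two-stage pipeline makes the role of K concrete. This is a minimal sketch, not the paper's setup; `rerank_score` stands in for any cross-encoder reranker:

```python
import numpy as np

def two_stage_search(query_vec, doc_vecs, rerank_score, k=100, top_n=10):
    # Stage 1: cheap dense retrieval -- keep the K highest-scoring docs.
    first_stage = np.argsort(doc_vecs @ query_vec)[::-1][:k]
    # Stage 2: the expensive reranker rescores only those K candidates.
    # Per [5], K must be tuned per dataset: too large a K lets the
    # reranker surface irrelevant documents.
    reranked = sorted(first_stage, key=lambda i: rerank_score(i), reverse=True)
    return reranked[:top_n]
```

Growing K hands the reranker more candidates to mis-score, which is exactly the degradation the paper measures.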
[6] Learning high-accuracy error decoding for quantum processors - J. Bausch et al. (Google DeepMind) - 20 November 2024
→ AlphaQubit: a new transformer-based neural net designed to decode a leading quantum error-correction code.
🤔 Why? Solves a major issue in scaling up quantum computing systems.
💡 Key Results
Outperforms state-of-the-art decoders on data from Google's Sycamore quantum processor for distance-3 and distance-5 surface codes.
Works well on data with realistic noise, e.g. cross-talk and leakage, by utilizing soft readouts and leakage information.
Unfortunately, it is still too slow to be applied in practice: current decoding is 1-2 orders of magnitude slower than the target rate of 1 μs per error-correction round.
[7] M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding - J. Cho et al. (Bloomberg) - 07 November 2024
→ M3Doc: Multi-modal Multi-page Multi-document. A multimodal RAG framework for DocVQA that handles multiple document contexts, question hops, and evidence modalities.
🤔 Why? Emphasis on getting answers from multi-page docs & visual information.
💡 Key Results
ColPali + Qwen2-VL is the best combo in 3 DocVQA benchmarks.
Successful even when relevant information spans multiple pages, or is only present in images.
A FAISS IVF index reduces search latency with minimal accuracy loss.
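The paper uses FAISS for this; the IVF idea itself can be sketched in plain NumPy (a toy k-means coarse quantizer, not the FAISS implementation): partition the vectors into cells, then at query time scan only the few cells nearest the query instead of the whole collection.

```python
import numpy as np

def build_ivf(docs, nlist=8, iters=5, seed=0):
    # Toy IVF: a k-means coarse quantizer partitions docs into nlist cells.
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), nlist, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((docs[:, None] - centroids) ** 2).sum(-1), axis=1)
        for c in range(nlist):
            if (assign == c).any():
                centroids[c] = docs[assign == c].mean(axis=0)
    # Final assignment against the final centroids.
    assign = np.argmin(((docs[:, None] - centroids) ** 2).sum(-1), axis=1)
    return centroids, assign

def ivf_search(query, docs, centroids, assign, nprobe=2, top_n=3):
    # Probe only the nprobe nearest cells instead of scanning all docs,
    # trading a little recall for much lower search latency.
    cells = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.where(np.isin(assign, cells))[0]
    order = np.argsort(((docs[cand] - query) ** 2).sum(-1))
    return cand[order][:top_n]
```

`nprobe` is the accuracy/latency knob: probing more cells recovers recall at the cost of scanning more candidates.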
[8] Generative Agent Simulations of 1,000 People - J. Park et al. (Stanford) - 15 November 2024
→ Based on interviews, an ‘agent bank’ of 1000 generative agents was created to perform simulations with digital twins of real (US) people.
🤔 Why? Access to an agent bank modeled on a stratified sample of people can help social science studies using AI-based tools.
💡 Key Results
Agents replicated participants' responses on the General Social Survey (GSS) and Big Five personality tests well.
Good predictive power for individual attitudes, traits, and behaviors.
Interview-based agents outperform demographic- or persona-based ones.
[9] Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat - R. Daynauth et al. (U. Michigan) - 19 November 2024
→ Guidelines for LLM selection through pairwise comparisons using human-defined criteria.
🤔 Why? Pairwise ranking is more reliable than traditional benchmarks as it aligns with human judgment, but can be sensitive to parameters.
💡 Key Results
Elo: highest prediction accuracy on uneven datasets, but very sensitive to the K-factor; the authors do not recommend using it.
Bradley-Terry: best in preserving transitivity. Recommendation: small (balanced) datasets.
Glicko: effective in managing uncertainty and preventing low matchup models from being ranked too high, consistent across different datasets. Recommendation: large (and especially uneven) datasets.
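For reference, the Elo update the paper analyzes is a one-liner, and the K-factor the authors warn about directly scales every rating change. This is the standard textbook form, not the paper's code:

```python
def elo_update(r_a, r_b, score_a, k=32):
    # score_a: 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)   # K scales every rating change
    return r_a + delta, r_b - delta

# With equal ratings, a win moves each rating by k/2:
elo_update(1000.0, 1000.0, 1.0)        # → (1016.0, 984.0)
elo_update(1000.0, 1000.0, 1.0, k=64)  # → (1032.0, 968.0)
```

Doubling K doubles every update, which is the sensitivity that makes Elo rankings unstable across datasets.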
[10] The Super Weight in Large Language Models - M. Yu et al. (Apple) - 11 November 2024
→ "Super weights" play a critical role in LLMs. Pruning even a single superweight can drastically impact model performance, leading to gibberish text generation.
🤔 Why? This counterintuitive effect has a large impact on model quality under weight quantization.
💡 Key Results
Identified super weights that significantly influence model quality.
Developed a data-free method to detect super weights with a single forward pass.
Demonstrated that preserving super weights during quantization improves model accuracy noticeably.
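A toy analogue of the pruning experiment (not the paper's detection method, which inspects activations during a forward pass): zero the single largest-magnitude entry of a weight matrix and observe how much the layer's output shifts.

```python
import numpy as np

def prune_largest_weight(W):
    # Zero out the single largest-magnitude entry -- a stand-in for
    # pruning one "super weight" in a real LLM layer.
    Wp = W.copy()
    i, j = np.unravel_index(np.abs(Wp).argmax(), Wp.shape)
    Wp[i, j] = 0.0
    return Wp, (i, j)
```

When the matrix contains one extreme outlier, removing just that entry dominates the output change, mirroring the paper's observation at toy scale.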
And a few runner-ups:
PaliGemma 2: A Family of Versatile VLMs for Transfer - A. Steiner et al. (Google DeepMind) - 04 December 2024
Scaling Laws for Precision - T. Kumar et al. (Harvard, MIT) - 06 November 2024
DMQR-RAG: Diverse Multi-Query Rewriting for RAG - Z. Li et al. (Kuaishou Technology) - 20 November 2024
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step - G. Xu et al. (Tsinghua, PKU) - 15 November 2024
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions - Y. Zhao et al. (Alibaba) - 21 November 2024
Generative World Explorer - T. Lu et al. (JHU) - 18 November 2024
You can find an annotated collection of these papers (+ more that didn't make the cut) in Zeta Alpha, allowing you to easily discover relevant literature and dive deeper into any topic that interests you.
Here is a 3-minute overview of the papers in our top-10 list:
As always, the full recording of our latest Trends in AI episode is available on our YouTube, covering all of the news, model releases, and papers in depth.
Have a great New Year, and until next time, enjoy discovery!