By Dinos Papakostas

Trends in AI — December 2024

As the year winds down, we have a moment to pause and look back at the remarkable progress in AI over the past months. RAG and agents have transitioned from research concepts to standard enterprise lingo, while inference-time scaling has emerged as the latest frontier of innovation. Join us in our year-end special as we review the most exciting developments in AI R&D, the latest model releases, and the month's most trending research papers.

 

🔮✨🎇 Catch us live for the next edition of our webinar on January 17, kicking off 2025 with a set of hot takes and predictions on where the ball will be exactly one year from now.

 

News

Models

Trending AI papers for December 2024

[1] OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs - A. Asai et al. (U. Washington, Allen AI) - 21 November 2024

OpenScholar: a retrieval-augmented system that synthesizes academic literature to answer scientific queries through iterative refinement.



🤔 Why? Complex queries require identifying multiple relevant documents and generating long-form outputs with accurate citations.


💡 Key Results

  • OpenScholar-8B outperforms both GPT-4o (+5%) and PaperQA2 (+7%).

  • OpenScholar-GPT4o improves answer correctness over GPT-4o by 12%.

  • GPT-4o hallucinates citations 78-90% of the time, while OpenScholar-GPT4o has citation accuracy on par with humans.

  • Humans prefer OpenScholar over expert answers >50% of the time, while GPT-4o is only preferred 1 out of 3 times.
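
For the curious, here is a minimal, toy sketch of what such a retrieval-then-refine loop looks like; all helpers below are hypothetical stand-ins for the paper's retriever, reranker, and self-feedback prompts, not its actual API.

    # Toy sketch of an OpenScholar-style iterative-refinement loop.
    # retrieve(), generate(), and get_feedback() are hypothetical stand-ins.

    def retrieve(query, top_k=3):
        corpus = ["passage A", "passage B", "passage C", "passage D"]
        return corpus[:top_k]  # a real system uses a dense retriever + reranker

    def generate(query, passages, feedback=None):
        return f"Cited answer to '{query}' drawing on {len(passages)} passages"

    def get_feedback(draft):
        # a real system prompts the LM to critique its own draft
        # (missing citations, unsupported claims, coverage gaps)
        return None  # None = no remaining issues

    def answer(query, max_iters=3):
        passages = retrieve(query)
        draft = generate(query, passages)
        for _ in range(max_iters):
            feedback = get_feedback(draft)
            if feedback is None:
                break
            passages += retrieve(feedback)  # fetch extra evidence for the critique
            draft = generate(query, passages, feedback)
        return draft

    print(answer("How do retrieval-augmented LMs cite literature?"))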


[2] The Surprising Effectiveness of Test-Time Training for Abstract Reasoning - E. Akyürek et al. (MIT) - 11 November 2024

Test-time training (TTT): model weights are temporarily updated via gradient steps on examples derived from each test-time input. Works well on the ARC (Abstraction and Reasoning Corpus) benchmark.


🤔 Why? LLMs perform poorly on ARC. Is TTT a solution?


💡 Key Results

  • TTT significantly improves LM performance on ARC, by up to 6x.

  • Ensembled with BARC's program-synthesis approach, achieves the state of the art among published, purely neural models on ARC (61.9%, above the average human score).
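
To make the mechanism concrete, here is a toy sketch of the TTT recipe, with a tiny regression model standing in for an LM and random tensors standing in for ARC grids (the paper itself fine-tunes LoRA adapters on leave-one-out and augmented versions of each task's demonstrations):

    # Toy sketch of test-time training: take a few gradient steps on the
    # test task's demonstration pairs, predict, then restore the base weights.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 4))
    base_state = {k: v.clone() for k, v in model.state_dict().items()}

    def ttt_predict(demo_x, demo_y, test_x, steps=20, lr=1e-2):
        model.load_state_dict(base_state)  # fresh copy of the base weights
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        for _ in range(steps):  # gradient steps on the task's own demos
            loss = nn.functional.mse_loss(model(demo_x), demo_y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            return model(test_x)  # predict with task-adapted weights

    demo_x, demo_y = torch.randn(3, 4), torch.randn(3, 4)
    print(ttt_predict(demo_x, demo_y, torch.randn(1, 4)))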



[3] Tülu 3: Pushing Frontiers in Open Language Model Post-Training - N. Lambert et al. (Allen AI, U. Washington) - 22 November 2024

Proposes a post-training recipe for LMs that combines instruction & preference tuning with "RL with Verifiable Rewards" to improve math and instruction following skills.



🤔 Why? Careful data curation that targets the LM's deficiencies during training ensures balanced performance across the core skills.


💡 Key Results

  • Tülu 3 8B outperforms other open-weight models of the same size (Llama 3.1, Qwen 2.5, Mistral).

  • Tülu 3 70B matches closed models such as GPT-4o-mini and approaches the performance of Claude 3.5 Haiku.
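
The "Verifiable Rewards" part is simple to illustrate: instead of a learned reward model, the reward is binary and comes from programmatically checking the answer. A minimal sketch (the answer-extraction regex below is our own invention, not the paper's):

    # Sketch of RL with Verifiable Rewards (RLVR): reward = 1 if the
    # completion's final answer can be verified against the ground truth.
    import re

    def extract_final_answer(completion: str) -> str:
        # hypothetical parser; real pipelines use task-specific extraction
        match = re.search(r"answer is (\S+)", completion.lower())
        return match.group(1).strip(".") if match else ""

    def verifiable_reward(completion: str, gold: str) -> float:
        return 1.0 if extract_final_answer(completion) == gold.lower() else 0.0

    # These binary rewards then drive a standard policy-gradient update.
    print(verifiable_reward("So the answer is 42.", "42"))  # 1.0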


[4] DeMo: Decoupled Momentum Optimization - B. Peng et al. (Nous Research) - 29 November 2024

DeMo is a new DL training algorithm that decouples momentum updates, reducing inter-accelerator communication.


🤔 Why? Not everyone has an InfiniBand HPC setup for training large models. Think, e.g., of distributed training across networks of volunteer computers.


💡 Key Results

  • Models trained with DeMo match or exceed the quality of their AdamW-trained counterparts.

  • Communication requirements cut by several orders of magnitude, enabling training of large neural networks with limited network bandwidth.

  • Negligible compute & memory overhead.
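
A heavily simplified, single-worker sketch of the idea, assuming top-k magnitude selection in place of the paper's DCT-based extraction of fast momentum components:

    # Toy sketch of decoupled momentum: keep the full momentum locally,
    # but share only its fastest-moving components across workers.
    import torch

    def demo_step(param, grad, momentum, beta=0.9, lr=1e-3, k=4):
        momentum.mul_(beta).add_(grad)     # local momentum accumulation
        flat = momentum.flatten()
        idx = flat.abs().topk(k).indices   # fast components to transmit
        shared = torch.zeros_like(flat)
        shared[idx] = flat[idx]
        flat[idx] = 0                      # decouple: drop what was transmitted
        # in a real run, `shared` is what gets all-reduced across accelerators
        param.add_(shared.view_as(param), alpha=-lr)

    p, m = torch.randn(8), torch.zeros(8)
    demo_step(p, torch.randn(8), m)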



[5] Drowning in Documents: Consequences of Scaling Reranker Inference - M. Jacob et al. (Databricks) - 18 November 2024

Investigation into the performance of rerankers when scoring a large number of documents.


🤔 Why? Rerankers are usually evaluated only as a second-stage re-scoring step over a small candidate set; this work tests them far beyond that regime.


💡 Key Results

  • Rerankers initially improve recall (K<100) but then degrade significantly, performing worse than dense retrievers.

  • For large K, rerankers assign high scores to irrelevant documents (despite minimal lexical/semantic overlap with the query).

  • For current-gen rerankers, K needs to be tuned per dataset.

  • Listwise reranking with LLMs seems more promising & robust.
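
For context, the two-stage setup under study looks roughly like this, with K as the knob being stress-tested (the scoring functions are toy stand-ins):

    # Sketch of the retrieve-then-rerank pipeline. The paper's finding:
    # growing K too far lets the reranker promote irrelevant documents,
    # so K should be tuned per dataset.

    def toy_score(query, doc):
        return len(set(query.split()) & set(doc.split()))  # stand-in scorer

    def first_stage(query, corpus, k):
        return sorted(corpus, key=lambda d: -toy_score(query, d))[:k]

    def rerank(query, candidates, top_n=2):
        # a real cross-encoder scores each (query, doc) pair jointly
        return sorted(candidates, key=lambda d: -toy_score(query, d))[:top_n]

    corpus = ["dense retrieval survey", "neural rerankers explained", "pasta recipes"]
    candidates = first_stage("dense retrieval rerankers", corpus, k=2)
    print(rerank("dense retrieval rerankers", candidates))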


[6] Learning high-accuracy error decoding for quantum processors - J. Bausch et al. (Google DeepMind) - 20 November 2024

AlphaQubit: a new transformer-based neural net designed to decode a leading quantum error-correction code.



🤔 Why? Fast, accurate error decoding is a major bottleneck in scaling up quantum computing systems.


💡 Key Results

  • Outperforms state-of-the-art decoders on data from Google's Sycamore quantum processor for distance-3 and distance-5 surface codes. 

  • Works well on data with realistic noise, e.g. cross-talk and leakage, by utilizing soft readouts and leakage information. 

  • Unfortunately still too slow to be applied in practice: current decoding is 1-2 orders of magnitude slower than the target rate of one decoding round per μs.
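
Abstracting away the physics, the learning problem has roughly this shape: read a sequence of stabilizer (syndrome) measurements and classify whether a logical error occurred. A toy sketch, far smaller and simpler than AlphaQubit's actual architecture and inputs:

    # Toy sketch of decoding-as-sequence-modelling: a transformer reads the
    # history of syndrome measurements and predicts a logical error.
    # AlphaQubit is far larger and also consumes soft/analog readouts.
    import torch
    import torch.nn as nn

    n_rounds, n_stabilizers = 5, 8
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=n_stabilizers, nhead=2, batch_first=True),
        num_layers=2,
    )
    head = nn.Linear(n_stabilizers, 1)  # logit for P(logical flip)

    syndromes = torch.randint(0, 2, (1, n_rounds, n_stabilizers)).float()
    logit = head(encoder(syndromes).mean(dim=1))
    print(torch.sigmoid(logit))  # predicted probability of a logical error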



[7] M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding - J. Cho et al. (UNC Chapel Hill, Bloomberg) - 7 November 2024

M3DocRAG (Multi-modal, Multi-page, Multi-document): a multimodal RAG framework for DocVQA that handles multiple document contexts, question hops, and evidence modalities.



🤔 Why? Emphasis on getting answers from multi-page docs & visual information.


💡 Key Results

  • ColPali + Qwen2-VL is the best combo in 3 DocVQA benchmarks.

  • Successful even when relevant information spans multiple pages, or is only present in images.

  • A FAISS IVF index reduces search latency with minimal accuracy loss.
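
The retrieval side of such a pipeline can be sketched in a few lines; random vectors stand in for ColPali page embeddings, and the final Qwen2-VL call is left as a comment:

    # Sketch of page-level multimodal retrieval with a FAISS IVF index.
    import faiss
    import numpy as np

    d, n_pages = 128, 10_000
    page_vecs = np.random.rand(n_pages, d).astype("float32")  # ColPali stand-in

    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, 100)  # 100 clusters, probe a few
    index.train(page_vecs)
    index.add(page_vecs)

    query_vec = np.random.rand(1, d).astype("float32")  # embedded question
    scores, page_ids = index.search(query_vec, 4)       # top pages across all docs
    print(page_ids)
    # the retrieved page images would then be passed to Qwen2-VL for answering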


[8] Generative Agent Simulations of 1,000 People - J. Park et al. (Stanford) - 15 November 2024

Based on interviews, an ‘agent bank’ of 1,000 generative agents was created to run simulations with digital twins of real (US) people.



🤔 Why? Access to an agent bank modeled on a stratified sample of people can help social science studies using AI-based tools.


💡 Key Results

  • Agents replicated participants' responses on the General Social Survey (GSS) and the Big Five personality inventory well.

  • Good predictive power for individual attitudes, traits, and behaviors.

  • Interview-based agents outperform demographic- or persona-based ones.
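
In spirit, the interview-based conditioning boils down to putting the full transcript in the prompt and asking the LM to answer "in character". A hypothetical sketch (ask_llm is a stub, not the paper's system):

    # Sketch of an interview-conditioned generative agent.

    def ask_llm(prompt: str) -> str:
        return "stubbed LM response"  # stand-in for a real LM call

    def make_agent(interview_transcript: str):
        def agent(survey_question: str) -> str:
            prompt = (
                "Below is an interview with a study participant.\n\n"
                + interview_transcript
                + "\n\nAnswer the following as this person would: "
                + survey_question
            )
            return ask_llm(prompt)
        return agent

    twin = make_agent("Q: Tell me about your work... A: ...")
    print(twin("Do you support increased science funding?"))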


[9] Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat - R. Daynauth et al. (U. Michigan) - 19 November 2024

Guidelines for LLM selection through pairwise comparisons using human-defined criteria.


🤔 Why? Pairwise ranking is more reliable than traditional benchmarks as it aligns with human judgment, but can be sensitive to parameters.


💡 Key Results

  • Elo: highest prediction accuracy on uneven datasets, but very sensitive to the K-factor; the authors do not recommend using it.

  • Bradley-Terry: best in preserving transitivity. Recommendation: small (balanced) datasets.

  • Glicko: effective at managing uncertainty and preventing models with few matchups from being ranked too high; consistent across different datasets. Recommendation: large (and especially uneven) datasets.
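
As a refresher on the most familiar of the three, here is the standard Elo update; the step size is governed entirely by the K-factor that the authors flag as a liability:

    # The Elo update after a single head-to-head comparison. The K-factor
    # sets the step size, which is exactly the sensitivity the paper flags.

    def elo_update(r_a, r_b, a_won, k=32):
        expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        delta = k * ((1.0 if a_won else 0.0) - expected_a)
        return r_a + delta, r_b - delta

    print(elo_update(1500, 1500, a_won=True))       # (1516.0, 1484.0)
    print(elo_update(1500, 1500, a_won=True, k=8))  # smaller K, smaller step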



[10] The Super Weight in Large Language Models - M. Yu et al. (Apple) - 11 November 2024

"Super weights" play a critical role in LLMs. Pruning even a single superweight can drastically impact model performance, leading to gibberish text generation.



🤔 Why? This counterintuitive effect has a large impact on model quality under weight quantization.


💡 Key Results

  • Identified super weights that significantly influence model quality.

  • Developed a data-free method to detect super weights with a single forward pass.

  • Demonstrated that preserving super weights during quantization improves model accuracy noticeably.
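
The detection idea fits in a few lines: run one forward pass and look for an activation channel whose magnitude dwarfs the rest. A toy sketch with a planted outlier, using a small linear layer as a stand-in for an LLM's down-projection:

    # Toy sketch of data-free super-weight detection via a single forward
    # pass: find the activation spike, then the weight that produces it.
    import torch
    import torch.nn as nn

    layer = nn.Linear(16, 16, bias=False)  # stand-in for a down-projection
    with torch.no_grad():
        layer.weight[3, 7] = 50.0          # plant an outlier "super weight"

    x = torch.ones(1, 16)                  # one forward pass is enough
    out = layer(x)
    channel = out[0].abs().argmax().item()                # spiking output channel
    in_dim = layer.weight[channel].abs().argmax().item()  # weight feeding it
    print(f"super-weight candidate at weight[{channel}, {in_dim}]")
    # keeping such weights at full precision during quantization is what
    # the paper finds noticeably improves quantized accuracy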


And a few runners-up:


You can find an annotated collection of these papers (+ more that didn't make the cut) in Zeta Alpha, allowing you to easily discover relevant literature and dive deeper into any topic that interests you.


Here is a 3-minute overview of the papers in our top-10 list:


As always, the full recording of our latest Trends in AI episode is available on our YouTube channel, covering all of the news, model releases, and papers in depth.


Have a great New Year, and until next time, enjoy discovery!
