The summer heat is on, with AI startups raising millions in funding, exciting open-source releases across a wide range of modalities, and brand new benchmarks to differentiate between the state-of-the-art models. Join us for an overview of the latest developments in AI R&D and a curated list of the month's top 10 trending research papers.
News Articles
Apple to ‘Pay’ OpenAI for ChatGPT Through Distribution, Not Cash
EvolutionaryScale, backed by Amazon and Nvidia, raises $142M for protein-generating AI
OpenAI’s former chief scientist is starting a new AI company
Etched looks to challenge Nvidia with an ASIC purpose-built for transformer models
Performances are plateauing, let's make the leaderboard steep again
Model Releases
Anthropic: Claude 3.5 Sonnet
Google: Gemma 2
Microsoft: Florence 2
NVIDIA: Nemotron 4
DeepSeek AI: DeepSeek Coder V2
Shanghai AI Lab: InternLM 2.5
Trending AI papers for June 2024
[1] Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels - J. Xian et al. (UWaterloo, Stanford, IBM) - 17 June 2024
→ PATH is a method for training neural IR models with as few as 10 labeled relevance judgments, using an LLM to generate synthetic training queries from a prompt that is itself treated and optimized as a training hyperparameter. 🤔 Why? Data scarcity is a common challenge in training IR systems. PATH reduces the cost and labor associated with large-scale data collection and labeling.
💡 Key Findings:
Models trained with PATH outperform much larger rerankers such as RankZephyr, despite being up to 100x smaller.
PATH models achieve an average improvement of 4.5 points nDCG@10 on BIRCO.
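To make the loop concrete, here is a minimal, illustrative sketch of the PATH idea (not the authors' code): an LLM prompt generates synthetic queries for unlabeled passages, the resulting pairs train a small reranker, and the handful of gold labels is used only to score, and thereby select, the best prompt. The `call_llm`, `train_reranker`, and `evaluate_ndcg` helpers are hypothetical placeholders.

```python
def generate_synthetic_pairs(passages, prompt_template, call_llm):
    """Use an LLM to invent a plausible query for each unlabeled passage."""
    pairs = []
    for passage in passages:
        query = call_llm(prompt_template.format(passage=passage))
        pairs.append((query, passage))  # (synthetic query, positive passage)
    return pairs

def score_prompt(prompt_template, passages, gold_qrels, call_llm,
                 train_reranker, evaluate_ndcg):
    """Train a reranker on synthetic data, then score it on the ~10 gold labels."""
    pairs = generate_synthetic_pairs(passages, prompt_template, call_llm)
    reranker = train_reranker(pairs)            # e.g. a small cross-encoder
    return evaluate_ndcg(reranker, gold_qrels)  # nDCG@10 on the gold labels

def optimize_prompt(candidate_prompts, passages, gold_qrels, **helpers):
    """Treat the prompt as a hyperparameter: keep the one whose synthetic
    data yields the best model on the tiny gold set."""
    return max(candidate_prompts,
               key=lambda p: score_prompt(p, passages, gold_qrels, **helpers))
```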
[2] TextGrad: Automatic "Differentiation" via Text - M. Yuksekgonul et al. (Stanford) - 11 June 2024
→ TEXTGRAD is a framework that uses LLMs to perform automatic “differentiation” through text: natural-language feedback acts as the “gradient” that is backpropagated to improve the individual components of a compound AI system.
🤔 Why? Traditional optimization methods require manual tuning or differentiable objectives, which are not always available.
💡 Key Findings: TEXTGRAD enhances the efficiency and efficacy of AI system improvements across various applications, from coding and problem-solving to molecular design and treatment plan optimization.
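As a rough illustration of the idea (a hand-rolled sketch, not the TEXTGRAD library's API), the "gradient" is simply a natural-language critique that a second LLM call turns into a revised value of the variable being optimized; `call_llm` is a hypothetical chat-completion helper.

```python
def textual_gradient_step(variable, loss_spec, call_llm):
    """One optimization step where the 'gradient' is natural-language feedback."""
    # Forward pass: evaluate the current value against a textual objective.
    critique = call_llm(
        f"Evaluate the following {loss_spec['role']} against this objective:\n"
        f"Objective: {loss_spec['objective']}\n"
        f"Current value: {variable}\n"
        "List concrete weaknesses and how to fix them."
    )
    # 'Backward' pass: apply the feedback to produce an improved value.
    improved = call_llm(
        f"Revise the {loss_spec['role']} below using the feedback.\n"
        f"Current value: {variable}\nFeedback: {critique}\n"
        "Return only the revised version."
    )
    return improved

# Example: iteratively refine a code snippet, a prompt, or a treatment plan.
# for _ in range(3):
#     solution = textual_gradient_step(solution,
#                                      {"role": "solution", "objective": task},
#                                      call_llm)
```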
[3] Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges - A. Singh et al. (UMass) - 18 June 2024
→ A comprehensive evaluation of the LLM-as-a-judge paradigm for assessing the quality of other LLM outputs. Provides an investigation of the strengths, weaknesses, alignment, and potential biases in using LLMs in this role.
🤔 Why? The growing reliance on LLMs demands scalable and reliable methods for their evaluation. Using LLMs as judges can address the scalability challenges associated with human evaluation.
💡 Key Findings:
Cohen’s kappa provides better insights into alignment compared to percentage agreement.
Some models show leniency bias, marking near-correct responses as correct more frequently.
Judge models exhibit varied abilities in distinguishing correct from incorrect answers, often struggling with under-specified answers.
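The first point is worth a tiny worked example: on imbalanced data, percentage agreement can look high even when a judge adds no information, whereas Cohen's kappa corrects for chance agreement. A small illustration with scikit-learn (toy labels, not data from the paper):

```python
from sklearn.metrics import cohen_kappa_score

# Toy verdicts: 1 = "correct", 0 = "incorrect" (illustrative only).
human = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
judge = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # a lenient judge that calls everything correct

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)

print(f"percentage agreement: {agreement:.2f}")  # 0.80 -- looks decent
print(f"Cohen's kappa:        {kappa:.2f}")      # 0.00 -- no better than chance
```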
[4] Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? - J. Lee et al. (Google DeepMind) - 18 June 2024
→ LOFT is a benchmark to evaluate the capabilities of Long-Context Language Models (LCLMs) in tasks traditionally reliant on specialized retrieval systems.
🤔 Why? By leveraging the capabilities of LCLMs to process and reason over long contexts, we could significantly simplify task pipelines, reduce the need for task-specific models, and enhance user-friendliness.
💡 Key Findings: LCLMs perform comparably to, and in some cases surpass, specialized models on tasks such as text retrieval, Retrieval-Augmented Generation (RAG), SQL processing, and many-shot in-context learning.
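The setup boils down to handing the model the corpus itself instead of a retrieval index. A minimal sketch of that kind of "corpus-in-context" prompting (the `call_llm` helper is hypothetical, and this is not the benchmark's exact prompt format):

```python
def corpus_in_context_answer(question, corpus, call_llm):
    """Answer a question by placing the whole corpus in the prompt, letting the
    long-context model do retrieval and reasoning in a single pass."""
    docs = "\n".join(f"[{doc_id}] {text}" for doc_id, text in corpus.items())
    prompt = (
        "You are given a corpus of documents, each with an ID.\n"
        f"{docs}\n\n"
        f"Question: {question}\n"
        "First cite the IDs of the relevant documents, then answer."
    )
    return call_llm(prompt)  # only practical once context windows reach ~1M tokens
```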
[5] Mixture-of-Agents Enhances Large Language Model Capabilities - J. Wang et al. (Together AI) - 7 June 2024
→ A Mixture-of-Agents (MoA) architecture combines multiple LLMs in a layered setup: the agents in each layer take the responses generated by the previous layer as auxiliary context and iteratively refine them.
🤔 Why? MoA addresses the challenge of leveraging the collaborative potential among LLMs, leading to more capable and efficient AI systems.
💡 Key Findings:
Using open-source models, MoA achieved a score of 65.1% on AlpacaEval 2.0 compared to 57.5% by GPT-4o.
Improved performance across dimensions such as correctness, efficiency, and factual accuracy.
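A minimal sketch of the layered idea (not Together AI's implementation; `call_llm(model, prompt)` is a hypothetical helper): each layer's proposers see the previous layer's answers as additional context, and a final aggregator synthesizes the last set of responses.

```python
def mixture_of_agents(prompt, proposers, aggregator, call_llm, num_layers=3):
    """Layered MoA: proposers refine answers layer by layer, an aggregator merges them."""
    previous = []
    for _ in range(num_layers):
        current = []
        for model in proposers:
            if previous:
                context = "\n\n".join(f"Response {i+1}: {r}"
                                      for i, r in enumerate(previous))
                layer_prompt = (f"{prompt}\n\nPrevious responses from other models:\n"
                                f"{context}\nUse them as additional context to "
                                "produce a better answer.")
            else:
                layer_prompt = prompt
            current.append(call_llm(model, layer_prompt))
        previous = current
    merged = "\n\n".join(f"Response {i+1}: {r}" for i, r in enumerate(previous))
    return call_llm(aggregator,
                    f"{prompt}\n\nCandidate responses:\n{merged}\n"
                    "Synthesize them into a single, high-quality answer.")
```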
[6] Language Modeling with Editable External Knowledge - B. Z. Li et al. (MIT, CMU) - 17 June 2024
→ ERASE (Enhancing Retrieval Augmentation with Self-consistent Editing) improves RAG at indexing time, by incrementally updating the knowledge base.
🤔 Why? How do we build language models that can be easily updated to reflect changes? Past work has focused on improved retrieval, but not on ensuring accuracy and consistency of the document collection itself.
💡 Key Findings:
On question answering from news articles or conversations, ERASE improves relative to RAG by 6–13%.
Fact editing is useful for amortizing the cost of reasoning about consistency at insertion time instead of query time.
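A rough sketch of what insertion-time editing can look like (illustrative only, not the authors' implementation; `retrieve` and `call_llm` are hypothetical helpers): when a new document arrives, the entries it contradicts or outdates are rewritten or dropped before anything is indexed.

```python
def insert_with_editing(new_doc, knowledge_base, retrieve, call_llm, top_k=5):
    """Index a new document and edit the existing entries it affects."""
    for entry_id, entry in retrieve(new_doc, knowledge_base, top_k):
        verdict = call_llm(
            f"New document:\n{new_doc}\n\nExisting entry:\n{entry}\n"
            "Is the existing entry still accurate given the new document? "
            "Answer KEEP, REWRITE (and give the rewrite), or DELETE."
        )
        if verdict.startswith("DELETE"):
            del knowledge_base[entry_id]
        elif verdict.startswith("REWRITE"):
            knowledge_base[entry_id] = verdict.removeprefix("REWRITE").strip()
    # Finally, add the new document itself to the (now consistent) knowledge base.
    knowledge_base[f"doc-{len(knowledge_base)}"] = new_doc
```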
[7] Scalable MatMul-free Language Modeling - R. Zhu et al. (UCSC) - 4 June 2024
→ This work proposes an LLM architecture that eliminates matrix multiplications by using ternary weights optimized for accumulation operations and a new token mixer architecture.
🤔 Why? Matrix multiplications dominate the computational cost in LLMs. Eliminating these operations could lead to more efficient use of hardware resources.
💡 Key Findings:
Scaling laws indicate that MatMul-free models perform comparably with standard models while being efficient in memory and power consumption.
Ternary quantization on custom FPGAs accelerates training up to 5x and reduces memory consumption by up to 61%.
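The arithmetic trick is easy to demonstrate: once weights are constrained to {-1, 0, +1} (with a scale factor), a dense layer reduces to additions and subtractions of input elements. A NumPy sketch of that equivalence (illustrative only; the paper additionally replaces the attention-style token mixer and relies on fused kernels for the actual speedups):

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Quantize weights to {-1, 0, +1} with a per-tensor scale."""
    scale = np.abs(w).mean() + eps
    return np.clip(np.round(w / scale), -1, 1), scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))          # batch of activations
w = rng.normal(size=(16, 32)) * 0.1   # dense layer weights

w_t, scale = ternarize(w)

# Standard path: a real matrix multiplication.
y_matmul = x @ (w_t * scale)

# MatMul-free path: each output is just sums/differences of selected inputs.
y_accum = np.zeros((x.shape[0], w_t.shape[1]))
for j in range(w_t.shape[1]):
    pos, neg = w_t[:, j] == 1, w_t[:, j] == -1
    y_accum[:, j] = scale * (x[:, pos].sum(axis=1) - x[:, neg].sum(axis=1))

assert np.allclose(y_matmul, y_accum)
```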
[8] Transformers meet Neural Algorithmic Reasoners - W. Bounsi et al. (Google DeepMind) - 13 June 2024
→ TransNAR combines the language understanding capabilities of a Transformer with the algorithmic reasoning robustness of a pre-trained GNN-based Neural Algorithmic Reasoner (NAR).
🤔 Why? The combination of the two architectures can handle complex algorithmic tasks and generalize better to OOD examples.
💡 Key Findings:
TransNAR significantly outperforms baseline Transformers, with large improvements in out-of-distribution settings.
TransNAR exhibits enhanced capabilities in producing the correct output shapes in the CLRS-30 benchmark, which is critical for maintaining algorithmic correctness.
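Conceptually, the hybrid is a Transformer whose token embeddings cross-attend to node embeddings produced by the pre-trained GNN reasoner. A schematic PyTorch sketch of that interface (shapes and module choices are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class CrossAttendToNAR(nn.Module):
    """Language-stream tokens attend to NAR node embeddings."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_states, nar_node_states):
        # token_states: (batch, n_tokens, d_model) from the Transformer layers
        # nar_node_states: (batch, n_nodes, d_model) from the pre-trained GNN reasoner
        attended, _ = self.cross_attn(query=token_states,
                                      key=nar_node_states,
                                      value=nar_node_states)
        return self.norm(token_states + attended)  # residual update of the text stream

tokens = torch.randn(2, 64, 256)   # textual description of the algorithmic task
nodes = torch.randn(2, 16, 256)    # graph representation processed by the NAR
fused = CrossAttendToNAR()(tokens, nodes)
```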
[9] OpenVLA: An Open-Source Vision-Language-Action Model - M. J. Kim et al. (Stanford, UC Berkeley) - 13 June 2024
→ OpenVLA is a 7B Vision-Language-Action (VLA) model, trained on 970k episodes from the Open X-Embodiment dataset. It is designed to control multiple types of robots out of the box and can be quickly fine-tuned to new robot domains using parameter-efficient methods.
🤔 Why? Fine-tuning VLAs for new tasks is crucial for scaling their deployment in robotics applications. Most existing VLAs are proprietary and not accessible to the broader community.
💡 Key Findings:
OpenVLA outperforms previous models and is robust even in scenarios involving unseen objects and complex instructions.
It demonstrates effective fine-tuning and inference capabilities on consumer-grade GPUs, achieving good performance with quantized models.
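A hedged sketch of what that parameter-efficient adaptation could look like, assuming the Hugging Face checkpoint `openvla/openvla-7b` and a LoRA setup via `peft` (target module names and hyperparameters are illustrative choices, and the data pipeline for image/instruction/action-token batches is omitted):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

# Load the released checkpoint (custom modeling code ships with the model repo).
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Attach low-rank adapters instead of updating all 7B parameters.
lora_cfg = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()  # only a small fraction of weights receive gradients

# A training loop over (image, instruction, action-token) batches from the new
# robot domain would go here; the base model stays frozen.
```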
[10] Simulating 500 million years of evolution with a language model - T. Hayes et al. (EvolutionaryScale) - 2 July 2024
→ ESM3 is a generative masked language model (a bidirectional transformer) that reasons jointly over the sequence, structure, and function of proteins.
🤔 Why? LMs trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins.
💡 Key Findings:
ESM3 can follow complex prompts combining modalities and is highly responsive to biological alignment.
When prompted with a chain of thought to generate fluorescent proteins, it produced a bright fluorescent protein whose sequence is 42% different from known fluorescent proteins.
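The generation procedure behind this is iterative unmasking: start from a fully or partially masked protein and, at each step, let the bidirectional model fill in a fraction of the remaining masked positions. A toy sketch of that decoding loop for a single track (the `model` object is hypothetical, not the ESM3 API; the actual sampler picks positions by model confidence rather than at random):

```python
import random

def iterative_unmask(tokens, model, num_steps=8, mask="<mask>"):
    """Decode a masked sequence by revealing a share of masked positions per step."""
    tokens = list(tokens)
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == mask]
        if not masked:
            break
        # Reveal roughly an equal share of the remaining masks at each step.
        budget = max(1, len(masked) // (num_steps - step))
        for i in random.sample(masked, budget):
            # model.predict conditions bidirectionally on everything already revealed.
            tokens[i] = model.predict(tokens, position=i)
    return "".join(tokens)
```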
And a few runners-up:
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models - L. Yang et al. (PKU) - 6 June 2024
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale - G. Penedo et al. (Hugging Face) - 25 June 2024
From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data - Z. Xiong et al. (UW–Madison) - 27 June 2024
Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework - Z. Rackauckas et al. (Infineon, Zeta Alpha) - 20 June 2024
Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track - R. Pradeep et al. (UWaterloo) - 24 June 2024
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B - D. Zhang et al. (Shanghai AI Lab) - 11 June 2024
You can find an annotated collection of these papers (+ more that didn't make the cut) in Zeta Alpha, allowing you to easily discover relevant literature and dive deeper into any topic that interests you.
Watch the full-length recording of our July 2024 Trends in AI show below, and sign up to join us live for the next edition. Until then, enjoy discovery!