Sky Seminar: Beidi Chen (CMU) – MagicPIG & GSM-Infinite: Rethinking the Efficiency and Capabilities of Long-Context LLMs

Speaker: Beidi Chen
Location: Soda Hall, 510
Date: March 27th, 2025 (Thursday)
Time: 12–1 pm PT

MagicPIG & GSM-Infinite: Rethinking the Efficiency and Capabilities of Long-Context LLMs

Abstract:
Large language models (LLMs) with extremely long context windows (100k–1M tokens) have gained significant attention, but their practical applicability faces challenges, particularly in efficiency and in capabilities for real-world use cases. In this talk, we will showcase how we tackle the efficiency bottleneck in autoregressive models (the KV cache) via MagicPIG, and how we decompose and evaluate long-context capabilities – reasoning and memorization (retrieval) – through GSM-Infinite.

First, we introduce MagicPIG, a heterogeneous GPU-CPU framework that uses Locality Sensitive Hashing (LSH) sampling to significantly reduce the attention computation workload while maintaining high accuracy across diverse tasks. While many dynamic sparse or TopK-based attention methods leverage the common insight that attention is sparse, MagicPIG shows that even exact TopK attention suffers severe quality degradation on certain tasks. The key insight in MagicPIG is that LSH-based sampling, with theoretical guarantees, yields more accurate attention estimates than the entire line of TopK-attention work. By storing hash tables and computing attention on the CPU, MagicPIG efficiently handles longer contexts and larger batches, improving decoding throughput by 1.76–5x across various GPUs. It achieves 54 ms decoding latency on a single RTX 4090 for the Llama-3.1-8B-Instruct model with a 100k context length.
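To make the idea concrete, here is a minimal NumPy sketch of LSH (SimHash) sampling for attention, illustrative only and not MagicPIG's actual implementation: keys whose hash codes collide with the query's code are treated as the sampled set, and attention is computed over that subset. The hyperplane count `L` and the collision threshold are arbitrary choices for this sketch, and the importance-sampling correction that gives MagicPIG its theoretical guarantees is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 64, 4096, 8  # head dim, context length, number of hash bits

# SimHash: the sign pattern of L random projections is an L-bit code.
planes = rng.standard_normal((d, L))

def simhash(x):
    """Boolean hash code of shape (..., L)."""
    return x @ planes > 0

K = rng.standard_normal((n, d)) / np.sqrt(d)  # cached keys
V = rng.standard_normal((n, d))               # cached values
q = rng.standard_normal(d)                    # current query

# Keys whose code nearly matches the query's code are "sampled";
# SimHash collision probability rises with cosine similarity, so
# high-attention keys are more likely to be selected.
matches = (simhash(q) == simhash(K)).sum(axis=1)
sampled = np.flatnonzero(matches >= L - 1)

# Softmax attention restricted to the sampled keys (no unbiasedness
# correction here; MagicPIG reweights by sampling probability).
logits = K[sampled] @ q
w = np.exp(logits - logits.max())
out_est = (w / w.sum()) @ V[sampled]

# Exact attention over all n keys, for comparison.
logits_full = K @ q
w_full = np.exp(logits_full - logits_full.max())
out_exact = (w_full / w_full.sum()) @ V
```

In the full system, the hash tables and this sampled-attention step live on the CPU, which is what frees GPU memory for longer contexts and larger batches.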

Next, we introduce GSM-Infinite. Long-context LLMs have shown strong performance on tasks like information retrieval and long-document QA. However, truly solving complex intellectual problems requires effective reasoning over long and intricate contexts, and existing benchmarks fail to provide a solid basis for evaluating reasoning complexity at scale. Inspired by GSM8K, we design GSM-Infinite, a math problem generator that creates arithmetic problems of unlimited difficulty and context length, with fine-grained control over both. By representing problems as computational graphs and adding noise through extra nodes and edges, we systematically test LLM reasoning. We observe a sigmoid decline in reasoning accuracy as complexity grows, along with a scaling limitation: exponentially increasing inference computation yields only linear performance improvements. These findings highlight the fundamental challenges of scaling LLM reasoning. GSM-Infinite serves as a scalable, controllable testbed for advancing LLM reasoning in long and complex contexts.
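The generator idea above can be sketched in a few lines. This is a hypothetical toy, not the GSM-Infinite codebase: a chain of arithmetic dependencies acts as the computational graph (its length controls reasoning depth), and distractor facts that feed nothing act as the added noise nodes (their count controls context length).

```python
import random

def make_problem(depth, noise, seed=0):
    """Toy GSM-style generator: a dependency chain of `depth` arithmetic
    steps plus `noise` irrelevant facts, shuffled together."""
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    facts = [f"x0 = {value}"]
    # The reasoning chain: each node depends on the previous one.
    for i in range(1, depth + 1):
        op, k = rng.choice(["+", "*"]), rng.randint(2, 5)
        value = value + k if op == "+" else value * k
        facts.append(f"x{i} = x{i-1} {op} {k}")
    # Noise nodes: defined but never used by the answer.
    for j in range(noise):
        facts.append(f"y{j} = {rng.randint(1, 9)} + {rng.randint(1, 9)}")
    rng.shuffle(facts)  # bury the chain among distractors
    return "\n".join(facts), f"What is x{depth}?", value

problem, question, answer = make_problem(depth=5, noise=10)
```

Scaling `depth` and `noise` independently is what lets a benchmark of this shape separate reasoning complexity from sheer context length.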

Speaker Bio:
Beidi Chen is an Assistant Professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University. She was a Visiting Research Scientist at FAIR, Meta, and before that a postdoctoral scholar at Stanford University. She received her Ph.D. from Rice University in 2020 and her B.S. from UC Berkeley in 2015. Her research focuses on developing efficient and scalable AI algorithms and systems. Her work has won a best paper runner-up at ICML 2022, a best paper award at IISA 2018, and a best paper award at USENIX LISA 2014. She was selected as a Rising Star in EECS by MIT in 2019 and by UIUC in 2021.