Hailey Schoelkopf

Hi! I’m Hailey (she/her). I am currently a Research Scientist at EleutherAI. There, I study a variety of topics across AI, ML, and LLMs, but some of my particular research interests include:

  • Rigorous, reliable evaluation of LLMs and other generative models: how do we create standards for reproducible evaluation of AI models, evaluate them on complex tasks, and build a science of capability testing?
  • The engineering that goes into distributed training and making it fast: I think many of the most important and most interesting questions about our current paradigm are engineering questions.
  • The science of scaling models up reliably: most recent progress has come from systematizing the process of transmuting compute into performance. We should understand these processes better and make our existing recipes even more predictable.

I am currently a maintainer of the LM Evaluation Harness. Some notable projects I’ve worked on include pretraining the Pythia suite of language models, and engineering for the continued pretraining of the Llemma base models for mathematics.

news

Aug 29, 2024 I was a panelist at Princeton Language and Intelligence’s Workshop on Useful and Reliable Agents, discussing our experience maintaining the LM Evaluation Harness and considerations for evaluating LM agents.
Jul 22, 2024 I gave an ICML 2024 tutorial with Lintang Sutawika on “Challenges in LM Evaluation”! For ICML attendees, the recording can be found on the ICML website and the slides are uploaded here. Thank you to all who attended!
Jun 22, 2024 I gave a talk on “Lessons Learned on Effective and Reproducible Evaluations of LLMs” at Cohere For AI’s NLP community group. Thanks for having me!
Jun 11, 2024 I gave a talk on “A Deep Dive on LM Evaluation” for Maven and Parlance Labs’ LLM Fine-Tuning Conference. Thanks to all who attended. Slides can be found here.
Jun 06, 2024 New preprint released: “Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?”

selected publications

  1. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
    Stella Biderman*, Hailey Schoelkopf*, Quentin Gregory Anthony, and 10 more authors
    In Proceedings of the 40th International Conference on Machine Learning, 23–29 Jul 2023
  2. Lessons from the Trenches on Reproducible Evaluation of Language Models
    Stella Biderman*, Hailey Schoelkopf*, Lintang Sutawika*, and 27 more authors
    23–29 Jul 2024