Hailey Schoelkopf

Hi! I’m Hailey (she/her). I am currently a Research Scientist at EleutherAI. There, I study a variety of topics across AI, ML, and LLMs, but some of my particular research interests include:

  • Rigorous, reliable evaluation of LLMs and other generative models: how do we create standards for reproducible evaluation of AI models, evaluate them on complex tasks, and build a science of capability testing?
  • The engineering that goes into distributed training and making it fast: I think many of the most important and most interesting questions about our current paradigm are engineering questions.
  • The science of scaling models up reliably: most recent progress has come from systematizing the process of transmuting compute into performance. We should understand these processes better and make our existing recipes even more predictable.

I am currently a maintainer of the LM Evaluation Harness. Some notable projects I’ve worked on include pretraining the Pythia suite of language models, and engineering for the continued pretraining of the Llemma base models for mathematics.

news

Aug 29, 2024 I was a panelist at Princeton Language and Intelligence’s Workshop on Useful and Reliable Agents, discussing our experience maintaining the LM Evaluation Harness and considerations for evaluating LM agents.
Jul 22, 2024 I gave an ICML 2024 tutorial with Lintang Sutawika on “Challenges in LM Evaluation”! For ICML attendees, the recording can be found on the ICML website and the slides are uploaded here. Thank you to all who attended!
Jun 22, 2024 I gave a talk on “Lessons Learned on Effective and Reproducible Evaluations of LLMs” at Cohere For AI’s NLP community group. Thanks for having me!
Jun 11, 2024 I gave a talk on “A Deep Dive on LM Evaluation” for Maven and Parlance Labs’ LLM Fine-Tuning Conference. Thanks to all who attended. Slides can be found here.
Jun 06, 2024 New preprint released: “Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?”

selected publications

  1. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
    Stella Biderman*, Hailey Schoelkopf*, Quentin Gregory Anthony, and 10 more authors
    In Proceedings of the 40th International Conference on Machine Learning, 23–29 Jul 2023
  2. Lessons from the Trenches on Reproducible Evaluation of Language Models
    Stella Biderman*, Hailey Schoelkopf*, Lintang Sutawika*, and 27 more authors
    23–29 Jul 2024