BARE

Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation

Instruction-tuned models are getting better at following instructions and ‘reasoning’ every day, but they’re shockingly poor at generating diverse responses. LLMs need diverse, high-quality synthetic data to train well, and we hypothesize that this lack of diversity in common generation methods hinders downstream performance.

As a motivating example, when we sampled GPT-4o mini 100 times at temperature=1, we got only 4 distinct jokes (their quality can be debated):

1. Why did the scarecrow win an award? Because he was outstanding in his field!
2. Why don't scientists trust atoms? Because they make up everything!
3. Why don’t skeletons fight each other? They don’t have the guts!
4. Why do seagulls fly over the ocean? Because if they flew over the bay, they’d be bagels!
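The diversity check above can be sketched as follows. This is a minimal illustration, not our actual harness: `sample_model` is a hypothetical stand-in for a real API call at temperature=1, wired here to cycle through canned responses so the script runs offline and mimics the mode collapse we observed.

```python
from collections import Counter

def sample_model(prompt: str, i: int) -> str:
    # Hypothetical stand-in for an LLM API call; a real run would send
    # `prompt` to the model at temperature=1 and return the completion.
    canned = ["scarecrow", "atoms", "skeletons", "seagulls"]
    return canned[i % len(canned)]

def distinct_count(prompt: str, n: int = 100) -> int:
    # Sample the same prompt n times and count distinct (normalized) outputs.
    counts = Counter(sample_model(prompt, i).strip().lower() for i in range(n))
    return len(counts)

# With the canned sampler above, 100 samples collapse to 4 distinct jokes.
print(distinct_count("Tell me a joke.", n=100))
```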

We introduce Base-Refine (BARE 🐻), a method for combining base language models and instruction-tuned language models for better synthetic data generation.

1️⃣ Generate diverse but potentially lower quality synthetic data with a base model.
2️⃣ Refine each individual data point for quality with an instruction-tuned model.
3️⃣ Fine-tune models for downstream tasks with the final dataset.
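The three steps above can be sketched as a simple pipeline. This is a hedged illustration under stated assumptions, not our exact implementation: `base_generate` and `refine` are hypothetical stand-ins for real model calls (few-shot sampling from a base model, then a per-example refinement prompt to an instruction-tuned model), stubbed out here so the sketch is self-contained.

```python
import random

def base_generate(n: int, seed: int = 0) -> list[str]:
    # Step 1 stand-in: sample n diverse (but possibly low-quality) drafts
    # from a base model via few-shot prompting.
    rng = random.Random(seed)
    topics = ["atoms", "scarecrows", "skeletons", "seagulls", "coffee"]
    return [f"draft example about {rng.choice(topics)}" for _ in range(n)]

def refine(draft: str) -> str:
    # Step 2 stand-in: an instruction-tuned model polishes one draft
    # for quality while preserving its content.
    return draft.replace("draft", "refined")

def bare(n: int) -> list[str]:
    drafts = base_generate(n)            # 1️⃣ diverse drafts from base model
    return [refine(d) for d in drafts]   # 2️⃣ refine each data point

dataset = bare(5)  # 3️⃣ fine-tune a downstream model on `dataset`
```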

Beyond generating training data, sampling diverse, high-quality responses from LLMs has both a large design space and broad applications, such as creating evaluation data and generating trajectories.


Contributors

Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Boris Hanin, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia