Distilled Datasets
Imagine a "professor" (large model) who is highly knowledgeable but "temperamental": training them requires a huge tuition fee (high training cost), inviting them to give lectures requires a luxurious classroom (high-performance hardware), and each lecture costs a fortune (high inference cost). On the other hand, the "elementary student" (small model) is well-behaved and lightweight (low deployment cost) but has limited knowledge.
Model distillation is the process of having the professor "condense" their problem-solving approach into a "cheat sheet" to teach the student.
The professor doesn't just say "choose A for this question," but provides a probability distribution (e.g., 80% for option A, 20% for option B). This "soft answer" contains their reasoning logic.
By imitating the professor's approach, the student can learn the core knowledge without incurring high costs, much like using a "problem-solving cheat sheet" to quickly grasp the key points.
Simply put: use a large model to generate a dataset, including its reasoning process, and then use that data to fine-tune a smaller model.
While large models are powerful, they face two major challenges in practical applications:
High Computational Requirements: Training a model with hundreds of billions of parameters can cost millions of dollars, making it unaffordable for most companies and individuals.
Deployment Difficulties: Large models require dozens of GBs of memory to run, which exceeds the capacity of ordinary personal devices.
Core Value of Distillation: While individuals and small businesses may not have the resources to deploy large-parameter models, they can distill smaller models for specific domains from large models. This significantly reduces deployment costs while maintaining performance in the target domain.
DeepSeek's series of open-source distilled models:
The paper "s1: Simple test-time scaling" by Fei-Fei Li's team mentioned that for just $50, they trained a model comparable to ChatGPT o1 and DeepSeek R1. This was achieved by fine-tuning the open-source model Qwen2.5-32B from Tongyi, using a dataset partially distilled from Google Gemini 2.0 Flash Thinking.
The creation of this model involved first using knowledge distillation to obtain reasoning trajectories and answers from the Gemini API, which helped filter out 1,000 high-quality data samples. This dataset was then used to fine-tune the Tongyi Qwen2.5-32B model, ultimately resulting in the well-performing s1 model.
| Approach | Core idea | Typical scenario |
| --- | --- | --- |
| Distillation | A small model imitates the problem-solving approach of a large model | Lightweight deployment (mobile devices, enterprise private clouds) |
| Fine-tuning | The model is "tutored" with domain-specific data (e.g., medical data) | Vertical domain customization (e.g., legal or medical Q&A) |
| RAG | The model "cheats" by calling external knowledge bases | Enterprise document retrieval (e.g., internal training materials) |
Prepare the "Cheat Sheet" (Soft Label Generation)
The "professor" first "solves the problems": Input raw data (e.g., "this movie is great") into the large model to generate probability distributions.
Student "Practices" (Model Training)
The small model takes the same data and outputs its own predictions (e.g., "85% positive, 15% negative"), then compares them with the professor's "cheat sheet" to calculate the difference (loss function).
Through repeated parameter adjustments (backpropagation), the small model's answers gradually align with the professor's thinking.
Incorporate "Standard Answers" (Combining Soft and Hard Labels)
The small model needs to learn both the professor's approach (soft labels) and stay accurate on questions with known answers (hard labels, e.g., "a cat is a cat"). A weighting coefficient (α) controls the balance between the two and helps prevent overfitting.
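For readers who want to see how the soft/hard balance looks in code, here is a minimal sketch of the standard knowledge-distillation loss in PyTorch. It assumes you already have the teacher's and student's raw logits for a batch; the temperature `T` and the weight `alpha` are the knobs described above. This is the generic distillation recipe, not code from Easy Dataset.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, alpha=0.5, T=2.0):
    """Generic soft + hard label distillation loss (a sketch, not Easy Dataset code)."""
    # Soft-label term: the student imitates the teacher's probability distribution.
    # Temperature T > 1 softens both distributions so the "reasoning" hidden in the
    # near-miss classes is easier to learn; T*T rescales the gradient magnitude.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: the student must still get the "standard answers" right.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    # alpha balances imitating the professor vs. accuracy on the ground truth.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```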
In the model distillation process, dataset construction is crucial as it directly determines the quality of the distilled model. The following requirements must be met:
Task Scenario Coverage: The dataset should align with the true distribution of the original task (e.g., image classification, natural language processing) to ensure the features learned by both teacher and student models are meaningful.
Diversity and Balance: The data should include sufficient sample diversity (e.g., different categories, noise levels, edge cases) to prevent the distilled model from having poor generalization due to data bias.
To meet these requirements, we cannot simply prompt the model to generate domain data at random. Easy Dataset therefore takes the following approach:
First, we use the top-level topic (defaulting to the project name) to construct a multi-level domain label hierarchy, forming a complete domain tree. Then, we use the "student model" to extract questions from the leaf nodes of this domain tree. Finally, we use the "teacher model" to generate answers and reasoning processes for each question.
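Conceptually, the flow looks like the sketch below. The `TagNode` structure and the `leaf_paths` helper are illustrative only (Easy Dataset is a web application; this is not its internal code), but they show why questions are always generated from leaf tags together with their full label path.

```python
from dataclasses import dataclass, field

@dataclass
class TagNode:
    """One node of the domain label tree (illustrative, not Easy Dataset internals)."""
    name: str
    children: list["TagNode"] = field(default_factory=list)

def leaf_paths(node: TagNode, prefix: tuple[str, ...] = ()):
    """Yield the full label path (root -> ... -> leaf) for every leaf tag."""
    path = prefix + (node.name,)
    if not node.children:
        yield path
    for child in node.children:
        yield from leaf_paths(child, path)

# Example tree for the "Physical Education and Sports" project:
root = TagNode("Physical Education and Sports", [
    TagNode("Ball Games", [TagNode("Basketball"), TagNode("Football")]),
    TagNode("Track and Field"),
])
for path in leaf_paths(root):
    print(" > ".join(path))
# Physical Education and Sports > Ball Games > Basketball
# Physical Education and Sports > Ball Games > Football
# Physical Education and Sports > Track and Field
```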
Let's create a new project for Physical Education and Sports:
Then, we go to the data distillation module and click to generate top-level tags:
This generates N subtopics (tags) under the top-level topic (which defaults to the project name); the number of subtopics can be customized. After the task succeeds, a preview of the tags is shown in the dialog:
We can click "Add Sub-tag" on each subtopic to continue generating multiple levels of subtopics:
To ensure the relevance of generated subtopics, the complete tag path will be passed when generating multi-level subtopics:
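The exact prompt wording is internal to Easy Dataset, but the idea can be sketched as follows: join the ancestor tags into one path string and include it when asking for the next level, so each new sub-tag stays consistent with the whole branch. The function below is a hypothetical illustration, not the tool's real prompt.

```python
def subtag_prompt(tag_path: list[str], count: int) -> str:
    """Hypothetical prompt builder; Easy Dataset's actual prompt will differ."""
    path = " > ".join(tag_path)  # e.g. "Physical Education and Sports > Ball Games"
    return (
        f"The current domain tag path is: {path}.\n"
        f"Generate {count} more specific sub-tags under the last tag, "
        f"each consistent with the full path above."
    )
```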
After building the multi-level domain label tree, we can start extracting questions from the leaf tags:
We can choose the number of questions to generate. Additionally, the complete domain label path will be passed when extracting questions:
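Putting the previous sketches together, question extraction loops over the leaf paths and asks the model for N questions per leaf, always passing the full path. `call_llm` is a placeholder for whichever chat-completion client you use; it is not an Easy Dataset API.

```python
def generate_questions(leaf_path: tuple[str, ...], n: int, call_llm) -> list[str]:
    """Ask the model for n questions scoped to one leaf tag (illustrative sketch)."""
    prompt = (
        f"Domain label path: {' > '.join(leaf_path)}.\n"
        f"Write {n} diverse questions that clearly belong to this leaf topic, "
        f"one per line."
    )
    reply = call_llm(prompt)  # placeholder LLM call returning plain text
    return [q.strip() for q in reply.splitlines() if q.strip()]

# e.g. questions = {p: generate_questions(p, 10, call_llm) for p in leaf_paths(root)}
```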
After generation is complete, we can preview the questions:
We can see the generated questions from the leaf nodes of the domain tree:
Then, we can click to generate answers for each question:
We can also go to the question management module to batch generate answers for the generated questions (distilled questions will be displayed as "Distilled Content" by default since they are not associated with text chunks):
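Under the hood, answer generation is another round of prompting, this time against the configured "teacher" model, asking it to show its reasoning before the final answer and storing the result alongside the question. The sketch below uses a placeholder `call_teacher` function and an illustrative record layout; it is not Easy Dataset's actual schema.

```python
def distill_answer(question: str, call_teacher) -> dict:
    """Ask the teacher model for reasoning plus a final answer (illustrative sketch)."""
    prompt = (
        "Answer the question below. Think through it step by step first, "
        "then state the final answer.\n\n"
        f"Question: {question}"
    )
    completion = call_teacher(prompt)  # placeholder call to the teacher model
    return {
        "question": question,
        "answer": completion,   # reasoning process + final answer
        "chunk_id": None,       # no source text chunk -> shown as "Distilled Content"
    }
```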
If you don't need fine-grained control over each step mentioned above, you can choose fully automatic dataset distillation:
In the configuration box, we can see the following options (summarized as a configuration sketch after the list):
Distillation topic (defaults to project name)
Number of levels for the domain tree (default is 2)
Number of tags to generate per level (default is 10)
Number of questions to generate per sub-tag (default is 10)
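As a rough mental model, these options amount to a small configuration object like the sketch below. The key names are made up for illustration (they are not Easy Dataset's real schema), but the values mirror the defaults listed above.

```python
auto_distill_config = {
    "topic": "Physical Education and Sports",  # defaults to the project name
    "tree_levels": 2,          # depth of the domain label tree
    "tags_per_level": 10,      # tags generated at each level
    "questions_per_tag": 10,   # questions generated per leaf tag
}
# At these defaults the run produces roughly 10 x 10 = 100 leaf tags
# and about 100 x 10 = 1,000 distilled questions.
```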
After the task starts, we can see detailed progress including the specific progress of building tags, questions, and answers: