Distil Labs Platform - Self-Onboarding
Contents
- Overview and objectives
- Installation
- Create a model and select a task type
- Prepare and upload data
- Run teacher evaluation
- Train and deploy model
1. Overview and objectives
The aim of this tutorial is to familiarise you with the distil labs platform so that you can train your own models. We will walk through the capabilities of the platform and take a closer look at the data preparation required to start training. This should set you up with the basics and empower you to train many more models.
The key learning objectives are:
- Understand the distil labs platform
- Learn how to prepare input data
- Learn how to iterate on your models
2. Installation
Run the following command to install the distil labs CLI:
curl -fsSL https://cli-assets.distillabs.ai/install.sh | sh
To ensure everything is installed correctly, run the command:
distil
You should see a list of the available commands.
If you haven’t already registered, you can do so using the command:
distil register
If you already have an account, you can enter your username and password when running the command:
distil login
To verify you have successfully logged in, run:
distil whoami
3. Create a model and select a task type
Create a model
Run the following command to create a new model:
You should see the following output in your terminal:
Select a task type
The distil labs platform enables you to finetune task-specific small language models. This means the models are specialised for a single, well-defined task, in contrast to the general-purpose large language models you may have used previously. Our platform supports several task types, such as classification, question-answering, and tool-calling. You can find a complete list here.
In this tutorial, we will train a model for a question-answering problem. Specifically, we will train a model to redact personal information from snippets of text (data can be found here). The following is an example input:
And the corresponding redacted output is:
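The original snippets from the dataset are not reproduced here; the following pair is a hypothetical stand-in (invented text, and the placeholder style is an assumption) to illustrate the shape of the task:

```text
Input:
Hi, I'm Joan Smith. You can reach me at joan.smith@example.com or on 555-0142.

Output:
Hi, I'm [NAME]. You can reach me at [EMAIL] or on [PHONE].
```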
From the docs, a question-answering task is one where “the model extracts or generates precise answers from text based on specific queries”. This fits our use-case best.
4. Prepare and upload data
To get started, you need to prepare the following:
- Train dataset: A small dataset of ~50 examples with good coverage of the inputs we can expect.
- Test dataset: This should cover as much of the distribution as possible so that the model is thoroughly evaluated.
- Job description: This contains three different components:
- Task description: A descriptive explanation of which task you are trying to solve.
- Input description: A description about what the inputs consist of.
- LLM-as-a-judge instructions: To evaluate whether the model’s predictions are correct, we use an LLM-as-a-judge approach and here you can provide extra evaluation criteria that must be met to determine a correct answer.
- Config: You can use the config to change the default parameters that control the distillation/training pipeline, such as the student model you want to finetune, the teacher model you want to distil from, the learning rate, and the amount of synthetic training data to generate. In most cases, you can keep the sensible defaults we have chosen.
Job description
We provide the job description as a JSON file and the general form of this file is:
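A sketch of the general shape, built from the three field names described below (the exact schema should be confirmed against the distil labs docs):

```json
{
  "task_description": "A descriptive explanation of the task you are trying to solve.",
  "input_description": "A description of what the inputs consist of.",
  "llm_as_a_judge_instructions": "Extra criteria the judge should apply when marking an answer correct."
}
```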
For this use-case, the task_description is:
Note how comprehensive the description is. The more useful detail and context we provide here, the better.
The input_description is:
This provides details on the kinds of inputs we can expect.
The llm_as_a_judge_instructions is:
For this problem, there are specific criteria that should be met before an answer is marked as correct. This is the place to define them.
Config
A simple config for this problem is:
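As a hedged sketch, such a config might look like the following. The parameter names are taken from the discussion below; the values and exact layout are assumptions, so check the config docs for the real schema:

```yaml
# Parameter names from the surrounding text; values are illustrative.
task: question-answering
student_model_name: <small language model to finetune>
teacher_model_name: <LLM to distil from>
output_is_json: true
```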
The task parameter simply describes the task type of our problem, whilst student_model_name defines which small language model we want to finetune. The teacher_model_name defines the LLM that we want to distil from. In practice, this means we generate synthetic data using that teacher model, which in turn is used to finetune the small language model. Given that the outputs in this dataset are JSON objects, we set output_is_json to True, which enforces JSON schema validation during generation.
You can find details on what you can change here: https://docs.distillabs.ai/how-to/data-preparation/config.
Train/test datasets
We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. The bigger and more diverse the datasets, the better, but for training we need only a few dozen examples (we’ll generate many more based on the examples you provide).
The expected format of both these datasets is JSONL or CSV.
JSONL:
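A hypothetical row (the field names here are assumptions, not confirmed against the docs):

```jsonl
{"input": "Hi, I'm Joan Smith.", "output": "Hi, I'm [NAME]."}
```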
CSV:
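The same hypothetical row in CSV form (again, the column names are assumptions):

```csv
input,output
"Hi, I'm Joan Smith.","Hi, I'm [NAME]."
```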
Practical considerations when preparing train/test sets:
It is important to keep your train and test sets balanced, or downstream model performance will suffer. For example, if our train dataset contains only simple examples while our test set contains more complex ones, the model will be optimised for the simple examples and will struggle with the complex ones.
There are multiple dimensions beyond complexity that we should consider when preparing the data. Below are some examples to help you understand the kinds of dimensions that could be important when trying to maintain this balance between the train/test set:
- sentence length
- vocabulary diversity
- formal vs informal
- subject matter
- dialects/regionalisms
Unstructured data (optional)
The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific synthetic training data. For question answering, provide realistic samples that represent the types of inputs your model will encounter.
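A hypothetical line of unstructured.jsonl (the field name is an assumption; for this task each line would be a raw, unredacted snippet):

```jsonl
{"text": "Reach Joan Smith at joan.smith@example.com for the quarterly report."}
```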
Upload data
To upload data, run the following command:
(Note that this is the directory that contains: train.jsonl, test.jsonl, config.yaml, job_description.json, unstructured.jsonl (optional))
5. Run teacher evaluation
Before we run the full training pipeline, we run the teacher evaluation. The purpose of this step is to check how well a large model can solve your task. We train small models using knowledge distillation: extracting knowledge from a larger model and distilling it into a smaller one. If a large model cannot solve a particular problem, it is unlikely that a small model will learn anything useful from it.
As such, we run this evaluation in an iterative fashion where we can tweak our inputs (mostly the job description and config) until the teacher model can solve our task to our satisfaction. This process typically takes several minutes and so allows us to iterate reasonably fast.
To run the teacher evaluation we use the command:
distil model run-teacher-evaluation <MODEL_ID>
Note that this will run the teacher evaluation on the latest data that you upload. If you find the results are not satisfactory, you can go back to the upload data step and upload new data. If you then run the teacher evaluation step again, it will use the latest data you have uploaded.
This will take a few minutes to complete, after which you can check the results using the command:
distil model teacher-evaluation <MODEL_ID>
If you are satisfied with the performance, we can proceed to running the full distillation pipeline.
If the model's performance is not good, you may want to look at the individual test examples and their outputs. This can help you understand the mistakes the teacher model is making, giving you a direction for what to change to improve the results. Alternatively, it may highlight issues in the data, where the LLM output is good but the ground-truth answers are problematic.
To get the individual outputs, run the following command:
distil model teacher-evaluation <MODEL_ID> -o json | jq ".evaluation_predictions_download_url"
This will output a URL in your terminal. You can use this URL to read the data into pandas (or another dataframe library of your choice):
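A sketch, assuming the download URL points at a JSON file of per-example predictions (the column names below are illustrative, not the platform's actual schema). To stay self-contained, the snippet parses an in-memory sample instead of fetching the URL:

```python
import io

import pandas as pd

# In practice, replace the sample below with the URL printed by the CLI:
#   df = pd.read_json("<evaluation_predictions_download_url>")
# The field names here are assumptions about the predictions schema.
sample = io.StringIO(
    '[{"input": "Hi, I\'m Joan Smith.",'
    ' "prediction": "Hi, I\'m [NAME].",'
    ' "ground_truth": "Hi, I\'m [NAME]."}]'
)
df = pd.read_json(sample)
print(sorted(df.columns))  # → ['ground_truth', 'input', 'prediction']
```

From here you can filter for rows where the prediction disagrees with the ground truth and inspect them one by one.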
6. Train and deploy model
Train model
Once you are satisfied with the teacher evaluation process, you can proceed with the full training of the model.
To run the full distillation pipeline, run:
This process can take 6+ hours to run so you can periodically check the status of the training using the command:
This should output something like:
When the training is complete, you can run the following command to get the results:
Iterating on your model
If you have already trained a model and are looking for ways to improve it, you will likely need to adjust the job description and the config (to override the defaults we have set). Writing a good job description is a careful balance: succinct but not vague, and detailed without being lengthy. As such, expect to tweak the job description as you run the teacher evaluation and analyse the outputs; these often give clues as to what is missing.
The config gives you fine-grained control over many aspects of the pipeline, although in practice only a handful of parameters need to be changed for most use-cases. Below is a list of the most useful ones:
- student_model_name: Choice of student model to train. Model sizes vary from roughly 15M to 8B parameters. Generally, the more complex the use-case, the larger the model you will need.
- teacher_model_name: Choice of teacher model, used to generate synthetic training data. Different teachers have different strengths, e.g. DeepSeek/Qwen models have been trained on 100+ languages and so are good for multilingual use-cases.
- generation_target: How much data to generate. We use a default of 10k, but your use-case may require more.
- num_positive_exemplars_per_generation: How many examples from the training data to provide to the teacher model when generating new examples.
- lora_r: Controls the number of parameters we train. A larger value means we train more parameters, adjusting a greater proportion of the base model. How much of the base model to adjust also depends on the complexity of the use-case; for simple use-cases, the default will suffice.
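As a hedged illustration, overriding a few of these in the config might look like the following. The key names are taken from the list above, but the values and flat layout are assumptions, not verified against the schema:

```yaml
# Key names from the list above; values are illustrative.
student_model_name: <larger student for a complex task>
teacher_model_name: <multilingual teacher for non-English data>
generation_target: 20000   # generate more synthetic examples than the 10k default
lora_r: 32                 # train more parameters for a harder task
```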
You can find details on what you can change here: https://docs.distillabs.ai/how-to/data-preparation/config.
Deploy model
The distil labs CLI lets you deploy the model either locally (using llama.cpp) or remotely in our playground environment. Note that remote deployments are not suitable for production and we make no guarantees about their uptime.
The command to deploy a model locally is:
And for deploying a model remotely:
To destroy your remote deployment, you can use:
