Distil Labs Platform - Self-Onboarding

Contents

    1. Overview and objectives
    2. Installation
    3. Create a model and select a task type
    4. Prepare and upload data
    5. Run teacher evaluation
    6. Train and deploy model

1. Overview and objectives

This tutorial familiarises you with the distil labs platform so that you can train your own models. We will walk through the capabilities of the platform and take a deep dive into the data preparation requirements needed to start training models. This should give you the basics and empower you to train many more models.

The key learning objectives are:

  • Understand the distil labs platform
  • Learn how to prepare input data
  • Learn how to iterate on your models

2. Installation

Run the following command to install the distil labs CLI:

curl -fsSL https://cli-assets.distillabs.ai/install.sh | sh

To ensure everything is installed correctly, run the command:

distil

You should see a list of the available commands.

If you haven’t already registered, you can do so using the command:

distil register

If you already have an account, you can enter your username and password when running the command:

distil login

To verify you have successfully logged in, run:

distil whoami

3. Create a model and select a task type

Create a model

Run the following command to create a new model:

distil model create my-first-model

You should see the following output in your terminal:

ID: b1def4d6-b100-439a-cf3d-9bd4150192da
Name: my-first-model
Created At: 2026-01-12T08:43:27.026140Z
Training Status: ○ Not started
Training: None
Upload IDs (latest first): None
Teacher Evaluation IDs (latest first): None

Select a task type

The distil labs platform enables you to finetune task-specific small language models. This means that the models are specialized for a single, well-defined task. Note that this contrasts with the large language models you may have used previously, which are general-purpose models. Our platform supports several task types, such as classification, question-answering, and tool-calling. You can find a complete list here.

In this tutorial, we will train a model for a question-answering problem. Specifically, we will train a model to redact personal information from snippets of text (data can be found here). The following is an example input:

John's phone is +1 (415) 555-0199 and email is john@acme.co.

And the corresponding redacted output is:

[PERSON]'s phone is [PHONE] and email is [EMAIL].

The docs describe question-answering as a task where "the model extracts or generates precise answers from text based on specific queries". This fits our use-case best.
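
To make the input/output mapping concrete, here is a naive regex baseline for two of the entity types. This is a toy illustration only, not the trained model: entity types like PERSON, and the messier real-world inputs described later, are exactly why we train a model instead of writing rules.

```python
import re

# Toy baseline: regex substitution for EMAIL and PHONE only.
# This is NOT the trained model; PERSON, ADDRESS, obfuscated emails,
# and regional ID formats need more than regular expressions.
PATTERNS = [
    (r"\b[\w.+-]+@[\w-]+\.\w+\b", "[EMAIL]"),   # simple email addresses
    (r"\+?\d[\d\s().-]{7,}\d", "[PHONE]"),      # phone numbers in loose formats
]

def naive_redact(text: str) -> str:
    for pattern, token in PATTERNS:
        text = re.sub(pattern, token, text)
    return text

print(naive_redact("John's phone is +1 (415) 555-0199 and email is john@acme.co."))
# → John's phone is [PHONE] and email is john -> [EMAIL]. is left as-is for PERSON
```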

4. Prepare and upload data

To get started, you need to prepare the following:

  • Train dataset: A small dataset of ~50 examples with a good coverage of the inputs we can expect.
  • Test dataset: This should cover as much of the distribution as possible so that the model is thoroughly evaluated.
  • Job description: This contains three different components:
    • Task description: A descriptive explanation of which task you are trying to solve.
    • Input description: A description about what the inputs consist of.
    • LLM-as-a-judge instructions: To evaluate whether the model’s predictions are correct, we use an LLM-as-a-judge approach and here you can provide extra evaluation criteria that must be met to determine a correct answer.
  • Config: You can use the config to change the default parameters that control the distillation/training pipeline. You can change parameters such as the student model you want to finetune, the teacher model you want to distil from, the learning rate of the training, the amount of synthetic training data to generate, etc. In most cases, you can keep the sensible defaults we have chosen.

Job description

We provide the job description as a JSON file and the general form of this file is:

{
  "task_description": "...",
  "input_description": "...",
  "llm_as_a_judge_instructions": "..."
}
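
If you prefer to assemble the file programmatically, a minimal sketch (the field values here are abbreviated; the real descriptions used for this tutorial follow below):

```python
import json

# Sketch: write job_description.json with the three required fields.
# The values below are placeholders; use the full descriptions shown
# later in this tutorial.
job_description = {
    "task_description": "Redact sensitive personal data from text ...",
    "input_description": "UTF-8 text from any channel: email, chat, ...",
    "llm_as_a_judge_instructions": "If all checks below pass ...",
}

with open("job_description.json", "w") as f:
    json.dump(job_description, f, indent=2)
```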

For this use-case, the task_description is:

Redact sensitive personal data from text while preserving operational context. Return JSON with:
* **redacted_text**: input with PII replaced by tokens
* **entities**: array of `{value, replacement_token, reason}`
## Redact (→ token)
* PERSON — names, identifying initials → `[PERSON]`
* EMAIL — including obfuscated `(at)/(dot)` → `[EMAIL]`
* PHONE — any format → `[PHONE]`
* ADDRESS — street+number, full postal → `[ADDRESS]`
* SSN — US Social Security → `[SSN]`
* ID — national IDs (PESEL, NIN, Aadhaar, DNI) → `[ID]`
* UUID — person-scoped system IDs (MRN, patient/customer IDs) → `[UUID]`
* CREDIT_CARD — 13-19 digits → `[CARD_LAST4:####]`
* IBAN — bank accounts → `[IBAN_LAST4:####]`
* GENDER/AGE/RACE/MARITAL_STATUS — demographic self-ID → `[GENDER]` `[AGE_YEARS:##]` `[RACE]` `[MARITAL_STATUS]`
## Do NOT redact
Last-4 only references, operational IDs (order/ticket/invoice/tracking), company/product names, standalone cities/countries.
## Output schema
```json
{
  "redacted_text": "Hi, I'm [PERSON] and my email is [EMAIL].",
  "entities": [
    {"value": "John Smith", "replacement_token": "[PERSON]", "reason": "person name"},
    {"value": "john@example.com", "replacement_token": "[EMAIL]", "reason": "email"}
  ]
}
```

Note how comprehensive the description is. The more useful detail and context we provide here the better.

The input_description is:

UTF-8 text from any channel: email, chat, SMS, tickets, forms, call transcripts, logs, HTML/Markdown, JSON/CSV, OCR.
## Formats
Free prose, threaded dialogs, key-value pairs, structured payloads, markup with `mailto:`/`tel:` links, system logs, OCR artifacts.
## Domains
Support, healthcare, fintech, HR, e-commerce, education, legal, social.
## Variations
International name/address/phone formats, obfuscated emails, regional IDs (SSN/NIN/PESEL/Aadhaar/DNI), various IBAN/card formats, demographic statements.
## Noise
Typos, mixed case, partial masking, unicode quirks, broken lines.
## PII density
Light to heavy, overlapping entities, boundary cases.

This provides details on the kinds of inputs we can expect.

The llm_as_a_judge_instructions is:

If all checks below pass, the prediction is good. If any of them fails, the prediction is bad.
* JSON validity: the prediction can be parsed and has redacted_text (string) and entities (array).
* Entity shape: every entity has the fields value, replacement_token, and reason.
* Redacted text equality: prediction.redacted_text is identical to reference.redacted_text.
* Entity equality: the (value, replacement_token) pairs in prediction.entities equal the (value, replacement_token) pairs in reference.entities. Order and the reason strings are ignored.

For this problem, there are specific criteria that must be met before an answer is marked as correct. This is the place to define them.
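
The four checks above are deterministic enough that they can also be expressed directly in code. A sketch of the same logic (for your own local debugging, not the platform's actual judge implementation):

```python
import json

def judge(prediction_raw: str, reference: dict) -> bool:
    """Return True only if all four checks pass. A local sketch of the
    criteria above, not the platform's LLM-as-a-judge implementation."""
    # 1. JSON validity: parse and check required fields and types
    try:
        pred = json.loads(prediction_raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(pred.get("redacted_text"), str):
        return False
    if not isinstance(pred.get("entities"), list):
        return False
    # 2. Entity shape: every entity has value, replacement_token, reason
    required = {"value", "replacement_token", "reason"}
    if any(not required <= set(e) for e in pred["entities"]):
        return False
    # 3. Redacted text equality
    if pred["redacted_text"] != reference["redacted_text"]:
        return False
    # 4. Entity equality: (value, replacement_token) pairs,
    #    ignoring order and the reason strings
    def pairs(entities):
        return sorted((e["value"], e["replacement_token"]) for e in entities)
    return pairs(pred["entities"]) == pairs(reference["entities"])
```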

Config

A simple config for this problem is:

base:
  task: question-answering
  student_model_name: Qwen3-1.7B
  teacher_model_name: openai.gpt-oss-120b
synthgen:
  output_is_json: True

The task parameter simply describes the task type of our problem whilst student_model_name defines which small language model we want to finetune. The teacher_model_name defines the LLM that we want to distil from. In practice, this means we generate synthetic data using that teacher model, which in turn is used to finetune the small language model. Given that the outputs in this dataset are JSON objects, we set output_is_json to True which enforces JSON schema validation during generation.

You can find details on what you can change here: https://docs.distillabs.ai/how-to/data-preparation/config.

Train/test datasets

We need a training dataset to fine-tune the model for your specific task and a test dataset to evaluate its performance after fine-tuning. The bigger and more diverse the datasets, the better, but for the training stage we need only a few dozen examples (we'll generate many more based on the examples you provide).

The expected format of both these datasets is JSONL or CSV.

JSONL:

{"question": "question1", "answer": "answer1"}
{"question": "question2", "answer": "answer2"}
...

CSV:

question,answer
question1,answer1
question2,answer2
...
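
A short sketch of producing train.jsonl in the expected format from a list of examples:

```python
import json

# Sketch: write question/answer pairs in the JSONL format shown above.
# The example rows here are illustrative, not from the tutorial dataset.
examples = [
    {"question": "John's phone is +1 (415) 555-0199.",
     "answer": "[PERSON]'s phone is [PHONE]."},
    {"question": "Mail ann@example.org today.",
     "answer": "Mail [EMAIL] today."},
]

with open("train.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")
```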

Practical considerations when preparing train/test sets:

It is important to ensure there is a balance between your train/test sets or else downstream model performance will be impacted. For example, if our train dataset contains only simple examples and our test set contains more complex examples, then our model will be optimised for solving the simple examples and will struggle with the complex examples.

There are multiple dimensions beyond complexity that we should consider when preparing the data. Below are some examples to help you understand the kinds of dimensions that could be important when trying to maintain this balance between the train/test set:

  • sentence length
  • vocabulary diversity
  • formal vs informal
  • subject matter
  • dialects/regionalisms
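
One lightweight way to sanity-check balance along the first two dimensions is to compute crude per-split statistics and compare them across train and test. A sketch (a local helper, not a platform feature):

```python
def balance_stats(questions):
    """Crude per-split statistics: average whitespace-token length and
    vocabulary diversity (unique tokens / total tokens)."""
    tokens = [t.lower() for q in questions for t in q.split()]
    avg_len = sum(len(q.split()) for q in questions) / len(questions)
    diversity = len(set(tokens)) / len(tokens)
    return avg_len, diversity

# Compare the two splits; if the numbers diverge a lot, revisit the split.
train_stats = balance_stats(["Call me at +1 555 0199", "Email ann@example.org now"])
test_stats = balance_stats(["Ring +44 20 7946 0958 please", "Write to bob@example.com soon"])
```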

Unstructured data (optional)

The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific synthetic training data. For question answering, provide realistic samples that represent the types of inputs your model will encounter.

Upload data

To upload data, run the following command:

distil model upload-data <MODEL_ID> --data /Users/john/Documents/data

(Note that this is the directory that contains: train.jsonl, test.jsonl, config.yaml, job_description.json, unstructured.jsonl (optional))
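
A quick sketch to confirm the directory contains the expected files before uploading (a hypothetical local helper, not part of the CLI):

```python
from pathlib import Path

# File names as listed above; unstructured.jsonl is optional.
REQUIRED = {"train.jsonl", "test.jsonl", "config.yaml", "job_description.json"}

def missing_upload_files(path: str) -> set:
    """Return the set of required files missing from the upload directory."""
    present = {p.name for p in Path(path).iterdir()}
    return REQUIRED - present
```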

5. Run teacher evaluation

Before we run the full training pipeline, we run the teacher evaluation. The purpose of this step is to check how well a large model can solve your task. We train small models using knowledge distillation - this means extracting knowledge from larger models and distilling it into smaller ones. If a large model cannot solve a particular problem, it is unlikely the small model will extract anything useful.

As such, we run this evaluation in an iterative fashion where we can tweak our inputs (mostly the job description and config) until the teacher model can solve our task to our satisfaction. This process typically takes several minutes and so allows us to iterate reasonably fast.

To run the teacher evaluation we use the command:

distil model run-teacher-evaluation <MODEL_ID>

Note that this will run the teacher evaluation on the latest data that you upload. If you find the results are not satisfactory, you can go back to the upload data step and upload new data. If you then run the teacher evaluation step again, it will use the latest data you have uploaded.

This will take a few minutes to complete, after which you can check the results using the command:

distil model teacher-evaluation <MODEL_ID>

If you are satisfied with the performance, we can proceed to running the full distillation pipeline.

If the performance of the model is not good, you may want to look at the individual test examples and inspect the outputs. This can help you understand the mistakes the teacher model is making, giving you a direction on what to change to improve the results. Alternatively, it may highlight issues in the data, where the LLM output is good but the ground-truth answers are problematic.

To get the individual outputs, run the following command:

distil model teacher-evaluation <MODEL_ID> -o json | jq ".evaluation_predictions_download_url"

This will output a URL in your terminal. You can use this URL to read data into pandas (or other dataframe library of your choice) using:

import pandas as pd

df = pd.read_json("<URL>", lines=True)
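
If you prefer to stay in the standard library, the downloaded JSONL can be scanned directly. Note that the field name used below (`correct`) is an assumption for illustration; inspect the actual columns in your downloaded file:

```python
import json

def failed_rows(jsonl_text: str, verdict_field: str = "correct"):
    """Collect rows the judge marked incorrect from JSONL content.
    The verdict field name is an assumption; check your file's schema."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return [r for r in rows if not r.get(verdict_field, True)]
```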

6. Train and deploy model

Train model

Once you are satisfied with the teacher evaluation process, you can proceed with the full training of the model.

To run the full distillation pipeline, run:

distil model run-training <MODEL_ID>

This process can take 6+ hours to run so you can periodically check the status of the training using the command:

distil model training <MODEL_ID>

This should output something like:

Training ID: b1def4d6-b100-439a-cf3d-9bd4150192da
Status: ◐ Distilling
Message:
Start Time: 1/12/2026, 9:51:24 AM
End Time: —

Tasks:
  Evaluate Teacher: ✓ Success
  Generate Synthetic Data: ✓ Success
  Finetune Student: ◐ Running

When the training is complete, you can use the following command to get the results:

distil models show <MODEL_ID>

Iterating on your model

If you have already trained a model and are looking for ways to iterate on it for improvements, you will likely need to adjust the job description and config (to override the defaults we have set). Writing a good job description requires a careful balance: succinct but not vague, detailed but not lengthy. As such, we expect you to tweak the job description as you run the teacher evaluation and analyse the outputs. Often, this gives clues as to what is missing from your job description.

The config gives you fine-grained control over many aspects of the pipeline, although in practice, for most use-cases, only a handful of parameters will need to be changed. Below is a list of the most useful ones:

  • student_model_name: Choice of student model to train. Model sizes range from roughly 15M to 8B parameters. Generally, the more complex a use-case, the larger you will need your model to be.
  • teacher_model_name: Choice of teacher model, used to generate synthetic training data. Different teachers have different strengths, e.g. Deepseek/Qwen have been trained on 100+ languages and so are good for multilingual use-cases.
  • generation_target: How much data to generate. We use a default of 10k, but maybe your use-case requires more.
  • num_positive_exemplars_per_generation: How many examples from the training data to provide the teacher model when generating new examples.
  • lora_r: Controls the number of parameters we train. A larger value means we train more parameters, adjusting a greater proportion of the base model. How much of the base model to adjust also depends on the complexity of the use-case; for simple use-cases, we don't need to adjust the base model much, so the default will suffice.

You can find details on what you can change here: https://docs.distillabs.ai/how-to/data-preparation/config.

Deploy model

The distil labs CLI offers functionality that allows you to deploy the model either locally (using llama.cpp) or remotely on our playground environment. Note that the remote deployments are not suitable for production and we do not make any guarantees on their uptime.

The command to deploy a model locally is:

distil models deploy local <model-id>

And for deploying a model remotely:

distil models deploy remote <model-id>

To destroy your remote deployment, you can use:

distil models deploy remote --deactivate <model-id>