Config file

The configuration file controls the training pipeline through four main sections: base, tuning, evaluation, and synthgen. Each section handles a specific aspect of the model training process.

File format

The configuration file supports two formats depending on how you interact with distil labs:

  • Webapp: Use JSON format (.json file)
  • API: Use YAML format (.yaml file)

Both formats are functionally equivalent—choose based on your workflow. Examples in this documentation show YAML, but the JSON equivalent is straightforward:

# YAML (API)
base:
  task: information-extraction
  student_model_name: Llama-3.2-3B-Instruct

// JSON (Webapp)
{
  "base": {
    "task": "information-extraction",
    "student_model_name": "Llama-3.2-3B-Instruct"
  }
}

Configuration structure

base:
  # General parameters (task is required)
  task: classification

tuning:
  # Fine-tuning parameters
  num_train_epochs: 32

evaluation:
  # Evaluation parameters
  num_few_shot_examples: 1

synthgen:
  # Synthetic data generation parameters
  generation_target: 10000

Base configuration

General parameters relevant to the overall task.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| task | string | required | Type of NLP task to be solved. See task types for available options. |
| student_model_name | string | Llama-3.2-1B-Instruct | Base model to use for the student model. This is the model we finetune for your use case. |
| teacher_model_name | string | Llama-3.3-70B-Instruct | Teacher model used to generate synthetic data and from which we distil knowledge. |
| random_seed | integer \| null | 123 | Random seed used across distillib for reproducible random sampling. |

Supported student models

| Model | Value |
| --- | --- |
| Llama 3.2 1B Instruct | Llama-3.2-1B-Instruct |
| Llama 3.2 3B Instruct | Llama-3.2-3B-Instruct |
| Llama 3.1 8B Instruct | Llama-3.1-8B-Instruct |
| SmolLM2 135M | SmolLM2-135M-Instruct |
| Gemma 3 270M | gemma-3-270m-it |
| Gemma 3 1B | gemma-3-1b-it |
| Qwen3 0.6B | Qwen3-0.6B |
| Qwen3 1.7B | Qwen3-1.7B |
| Qwen3 4B | Qwen3-4B-Instruct-2507 |
| Qwen3 8B | Qwen3-8B |
| IBM Granite 3.1 8B | granite-3.1-8b-instruct |
| IBM Granite 3.3 8B | granite-3.3-8b-instruct |

Supported teacher models

| Model | Value |
| --- | --- |
| DeepSeek R1 | deepseek.r1 |
| DeepSeek V3.1 | deepseek.v3.1 |
| Qwen3 235B A22B | Qwen3-235B-A22B-Instruct-2507 |
| Qwen3 480B A35B Coder | Qwen3-480B-A35B-Coder |
| Qwen2.5 VL 72B | Qwen2.5-VL-72B-Instruct |
| Llama 3.1 405B Instruct | Llama-3.1-405B-Instruct |
| Llama 3.1 8B Instruct | Llama-3.1-8B-Instruct |
| Llama 3.3 70B Instruct | Llama-3.3-70B-Instruct |
| GPT OSS 20B | openai.gpt-oss-20b |
| GPT OSS 120B | openai.gpt-oss-120b |
| GPT OSS 120B Thinking | openai.gpt-oss-120b-thinking |
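
For illustration, a base section that picks a student and a teacher from the tables above might look like the following; the task and random_seed values are placeholders, not recommendations:

base:
  task: classification                       # required; see task types
  student_model_name: Qwen3-1.7B             # any value from the student model table
  teacher_model_name: openai.gpt-oss-120b    # any value from the teacher model table
  random_seed: 123                           # the default, shown here for clarity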

Tuning configuration

Parameters controlling the finetuning of the student model.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| learning_rate | float | 5e-5 | The initial learning rate for the AdamW optimizer. |
| learning_rate_scheduler | string | linear | The scheduler type to use. Options: cosine, linear, constant. |
| weight_decay | float | 0.0 | Weight decay applied in the AdamW optimizer to all layers except bias and LayerNorm weights. |
| warmup_ratio | float | 0.05 | Ratio of total training steps used for linear warmup from 0 to learning_rate. |
| use_lora | boolean | true | Whether to use LoRA for student training. |
| lora_r | integer | 64 | LoRA attention dimension (rank). Only used if use_lora is true. |
| lora_alpha_multiplier | integer | 1 | The LoRA alpha scaling parameter is computed as lora_r * lora_alpha_multiplier. Only used if use_lora is true. |
| train_classification_as_textgen | boolean | false | Train the classification model as a text-generation model that outputs class names. Only relevant for classification tasks. |
| per_device_train_batch_size | integer | 1 | Batch size per GPU/device for training. |
| per_device_eval_batch_size | integer | 1 | Batch size per GPU/device for evaluation. |
| num_train_epochs | integer | 4 | Total number of training epochs. |
| train_eval_split | float | 0.2 | Fraction of training data used for evaluation. Must be between 0 and 1 (exclusive). |
| num_few_shot_examples_student | integer | 0 | Number of few-shot examples used during student evaluation and tuning. If above 0, at least one example per class is used for classification tasks. |
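
For example, a tuning section that keeps LoRA enabled with a smaller rank and a cosine schedule could look like this; the values mirror the full example below and are illustrative only:

tuning:
  learning_rate: 1e-4
  learning_rate_scheduler: cosine
  use_lora: true
  lora_r: 32                 # effective LoRA alpha = lora_r * lora_alpha_multiplier = 32
  lora_alpha_multiplier: 1
  num_train_epochs: 3
  train_eval_split: 0.15     # 15% of the training data is used for evaluation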

Evaluation configuration

Parameters used in teacher evaluation.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_few_shot_examples | integer | 1 | Number of few-shot examples when running teacher evaluation. If above 0, at least one example per class is used for classification tasks. |
| batch_size | integer | 4 | (Deprecated) Batch size for model evaluation. |
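
For example, to give the teacher two in-context examples during evaluation:

evaluation:
  num_few_shot_examples: 2   # for classification tasks, at least one example per class is used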

Synthetic generation configuration

Parameters for fine-grained control over synthetic data generation.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| generation_target | integer | 10000 | Target number of synthetic examples to generate. For Closed-Book QA, this is calculated as len(unstructured_data) * generation_per_unstructured_context. |
| generation_in_single_call | integer | 4 | Number of examples to generate per teacher/LLM invocation. |
| generation_iteration_size | integer | 128 | Batch size for the generate-validate cycle. |
| generation_per_unstructured_context | integer \| null | null | Number of examples to generate per unstructured context. Only used with the question-answering-closed-book task. Overrides generation_target when set. |
| num_positive_exemplars_per_generation | integer | 2 | Number of in-context examples for the class/task being generated. |
| num_negative_exemplars_per_generation | integer | 2 | Number of in-context examples for classes not being generated. Only used for classification tasks. |
| num_unlabelled_exemplars_per_generation | integer | 2 | Number of unlabelled examples provided during each teacher invocation. |
| validation_max_total_length | integer | 10000 | Maximum total length (input + output) of generated examples, in characters. |
| validation_similarity_threshold | float | 0.95 | Similarity threshold for deduplication. Generated examples whose similarity to the seed data exceeds this threshold are removed. |
| validation_max_answer_length | integer | 8192 | (Deprecated) Use validation_max_total_length instead. |
| teacher_temperature | float | 0.7 | Temperature for teacher output; controls the balance between predictability and creativity. Must be between 0.0 and 1.0. |
| teacher_max_tokens | integer \| null | null | Maximum number of tokens in the generated response. |
| match_generated_distribution_to_seed | boolean | false | Match the class distribution of generated data to the seed data. Only used for classification tasks. |
| num_distractor_context_blocks | integer | 0 | Number of distractor context blocks per example. Setting this above zero enables RAFT training. |
| output_is_json | boolean | false | Only generate synthetic data with valid JSON outputs. Only relevant for QA tasks. |
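
As a sketch, a synthgen section that lowers the generation target and tightens deduplication might look like this; the values match the full example below and are illustrative only:

synthgen:
  generation_target: 5000
  generation_in_single_call: 8            # examples produced per teacher call
  teacher_temperature: 0.6                # must be between 0.0 and 1.0
  validation_similarity_threshold: 0.9    # drop generations too similar to the seed data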

Example configuration

Minimal configuration

base:
  task: information-extraction
  student_model_name: Llama-3.2-3B-Instruct
  teacher_model_name: openai.gpt-oss-120b

Full configuration example

base:
  task: question-answering-open-book
  student_model_name: Qwen3-1.7B
  teacher_model_name: openai.gpt-oss-120b
  random_seed: 42

tuning:
  learning_rate: 1e-4
  learning_rate_scheduler: cosine
  use_lora: true
  lora_r: 32
  num_train_epochs: 3
  train_eval_split: 0.15

evaluation:
  num_few_shot_examples: 2

synthgen:
  generation_target: 5000
  generation_in_single_call: 8
  teacher_temperature: 0.6
  validation_similarity_threshold: 0.9

Model-specific notes

DeepSeek R1

When using deepseek.r1 as the teacher model, teacher_temperature must be between 0.5 and 0.7. Configurations with temperatures outside this range will raise a validation error.
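
For instance, a configuration that uses deepseek.r1 as the teacher keeps teacher_temperature inside the accepted range; the task shown here is only an example:

base:
  task: question-answering-open-book
  teacher_model_name: deepseek.r1

synthgen:
  teacher_temperature: 0.6    # must be between 0.5 and 0.7 for deepseek.r1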

GPT OSS 120B Thinking

The openai.gpt-oss-120b-thinking model uses a medium reasoning effort setting by default for enhanced chain-of-thought capabilities.