Config file
The configuration file defines all parameters for your SLM training process. This page details each available configuration option, including default values and descriptions.
Format overview
distil labs uses a YAML configuration file with the following structure:
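A minimal sketch of that layout is shown below. The base, tuning, and evaluation keys are the ones referenced elsewhere on this page; data_generation is only a placeholder name for the synthetic generation section and may differ in your schema.

```yaml
# Sketch of the top-level layout. Values are the documented defaults,
# except task, which is required and has no default.
base:               # task type, random seed, student/teacher model selection
  task: classification
  random_seed: 123
tuning:             # optimizer, LoRA, and training-loop settings
  learning_rate: 5e-5
  num_train_epochs: 32
evaluation:         # evaluation batch size and few-shot settings
  num_few_shot_examples: 1
data_generation:    # placeholder key name for the synthetic generation settings
  generation_target: 10000
```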
Base configuration
task
Default: none (required)
Options:
classification
question-answering-open-book
information-extraction
question-answering-open-book-synthetic-context
tool-calling-closed-book
Description: Type of NLP task to be solved. This setting enables task-specific behaviors in tuning and data generation.
random_seed
Default: 123
Description: Random seed used across the platform for operations like random sampling of data and dataset splits.
Setup configuration
Note: In the current schema, these fields live under base: (there is no separate setup: section).
student_model_name
Default: Llama-3.2-1B-Instruct
Options:
Llama-3.2-1B-Instruct
Llama-3.2-3B-Instruct
Llama-3.1-8B-Instruct
SmolLM2-135M-Instruct
granite-3.1-8b-instruct
granite-3.3-8b-instruct
Description: Base model to use for the student model. This is the model we finetune for your use-case.
teacher_model_name
Default: Llama-3.3-70B-Instruct
Options:
deepseek.r1
Llama-3.1-405B-Instruct
Llama-3.1-8B-Instruct
Llama-3.1-70B-Instruct
Llama-3.3-70B-Instruct
openai.gpt-oss-120b
Description: Teacher model used to generate synthetic data and from which we distill knowledge.
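Putting the base fields together, a minimal base section might look like the following; task is set to one of the documented options purely for illustration, and the remaining values are the defaults.

```yaml
base:
  task: question-answering-open-book      # required, no default
  random_seed: 123
  student_model_name: Llama-3.2-1B-Instruct
  teacher_model_name: Llama-3.3-70B-Instruct
```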
Tuning configuration
learning_rate
Default: 5e-5
Description: Initial learning rate for the AdamW optimizer.
Range: > 0 (commonly 1e-6–1e-3)
learning_rate_scheduler
Default: linear
Options: cosine, linear, constant
Description: Learning-rate schedule type.
weight_decay
Default: 0.0
Description: Weight decay for AdamW (excludes bias and LayerNorm weights).
Range: ≥ 0 (commonly 0.0–0.1)
warmup_ratio
Default: 0.05
Description: Ratio of total training steps used for a linear warmup from 0 to learning_rate.
Typical range: 0.0–0.1
use_lora
Default: true
Description: Flag to enable LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.
lora_alpha_multiplier
Default: 1
Description: Multiplier for LoRA alpha. Effective alpha is computed as lora_alpha = lora_r * lora_alpha_multiplier.
Range: positive integer
lora_r
Default: 64
Description: LoRA rank (attention dimension).
Range: positive integer (typically 4–256)
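To illustrate how the effective LoRA alpha follows from the two values above, here is a hypothetical (non-default) setting:

```yaml
tuning:
  use_lora: true
  lora_r: 64                  # LoRA rank
  lora_alpha_multiplier: 2    # effective lora_alpha = 64 * 2 = 128
```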
train_classification_as_textgen
Default: false
Description: Only relevant for classification tasks. When enabled, trains the model as text generation that emits class names as text rather than using a classification head.
per_device_train_batch_size
Default: 4
Description: Batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
Range: positive integer
per_device_eval_batch_size
Default: 4
Description: Batch size per device for evaluation (prefer this over the deprecated evaluation.batch_size).
Range: positive integer
num_train_epochs
Default: 32
Description: Total number of epochs for fine-tuning.
Range: positive integer
train_eval_split
Default: 0.2
Range: (0.0, 1.0), exclusive at both ends
Description: Fraction of the training dataset held out for evaluation and best-model selection.
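For reference, a tuning section that spells out the defaults documented above (using the tuning: key referenced later on this page):

```yaml
tuning:
  learning_rate: 5e-5
  learning_rate_scheduler: linear
  weight_decay: 0.0
  warmup_ratio: 0.05
  use_lora: true
  lora_r: 64
  lora_alpha_multiplier: 1
  train_classification_as_textgen: false
  per_device_train_batch_size: 4
  per_device_eval_batch_size: 4
  num_train_epochs: 32
  train_eval_split: 0.2
```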
Evaluation configuration
batch_size
Default: 4 (deprecated)
Description: Batch size to use when evaluating the model. Prefer tuning.per_device_eval_batch_size instead.
num_few_shot_examples
Default: 1
Description: Number of examples to provide as few-shot context when running teacher evaluation. If the number is above 0 for classification, at least one example per class is used.
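A minimal evaluation section, assuming the evaluation: key referenced above and leaving the deprecated batch_size unset:

```yaml
evaluation:
  num_few_shot_examples: 1   # eval batch size is set via tuning.per_device_eval_batch_size
```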
Synthetic generation configuration
generation_target
Default: 10000
Description: Target number of synthetic data examples to generate.
Special case: For question-answering-closed-book, this value is ignored; the effective target is computed as len(unstructured_data) * generation_per_unstructured_context.
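For example, with 20 unstructured contexts (an illustrative count) and the default generation_per_unstructured_context of 50, the effective target is 20 * 50 = 1000 synthetic examples, regardless of the generation_target value.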
generation_in_single_call
Default: 4
Description: Number of examples to generate per teacher/LLM invocation.
generation_iteration_size
Default: 128
Description: Number of examples processed in each generate-validate batch.
generation_per_unstructured_context
Default: 50
Description: Number of examples to generate per context in unstructured data.
Usage: Only for question-answering-closed-book.
num_positive_exemplars_per_generation
Default: 2
Description: Number of in-context examples for the target class/task per generation call.
num_negative_exemplars_per_generation
Default: 2
Description: Number of in-context examples from other classes per generation call. Usage: Only for classification tasks.
validation_max_answer_length
Default: 8192
Description: Maximum allowable length of a generated example/answer during validation.
validation_similarity_threshold
Default: 0.95
Range: 0.0 to 1.0 (inclusive)
Description: Similarity threshold against seed data. Generated samples with similarity above this threshold are removed to promote novelty.
teacher_temperature
Default: 1.0
Range: 0.0 to 1.0 (inclusive)
Description: Controls the balance of predictability vs. creativity in teacher/LLM outputs; lower values produce more deterministic outputs, higher values more varied ones.
teacher_max_tokens
Default: 4096
Description: Maximum number of tokens in the generated response (provider limits may apply).
match_generated_distribution_to_seed
Default: false
Description: Match generated class distribution to seed data. Usage: Only for classification tasks.
num_unlabelled_exemplars_per_generation
Default: 2
Description: Number of unlabeled examples to include as additional context in each teacher/LLM invocation.
num_distractor_context_blocks
Default: 0
Description: Number of distractor context blocks to include with every generated example (enables RAFT-style training when > 0).
output_is_json
Default: false
Description: Only relevant for QA tasks. If true, retain only synthetic data whose outputs are valid JSON.
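Putting the main generation options together, here is a sketch of the synthetic generation section. As in the overview, data_generation is a placeholder key name (check your schema for the exact key) and the values are the documented defaults:

```yaml
data_generation:                               # placeholder key name
  generation_target: 10000
  generation_in_single_call: 4
  generation_iteration_size: 128
  num_positive_exemplars_per_generation: 2
  num_negative_exemplars_per_generation: 2     # classification only
  validation_max_answer_length: 8192
  validation_similarity_threshold: 0.95
  teacher_temperature: 1.0
  teacher_max_tokens: 4096
```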