Config file

The configuration file controls the training pipeline through four main sections: base, tuning, evaluation, and synthgen. Each section handles a specific aspect of the model training process.

File format

The configuration file supports two formats depending on how you interact with distil labs:

  • Webapp: Use JSON format (.json file)
  • API: Use YAML format (.yaml file)

Both formats are functionally equivalent—choose based on your workflow. Examples in this documentation show YAML, but the JSON equivalent is straightforward:

# YAML (API)
base:
  task: information-extraction
  student_model_name: Llama-3.2-3B-Instruct

// JSON (Webapp)
{
  "base": {
    "task": "information-extraction",
    "student_model_name": "Llama-3.2-3B-Instruct"
  }
}

Configuration structure

base:
  # General parameters (task is required)
  task: classification

tuning:
  # Fine-tuning parameters
  num_train_epochs: 32

evaluation:
  # Evaluation parameters
  num_few_shot_examples: 1

synthgen:
  # Synthetic data generation parameters
  generation_target: 10000

Base configuration

General parameters relevant to the overall task.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| task | string | required | Type of NLP task to be solved. See task types for available options. |
| student_model_name | string | Llama-3.2-1B-Instruct | Base model to use for the student model. This is the model we finetune for your use case. |
| teacher_model_name | string | Llama-3.3-70B-Instruct | Teacher model used to generate synthetic data and from which we distil knowledge. |
| random_seed | integer \| null | 123 | Random seed used across distillib for reproducible random sampling. |

Supported student models

| Model | Value |
| --- | --- |
| Llama 3.2 1B Instruct | Llama-3.2-1B-Instruct |
| Llama 3.2 3B Instruct | Llama-3.2-3B-Instruct |
| Llama 3.1 8B Instruct | Llama-3.1-8B-Instruct |
| SmolLM2 135M | SmolLM2-135M-Instruct |
| Gemma 3 270M | gemma-3-270m-it |
| Gemma 3 1B | gemma-3-1b-it |
| Qwen3 0.6B | Qwen3-0.6B |
| Qwen3 1.7B | Qwen3-1.7B |
| Qwen3 4B | Qwen3-4B-Instruct-2507 |
| Qwen3 8B | Qwen3-8B |
| IBM Granite 3.1 8B | granite-3.1-8b-instruct |
| IBM Granite 3.3 8B | granite-3.3-8b-instruct |

Supported teacher models

| Model | Value |
| --- | --- |
| DeepSeek R1 | deepseek.r1 |
| DeepSeek V3.1 | deepseek.v3.1 |
| Qwen3 235B A22B | Qwen3-235B-A22B-Instruct-2507 |
| Qwen3 480B A35B Coder | Qwen3-480B-A35B-Coder |
| Qwen2.5 VL 72B | Qwen2.5-VL-72B-Instruct |
| Llama 3.1 405B Instruct | Llama-3.1-405B-Instruct |
| Llama 3.1 8B Instruct | Llama-3.1-8B-Instruct |
| Llama 3.3 70B Instruct | Llama-3.3-70B-Instruct |
| GPT OSS 20B | openai.gpt-oss-20b |
| GPT OSS 120B | openai.gpt-oss-120b |
| GPT OSS 120B Thinking | openai.gpt-oss-120b-thinking |
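
For illustration, a base section that picks a student and a teacher from the tables above might look like the following; the task and random_seed values are placeholders, not recommendations:

base:
  task: classification                       # required; see task types
  student_model_name: Qwen3-1.7B             # any value from the student model table
  teacher_model_name: openai.gpt-oss-120b    # any value from the teacher model table
  random_seed: 123                           # the default, shown here for clarity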

Tuning configuration

Parameters controlling the finetuning of the student model.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| learning_rate | float | 5e-5 | The initial learning rate for the AdamW optimizer. |
| learning_rate_scheduler | string | linear | The scheduler type to use. Options: cosine, linear, constant. |
| weight_decay | float | 0.0 | Weight decay applied in the AdamW optimizer to all layers except bias and LayerNorm weights. |
| warmup_ratio | float | 0.05 | Ratio of total training steps used for linear warmup from 0 to learning_rate. |
| use_lora | boolean | true | Whether to use LoRA for student training. |
| lora_r | integer | 64 | LoRA attention dimension (rank). Only used if use_lora is true. |
| lora_alpha_multiplier | integer | 1 | The LoRA alpha scaling parameter is computed as lora_r * lora_alpha_multiplier. Only used if use_lora is true. |
| train_classification_as_textgen | boolean | false | Train the classification model as a text-generation model that outputs class names. Only relevant for classification tasks. |
| per_device_train_batch_size | integer | 1 | Batch size per GPU/device for training. |
| per_device_eval_batch_size | integer | 1 | Batch size per GPU/device for evaluation. |
| num_train_epochs | integer | 4 | Total number of training epochs. |
| train_eval_split | float | 0.2 | Fraction of training data used for evaluation. Must be between 0 and 1 (exclusive). |
| num_few_shot_examples_student | integer | 0 | Number of few-shot examples used during student evaluation and tuning. If above 0, at least one example per class is used for classification tasks. |
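
For example, a tuning section that keeps LoRA enabled with a smaller rank and a cosine schedule could look like this; the values mirror the full example below and are illustrative only:

tuning:
  learning_rate: 1e-4
  learning_rate_scheduler: cosine
  use_lora: true
  lora_r: 32                 # effective LoRA alpha = lora_r * lora_alpha_multiplier = 32
  lora_alpha_multiplier: 1
  num_train_epochs: 3
  train_eval_split: 0.15     # 15% of the training data is used for evaluation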

Evaluation configuration

Parameters used in teacher evaluation.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_few_shot_examples | integer | 1 | Number of few-shot examples when running teacher evaluation. If above 0, at least one example per class is used for classification tasks. |
| batch_size | integer | 4 | (Deprecated) Batch size for model evaluation. |
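
For example, to give the teacher two in-context examples during evaluation:

evaluation:
  num_few_shot_examples: 2   # for classification tasks, at least one example per class is used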

Synthetic generation configuration

Parameters for fine-grained control over synthetic data generation.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| generation_target | integer | 10000 | Target number of synthetic examples to generate. For Closed-Book QA, this is calculated as len(unstructured_data) * generation_per_unstructured_context. |
| generation_in_single_call | integer | 4 | Number of examples to generate per teacher/LLM invocation. |
| generation_iteration_size | integer | 128 | Batch size for the generate-validate cycle. |
| generation_per_unstructured_context | integer \| null | null | Number of examples to generate per unstructured context. Only used with the question-answering-closed-book task. Overrides generation_target when set. |
| num_positive_exemplars_per_generation | integer | 2 | Number of in-context examples for the class/task being generated. |
| num_negative_exemplars_per_generation | integer | 2 | Number of in-context examples for classes not being generated. Only used for classification tasks. |
| num_unlabelled_exemplars_per_generation | integer | 2 | Number of unlabelled examples provided during each teacher invocation. |
| validation_max_total_length | integer | 10000 | Maximum total length (input + output) of generated examples, in characters. |
| validation_similarity_threshold | float | 0.95 | Similarity threshold for deduplication. Generated examples whose similarity to the seed data exceeds this threshold are removed. |
| validation_max_answer_length | integer | 8192 | (Deprecated) Use validation_max_total_length instead. |
| teacher_temperature | float | 0.7 | Temperature for teacher output; controls the balance between predictability and creativity. Must be between 0.0 and 1.0. |
| teacher_max_tokens | integer \| null | null | Maximum number of tokens in the generated response. |
| match_generated_distribution_to_seed | boolean | false | Match the class distribution of generated data to the seed data. Only used for classification tasks. |
| num_distractor_context_blocks | integer | 0 | Number of distractor context blocks per example. Setting this above zero enables RAFT training. |
| output_is_json | boolean | false | Only generate synthetic data with valid JSON outputs. Only relevant for QA tasks. |
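
As a sketch, a synthgen section that lowers the generation target and tightens deduplication might look like this; the values match the full example below and are illustrative only:

synthgen:
  generation_target: 5000
  generation_in_single_call: 8            # examples produced per teacher call
  teacher_temperature: 0.6                # must be between 0.0 and 1.0
  validation_similarity_threshold: 0.9    # drop generations too similar to the seed data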

Example configuration

Minimal configuration

base:
  task: information-extraction
  student_model_name: Llama-3.2-3B-Instruct
  teacher_model_name: openai.gpt-oss-120b

Full configuration example

base:
  task: question-answering-open-book
  student_model_name: Qwen3-1.7B
  teacher_model_name: openai.gpt-oss-120b
  random_seed: 42

tuning:
  learning_rate: 1e-4
  learning_rate_scheduler: cosine
  use_lora: true
  lora_r: 32
  num_train_epochs: 3
  train_eval_split: 0.15

evaluation:
  num_few_shot_examples: 2

synthgen:
  generation_target: 5000
  generation_in_single_call: 8
  teacher_temperature: 0.6
  validation_similarity_threshold: 0.9

Model-specific notes

DeepSeek R1

When using deepseek.r1 as the teacher model, teacher_temperature must be between 0.5 and 0.7. Configurations with temperatures outside this range will raise a validation error.
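
For instance, a configuration that uses deepseek.r1 as the teacher keeps teacher_temperature inside the accepted range; the task shown here is only an example:

base:
  task: question-answering-open-book
  teacher_model_name: deepseek.r1

synthgen:
  teacher_temperature: 0.6    # must be between 0.5 and 0.7 for deepseek.r1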

GPT OSS 120B Thinking

The openai.gpt-oss-120b-thinking model uses a medium reasoning effort setting by default for enhanced chain-of-thought capabilities.