Config file

The configuration file defines all parameters for your SLM training process. This page details each available configuration option, including default values and descriptions.

Format Overview

distil labs uses a YAML configuration file with the following structure:

base:
  # General parameters
  task: classification

setup:
  # Model setup parameters
  student_model_name: meta-llama/Llama-3.2-1B-Instruct

tuning:
  # Fine-tuning parameters
  num_train_epochs: 32

evaluation:
  # Evaluation parameters
  num_few_shot_examples: 1

synthgen:
  # Synthetic data generation parameters
  data_generation_strategy: classification-one-class-context

Base Configuration

task

Default: classification

Options:

  • classification
  • contextual-classification
  • question-answering-open-book
  • information-extraction

Description: Type of NLP task to be solved.

random_seed

Default: 123

Description: Random seed used across the platform for operations like random sampling of data.

debug

Default: false

Description: Flag to enable debug mode. If set to true, synthetic data generation ends after one iteration.
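Taken together, an explicit base block might look like the sketch below; the values simply restate the documented defaults for illustration.

base:
  task: classification   # type of NLP task
  random_seed: 123       # seed for random sampling
  debug: false           # when true, synthetic generation stops after one iteration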

Setup Configuration

student_model_name

Default: meta-llama/Llama-3.2-1B-Instruct

Options:

  • meta-llama/Llama-3.2-1B-Instruct
  • meta-llama/Llama-3.2-3B-Instruct
  • meta-llama/Llama-3.1-8B-Instruct
  • HuggingFaceTB/SmolLM2-135M-Instruct

Description: Base model to use as the student. This is the model we fine-tune for your use case.

teacher_model_name

Default: us.meta.llama3-3-70b-instruct-v1:0

Options:

  • meta.llama3-1-405b-instruct-v1:0
  • meta.llama3-8b-instruct-v1:0
  • meta.llama3-1-70b-instruct-v1:0
  • us.meta.llama3-3-70b-instruct-v1:0

Description: Teacher model used to generate synthetic data and from which we distill knowledge.
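Combining the two options, a setup block could look like this sketch, pairing a small student with a large teacher as described above; the specific model choices are illustrative, not recommendations.

setup:
  student_model_name: meta-llama/Llama-3.2-3B-Instruct     # model that is fine-tuned
  teacher_model_name: us.meta.llama3-3-70b-instruct-v1:0   # model that generates synthetic data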

Tuning Configuration

use_lora

Default: true

Description: Flag to control whether to use LoRA (Low-Rank Adaptation) for student training, a parameter-efficient fine-tuning method.

train_classification_as_textgen

Default: false

Description: Only relevant for classification tasks. When enabled, trains the classification model to generate class names as text instead of using a classification head.

per_device_train_batch_size

Default: 4

Description: The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.

per_device_eval_batch_size

Default: 4

Description: The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.

num_train_epochs

Default: 128

Description: Total number of training epochs to perform during fine-tuning.
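As an illustration, a tuning block that keeps LoRA enabled but trains for fewer epochs with smaller batches might look like the sketch below; whether these values suit your hardware and dataset is an assumption to verify.

tuning:
  use_lora: true                           # parameter-efficient fine-tuning
  train_classification_as_textgen: false
  per_device_train_batch_size: 2
  per_device_eval_batch_size: 2
  num_train_epochs: 32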

Evaluation Configuration

batch_size

Default: 4

Description: Batch size to use when evaluating the model. (Deprecated)

num_few_shot_examples

Default: 1

Description: Number of examples to provide as few-shot examples when running teacher evaluation. For classification tasks with values above 0, at least one example per class is used.
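For example, an evaluation block that gives the teacher three few-shot examples could look like this sketch; the value 3 is illustrative only.

evaluation:
  num_few_shot_examples: 3   # at least one example per class for classification tasks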

Synthetic Generation Configuration

generation_target

Default: 5000

Description: Target number of synthetic data examples to generate.

generation_in_single_call

Default: 4

Description: Number of examples to generate per teacher/LLM invocation.

generation_iteration_size

Default: 128

Description: Number of examples to generate in a single batch. The generate-validate process runs in batches of this size.
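These three options interact: the platform works toward generation_target in generate-validate batches of generation_iteration_size, and each batch is produced with teacher calls of generation_in_single_call examples each. Assuming every generated example passes validation, the defaults imply roughly 5000 / 128 ≈ 40 batches and 128 / 4 = 32 teacher invocations per batch, as annotated in the sketch below.

synthgen:
  generation_target: 5000          # total synthetic examples requested
  generation_iteration_size: 128   # ≈ 40 generate-validate batches to reach the target
  generation_in_single_call: 4     # 128 / 4 = 32 teacher invocations per batch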

num_positive_exemplars_per_generation

Default: 2

Description: Number of in-context examples from the target class (or task) provided to the teacher during each generation call.

num_negative_exemplars_per_generation

Default: 2

Description: Number of in-context examples drawn from classes other than the one being generated. Only used for classification tasks.
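For a classification task, the two exemplar counts control what the teacher sees in context on each call; restating the defaults as a sketch:

synthgen:
  num_positive_exemplars_per_generation: 2   # seed examples from the class being generated
  num_negative_exemplars_per_generation: 2   # seed examples from other classes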

validation_max_answer_length

Default: 8192

Description: Maximum allowable length of generated examples.

validation_similarity_threshold

Default: 0.95

Range: 0.0 to 1.0

Description: Similarity threshold between generated examples and the seed data. Generated examples with similarity above this threshold are removed.

teacher_temperature

Default: 1.0

Range: 0.0 to 1.0

Description: Sampling temperature of the teacher/LLM. Lower values make the output more predictable, higher values more creative.

teacher_max_tokens

Default: 4096

Description: Maximum number of tokens in the generated response.
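A sketch that tightens validation and makes the teacher output more deterministic might look like the following; all of these values are illustrative rather than recommended.

synthgen:
  validation_max_answer_length: 4096     # drop overly long generated examples
  validation_similarity_threshold: 0.9   # drop examples too similar to the seed data
  teacher_temperature: 0.7               # more predictable teacher output
  teacher_max_tokens: 4096               # cap on the length of each teacher response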

match_generated_distribution_to_seed

Default: false

Description: When enabled, the class distribution of the generated data matches that of the seed data. Only used for classification tasks.

num_unlabelled_exemplars_per_generation

Default: 2

Description: Number of unlabelled examples to provide during each teacher/LLM invocation when generating synthetic data.

data_generation_strategy

Default: classification-one-class-context

Options:

  • Classification options:
    • classification-one-class
    • classification-all-class
    • classification-all-class-context
    • classification-all-class-weak-labels
    • classification-one-class-context
    • classification-one-class-weak-labels
    • contextual-classification-all-class
  • Question Answering options:
    • qa-open-book
    • qa-open-book-with-synthetic-context
    • qa-open-book-information-extraction

Description: Strategy to use when generating synthetic training data.
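The chosen strategy should match the task declared in base. For example, a contextual classification setup might pair the contextual-classification task with the contextual-classification-all-class strategy, as sketched below; this pairing is inferred from the option names, so verify it against your task.

base:
  task: contextual-classification
synthgen:
  data_generation_strategy: contextual-classification-all-class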

Example Configuration

base:
  task: question-answering-open-book
setup:
  student_model_name: meta-llama/Llama-3.2-1B-Instruct
synthgen:
  data_generation_strategy: qa-open-book
tuning:
  num_train_epochs: 32

This configuration sets up an open-book QA task using a 1B parameter model with 32 training epochs.
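For comparison, a classification setup following the same conventions might look like the sketch below; the model, strategy, and epoch count are illustrative choices, and options not listed are assumed to fall back to their defaults.

base:
  task: classification
setup:
  student_model_name: meta-llama/Llama-3.2-3B-Instruct
synthgen:
  data_generation_strategy: classification-one-class-context
  match_generated_distribution_to_seed: true
tuning:
  num_train_epochs: 64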