Introduction
Welcome to the Distil Labs hands‑on tutorial for fine-tuning and deploying your own domain-specialized assistant.
In this tutorial, you’ll learn how to:
- Fine-tune a small language model (SLM) for a custom open-book question answering task using the Distil Labs platform.
- Deploy a fully local Retrieval-Augmented Generation (RAG) system, where the fine-tuned model answers questions based on an external knowledge source.
Our focus is on building an assistant that can answer questions about the Roman Empire using just a single Wikipedia page as context. You will walk through the full lifecycle—from understanding your dataset, to fine-tuning a 135M-parameter model, to deploying a self-contained RAG pipeline that runs entirely on your machine. To visualise the model's size, take a look at the following comparison between a frontier model such as GPT-4, Llama 8B, which is normally considered a “small language model”, and the 135M-parameter model we will be training in this tutorial.
Despite its compact size, the fine-tuned SLM will deliver performance close to much larger models—demonstrating how domain specialization and efficient distillation can unlock powerful capabilities on resource-constrained hardware.
By the end, you’ll have a functional, local QA assistant—built with minimal data, no ML expertise, and zero dependency on cloud-based LLMs.
Registration
The first step towards model distillation is creating an account at app.distillabs.ai. Once you sign up, you can use your email/password combination in the authentication section below.
Notebook Setup
Copy over necessary data
Install python libraries
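A minimal sketch of the installs this notebook relies on; the exact package list and versions are assumptions, so adjust them to match the official tutorial environment.

```python
# Notebook cell: install the Python dependencies used later in the tutorial.
# The package list below is an assumption, not the official requirements file.
%pip install -q langchain langchain-community langchain-huggingface \
    langchain-text-splitters sentence-transformers openai pandas \
    requests beautifulsoup4 transformers
```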
Step 1: Understand your data
Before we can specialize a model or build a retrieval‑augmented generation (RAG) pipeline, we need to inspect the knowledge source we’ll be working with. In this tutorial, our task is: answer questions about the Roman Empire.
Why bother looking at the raw data first?
• It clarifies the scope (what’s in and what’s out of domain).
• It helps us spot formatting issues or noisy sections.
• It lets us craft realistic evaluation questions early on.
Retrieve a reference article
To keep things quick, we’ll use a single reference document: the English Wikipedia page for the Roman Empire. In a production system you’d likely combine multiple sources, but one page is enough to demo the workflow.
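A minimal sketch of fetching that page, using LangChain's `WebBaseLoader` (the URL is the public English Wikipedia article; any HTML-to-text loader would work just as well):

```python
# Download the reference article and peek at the raw text.
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/Roman_Empire")
docs = loader.load()                  # a list with a single Document
article_text = docs[0].page_content   # the full plain-text article
print(article_text[:500])             # inspect the first few hundred characters
```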
Sample questions we want to answer
Let’s jot down a few questions that our finished system should handle. Capturing these early gives us a mini test‑set for later.
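For example, the questions used throughout this tutorial can be kept in a small list so we can reuse them as a smoke test later:

```python
# Hand-written questions the finished assistant should be able to answer.
sample_questions = [
    "When did the Roman Empire collapse?",
    "Who was the first emperor of the Roman Empire?",
    "What were the main languages spoken across the Roman Empire?",
    "What monumental arena in Rome hosted gladiatorial games?",
]
```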
Why do we specialize the SLM
Here are the answers generated by the base HuggingFaceTB/SmolLM2-135M-Instruct with access to the Wikipedia context. Below, we will see the questions (Q) and answers (A) that we use for testing, and predictions (P) generated by the model. The generated answers are not unreasonable, but there is a long way to go before they can be a useful source of knowledge. One good way to improve the quality of the generated answers is fine-tuning the base model with distil labs.
Q: When did the Roman Empire collapse?
- A: The west Roman Empire fell in 476 CE and the East lasted until the fall of Constantinople in 1453.
- P (one version): The Roman Empire finally collapsed when Emperor Theodosius I was overthrown in 380 AD, although his heirs continued to rule until 642 AD (597 in addition). The fall is attributed to a combination of internal strife and external factors.
- P (second version): The Roman Empire collapsed in the West and survived in the East
===== End of the Western Roman Empire in 476 AD ==
====== Fall in the West and survival in the East =====
<context>
The Roman Empire ruled the Mediterranean and much of Europe, Western Asia and North Africa. The Romans conquered most of this during the Republic, and it was ruled by emperors following Octavian’s assumption of effective sole rule in 27 BC. The western empire collapsed in 476 AD, but the eastern empire lasted until the fall of Constantinople in 1453.
</context>
Q: Who was the first emperor of the Roman Empire?
- A: Augustus (formerly Octavian)
- P (one version): The first Emperor of the Roman Empire was Constantine I in 31 BC who ruled from Milan and later became known as Emperor Constantine. He restored the Julian Code to its original form after the death of Trajan in AD 109 or possibly in 75 CE after Trajan’s death in AD 98 that led to him becoming the first Emperor of the Roman Empire after his appointment by Emperor Titus in 12,651 (AD 336) from his father, Antoninus Pius.
- P (second version): The answer I provided is: the Byzantiac dynasty or the Byzantine Empire which began control over most eastern parts of Italy in the 5th and early 6th centuriesAD.
Q: What were the main languages spoken across the Roman Empire?
- A: Latin in the West and Greek in the East
- P (one version): There were not two dominant languages by the Imperial Era throughout, Rome’s history it was a multi-peopelanguage that emerged.
In Roman history the Empire dominated all the major languages and dialects including Latin, which is Latin still commonly used today in many countries of the world. It wasn’t until the Roman Empire collapsed after a brief period during the early Republic and the Principality (27 BC - AD 476) during the middle period that vernaculars gradually declined with the rise of Early Christianity.
</content>
- P (second version):
There are several different languages spoken across the Roman Empire, each with its own unique characteristics and cultural impacts. The languages include:
In the West, Latin (Celtic) was in widespread use throughout much of the Empire’s territory. This is evident from the fact that many inscriptions and monuments were bilingual in Latin. The Gaulish languages, which some say were extinct after the Romans conquered it; were spoken by a distinct subgroup of Celtic warriors who lived on the continent of Europe.
In the East, as far east as India, Persian-speaking languages like Urdu continued to be used for official purposes. Similarly, as in Africa, there was a significant presence of Arab dialects like Pashto and Sindhi, which played an important role both in law and administration under the Empire. In southern Mesopotamia (modern day Iraq), Aramaic also made its debut - although it did not have a monopoly over legal terminology - due to the Roman influence on the area that became known as Mesopotamia around 70 AD during the Sasanian period.
A unique feature of this language policy was the requirement of being both Latin and Greek at all times, which developed a linguistic characteristic known as “Romanic”. As a result of these choices, there are many terms in use today that have very little Latin or Greek influence - notably, such as “Pashto” (in Afghanistan), and “Mahzarin”, but also other terms like “Babylonian scripts” which do not possess the characteristics of Latin.
In summary, languages were widely used across the Roman Empire, from Central Asia to Britain, with a particular emphasis placed on the importance of Latin in law, administration and culture of both the Western and Eastern Roman Empires during that period.
Q: What monumental arena in Rome hosted gladiatorial games?
- A: The Colosseum
- P: The city’s famous Circus Maximus was the venue for numerous theatrical events at which gladiators fought, competed with animals such as bulls and donkeys etc., alongside other spectacular presentations.
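For reference, baseline predictions like the ones above can be reproduced locally with the transformers library. This is a sketch only: it assumes `article_text` from the retrieval step, and the prompt format is an assumption rather than the exact template used by the platform.

```python
# Generate a baseline answer with the unmodified 135M-parameter student model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

context = article_text[:2000]  # a real system would pass retrieved chunks instead
question = "Who was the first emperor of the Roman Empire?"
messages = [{
    "role": "user",
    "content": f"<context>\n{context}\n</context>\n\nQuestion: {question}",
}]

# Build the chat prompt and generate a short completion.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```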
Step 2: Specialize a Question-Answering Model with distil labs
In this chapter you will transform a compact 135M-parameter “student” model into a domain expert—without writing a single training loop yourself. Distil Labs takes care of every heavy-lifting step:
What you need to supply
- A concise job description that tells the platform what “good” looks like
- Roughly 20–100 labeled (question, answer) pairs for train / test
- Any domain documents you want the teacher to read while inventing synthetic Q&A pairs
Everything else (synthetic generation, distillation, evaluation, and packaging) is automated.
Let’s dive in and see how that looks in practice.
Authentication
The first step towards model distillation is logging into the distil labs account you created at the beginning of the notebook. If you have already registered, you can use your email/password combination in the authentication section below.
Register a new model
The first component of the workflow is registering a new model - this helps us keep track of all our experiments down the line.
Inspect our models
Now that the model is registered, we can take a look at all the models in our repository.
Data Upload
The data for this example should be stored in the data_location directory. Let's first take a look at the current directory to make sure all files are available. Your current directory should look like:
Train/test set
We need a small train dataset to begin distil labs training and a test dataset that we can use to evaluate the performance of the fine-tuned model. Here, we use the train and test datasets from the data_location directory, where each is a CSV file with fewer than 100 (question, answer) pairs.
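A quick way to sanity-check the splits is to load them with pandas. The file names below are assumptions; adjust them to whatever is actually in your data_location directory.

```python
# Peek at the (question, answer) pairs in the train/test splits.
import pandas as pd

train_df = pd.read_csv(f"{data_location}/train.csv")
test_df = pd.read_csv(f"{data_location}/test.csv")
print(f"{len(train_df)} train pairs, {len(test_df)} test pairs")
train_df.head()
```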
Unstructured dataset
The unstructured dataset is used to guide the teacher model in generating diverse, domain-specific data. Here, we will use the chunks of the Wikipedia article as the unstructured data for our problem.
Data upload
We upload our dataset by attaching it to the model we created; this lets us keep all the artifacts in one place.
Teacher Evaluation
Before training an SLM, distil labs validates whether a large language model can solve your task:
Poll the status endpoint until it completes, then inspect the quality of the generated answers. distil labs shows four scores that tell you how well the “teacher” model answers your test questions. Think of them as different lenses on the same picture—together they give a fuller view than any single number.
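As an illustration only, a polling loop could look like the sketch below. The endpoint URL, authentication header, and response fields are hypothetical stand-ins, not the documented distil labs API; use the client calls shown in the notebook cells for the real thing.

```python
# Illustrative polling helper; the status_url and JSON fields are hypothetical.
import time
import requests

def wait_for_job(status_url: str, token: str, interval: int = 30) -> dict:
    """Poll a job-status URL until it reports a terminal state."""
    while True:
        response = requests.get(status_url, headers={"Authorization": f"Bearer {token}"})
        response.raise_for_status()
        job = response.json()
        if job.get("status") in {"complete", "failed"}:
            return job
        time.sleep(interval)
```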
How to interpret a scorecard
- If Exact-Match is low but LLM-as-a-Judge is high, the answers are probably right but paraphrased—consider adding those paraphrases to your reference set.
- If all four numbers sag, revisit your job description or give the model more context; the task may be under-specified.
Follow the links above for deeper dives if you want to explore the math or research behind each metric.
SLM Training
Once the teacher evaluation completes successfully, start the SLM training:
We can analyze the status of the training job using the jobs API. The following code snippet displays the current status of the job we started before:
When the job is finished (status=complete), we can use the jobs API again to get the benchmarking results—the accuracy of the base and the fine-tuned SLM—using the same four metrics as for the teacher evaluation. We can achieve this using:
Download Your Model
You can list all of your models using the cell below. Once training is complete, download the selected model for deployment.
Step 3: Build a local RAG system with your fine‑tuned model
Now that we have a small language model fine‑tuned specifically for Roman‑Empire question‑answering, we can build our RAG pipeline around it. This domain‑specialized LLM will provide more accurate, context‑aware answers than our baseline model while still running entirely on local hardware. The main objectives for us are as follows:
- Launch a lightweight chat model locally with ollama.
- Chunk a Wikipedia article, embed the chunks with HuggingFace sentence‑transformers, and store them in an in‑memory vector store.
- Glue retrieval and generation together in a minimal RAG class, then test the loop end‑to‑end.
Install ollama in your own system
To install ollama, follow the instructions from https://ollama.com/download and make sure to enable the serving daemon (via ollama serve). Once ready, make sure the app is running by executing the following command (the list should be empty since we have not loaded any models yet):
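A minimal sketch of that check from a notebook cell, assuming the ollama CLI is on your PATH:

```python
# Ask the local ollama daemon for its model list; an empty table means
# the server is up but no models have been registered yet.
import subprocess

result = subprocess.run(["ollama", "list"], capture_output=True, text=True)
print(result.stdout)
```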
(Optional) Install ollama for Google Colab
If you are running this notebook in Google Colab, you can install Ollama using the following link
Once ollama is installed, we should start the application. You can start the daemon with ollama serve using nohup to make sure it stays in the background.
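One possible way to do that from a Colab cell is sketched below; the log file name is arbitrary and the short sleep simply gives the server time to start listening.

```python
# Launch the ollama daemon in the background so later cells can talk to it.
import subprocess
import time

subprocess.Popen("nohup ollama serve > ollama.log 2>&1 &", shell=True)
time.sleep(5)  # give the server a moment to come up
```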
Make sure the app is running by executing the following command (the list should be empty since we have not loaded any models yet):
Register and test the downloaded model
Once your model is trained, it should be unpacked and registered with ollama. The downloaded model directory already contains everything that is needed, and the model can be registered with the command below. Once it is ready, we can test the model with a standard OpenAI interface.
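A sketch of both steps is shown below. The model name, the path to the Modelfile, and the prompt are assumptions; ollama's OpenAI-compatible endpoint at http://localhost:11434/v1 accepts any placeholder API key.

```python
# Register the unpacked model with ollama, then send a test chat request
# through the OpenAI-compatible endpoint.
import subprocess
from openai import OpenAI

subprocess.run(
    ["ollama", "create", "roman-empire-slm", "-f", "model/Modelfile"],
    check=True,
)

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="roman-empire-slm",
    messages=[{"role": "user", "content": "Who was the first emperor of the Roman Empire?"}],
)
print(reply.choices[0].message.content)
```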
Index our target dataset
This section walks through loading the Wikipedia article on the Roman Empire into an in‑memory vector store (adapted from https://python.langchain.com/docs/tutorials/rag/):
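A condensed sketch of that indexing step, adapted from the LangChain RAG tutorial linked above; the chunking parameters and embedding model name are assumptions.

```python
# Load the article, split it into overlapping chunks, embed the chunks, and
# store them in an in-memory vector store.
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = WebBaseLoader("https://en.wikipedia.org/wiki/Roman_Empire").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(chunks)
```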
Define the RAG logic
Now that our dataset is indexed and the chat model is live, we can wire retrieval and generation together. In this section we implement a bite‑sized RAG helper class that
- fetches the top‑k passages most similar to the user’s question,
- feeds those passages and the question into the language model via a structured prompt, and
- returns a concise answer.
With this plumbing in place, answering a question becomes a single‑function call.
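A minimal sketch of such a helper is shown below, assuming the vector_store and OpenAI-compatible client defined earlier; the prompt wording is an assumption rather than the exact template shipped with the tutorial.

```python
# Bite-sized RAG helper: retrieve top-k passages, prompt the chat model, return the answer.
class RAG:
    def __init__(self, vector_store, client, model: str, k: int = 4):
        self.vector_store = vector_store
        self.client = client
        self.model = model
        self.k = k

    def answer(self, question: str) -> str:
        # 1) fetch the top-k passages most similar to the question
        passages = self.vector_store.similarity_search(question, k=self.k)
        context = "\n\n".join(doc.page_content for doc in passages)
        # 2) feed the passages and the question into the model via a structured prompt
        prompt = f"<context>\n{context}\n</context>\n\nQuestion: {question}"
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        # 3) return a concise answer
        return response.choices[0].message.content
```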
Plug the new model into RAG
With the fine‑tuned weights now running locally, the last step is to introduce the specialized LLM into our existing RAG helper class. The retrieval component fetches the most relevant passages about the Roman Empire—while the generation step leverages a model that has been trained on our domain‑specific data.
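Concretely, this amounts to constructing the helper with the fine-tuned model's ollama name and asking one of our test questions; the variable and model names below follow the earlier sketches and are assumptions.

```python
# Point the RAG helper at the fine-tuned model and run a test question end-to-end.
rag = RAG(vector_store=vector_store, client=client, model="roman-empire-slm")
print(rag.answer("When did the Roman Empire collapse?"))
```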