Local Deployment

You can download and deploy your trained model locally using the inference framework of your choice.

Downloading your model

You can download the model using the link provided by the API (see Get your token for how to obtain an API token):

import requests

# Your API token (see "Get your token") and the ID of your training job
token = "YOUR_API_TOKEN"
slm_training_job_id = "YOUR_TRAINING_ID"

# Request the model download information
response = requests.get(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/model",
    headers={"Authorization": f"Bearer {token}"},
)

# The response contains the download URL for the model tarball
print(response.json())

Use the link to download the model and then extract the tarball. After extraction, you will have a model directory containing your trained SLM with all necessary files for deployment.
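For example, the following sketch downloads and extracts the tarball in Python. Replace the placeholder with the download URL returned by the API call above; the local filename model.tar.gz is just an illustrative choice:

import tarfile

import requests

# Replace with the download URL returned by the API
download_url = "YOUR_DOWNLOAD_URL"

# Stream the tarball to disk
with requests.get(download_url, stream=True) as download:
    download.raise_for_status()
    with open("model.tar.gz", "wb") as f:
        for chunk in download.iter_content(chunk_size=8192):
            f.write(chunk)

# Extract the model files into the current directory
with tarfile.open("model.tar.gz") as tar:
    tar.extractall(path=".")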

You can deploy the downloaded model using the inference framework of your choice. Below, we provide instructions on how to deploy your models using vLLM and Ollama.

Note that the instructions may differ slightly depending on your operating system. For the most up-to-date information, please refer to the vLLM and Ollama documentation.

Local deployment with vLLM

Extract the model tarball into the directory you plan to work in. You should see the following files:

├── model/
├── model-adapters/
├── model_client.py
└── README.md

To get started, we recommend setting up a new virtual environment.

Run the following to create and activate a virtual environment:

$ python -m venv serve
$ source serve/bin/activate

Install vLLM and the OpenAI Python client:

$ pip install vllm openai

vLLM deployment of classification models

To start the server, run:

$ vllm serve model --task classify --api-key EMPTY --port 11434

This runs the server on port 11434.

Note that model in the command above refers to the directory that contains your model weights.

When running the server with vllm serve, the process does not run in the background, so you will need a separate terminal session to invoke the model.

You can use the Python script (model_client.py) to invoke the model, passing in your own question:

$ python model_client.py --question "QUESTION"

If you invoke this script without providing the --question argument, an example from your test data will be used so you can familiarize yourself with the output.
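If you prefer to call the server directly instead of using model_client.py, the sketch below posts to vLLM's Classification API. The /classify endpoint and payload shape are assumptions based on recent vLLM versions; if your version differs, treat model_client.py as the reference client:

import requests

# Local vLLM server started with --task classify on port 11434
BASE_URL = "http://localhost:11434"

response = requests.post(
    f"{BASE_URL}/classify",
    headers={"Authorization": "Bearer EMPTY"},  # matches --api-key EMPTY
    json={
        "model": "model",  # the directory name passed to `vllm serve`
        "input": "YOUR TEXT TO CLASSIFY",
    },
)

# Print the raw response; the exact schema (labels, probabilities)
# depends on your vLLM version
print(response.json())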

vLLM deployment of question answering models

To start the server, run:

$ vllm serve model --api-key EMPTY

Note that model in the command above refers to the directory that contains your model weights.

When running the server with vllm serve, the process does not run in the background, so you will need a separate terminal session to invoke the model.

You can use the Python script (model_client.py) to invoke the model, passing in your own question and context:

$ python model_client.py --question "QUESTION" --context "CONTEXT"

If you invoke this script without providing the --question and --context arguments, an example from your test data will be used so you can familiarize yourself with the output.
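You can also call the server directly through its OpenAI-compatible API using the openai package installed earlier. The sketch below assumes the server is running on vLLM's default port 8000; the way the question and context are combined into the prompt is illustrative, so check model_client.py for the exact format your model was trained with:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000, dummy key)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

question = "QUESTION"
context = "CONTEXT"

response = client.chat.completions.create(
    model="model",  # the directory name passed to `vllm serve`
    messages=[
        # Illustrative prompt layout; model_client.py shows the trained format
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)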

Local deployment with Ollama

Extract the model tarball into the directory you plan to work in. You should see the following files:

├── model/
├── model-adapters/
├── model.gguf
├── Modelfile
├── model_client.py
└── README.md

To get started, we recommend setting up a new virtual environment.

Run the following to create and activate a virtual environment:

$ python -m venv serve
$ source serve/bin/activate

Install the OpenAI Python client:

$ pip install openai

Install Ollama by following the instructions in the Ollama documentation.

Change into the directory of the extracted tarball. Then, create the model using the following command:

$ ollama create model -f Modelfile

Note: for older model versions, Modelfile is instead located at model/Modelfile.

You can then use the Python script (model_client.py) to invoke the model, passing in your own question and context:

$ python model_client.py --question "QUESTION" --context "CONTEXT"

If you invoke this script without providing the --question and --context arguments, an example from your test data will be used so you can familiarize yourself with the output.
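As with vLLM, you can also call the model directly through Ollama's OpenAI-compatible endpoint, which listens on port 11434 by default. The prompt layout below is illustrative; model_client.py shows the exact format your model expects:

from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on port 11434 by default;
# the API key is required by the client but ignored by Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

question = "QUESTION"
context = "CONTEXT"

response = client.chat.completions.create(
    model="model",  # the name used with `ollama create model -f Modelfile`
    messages=[
        # Illustrative prompt layout; model_client.py shows the trained format
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)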

Production deployment considerations

When deploying your model to production, consider:

  • Resource Requirements: Even small models benefit from GPU acceleration, especially for high-throughput applications.
  • Security: Apply appropriate access controls, especially if your model has access to sensitive information.
  • Container Deployment: Consider packaging your model with Docker for consistent deployment across environments.