Model deployment

After successfully training your small language model (SLM), the final step is to deploy it for inference. This page guides you through accessing and deploying your model.

Once the model is trained, you can upload it directly to your private Hugging Face repository for easy deployment.

To upload your model, you will need:

  1. The training ID of the model you wish to upload (YOUR_TRAINING_ID).
  2. A Hugging Face user access token with write access (YOUR_HF_TOKEN).
  3. A name for the model, which will become the name of the Hugging Face repository (NAME_OF_YOUR_MODEL).
  4. Your distil labs AUTH_HEADER.

You can upload the model with the following API call:

import json
import requests

slm_training_job_id = "YOUR_TRAINING_ID"
hf_details = {
    "hf_token": "YOUR_HF_TOKEN",
    "repo_id": "NAME_OF_YOUR_MODEL"
}

# Push the model to the Hugging Face Hub
response = requests.post(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/huggingface_models",
    data=json.dumps(hf_details),
    headers={"content-type": "application/json", **AUTH_HEADER},
)

print(response.json())
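
Optionally, you can confirm that the push completed by listing the files in the new repository with the huggingface_hub client. This is a quick sketch, not part of the distil labs API; it assumes the repository ends up under your account as <USERNAME>/<MODEL_NAME> and reuses the same access token:

from huggingface_hub import HfApi

# List the files in the newly created repository to confirm the upload
api = HfApi(token="YOUR_HF_TOKEN")
print(api.list_repo_files("<USERNAME>/<MODEL_NAME>"))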

Once your model has been pushed to Hugging Face, you can run it with your preferred inference framework. Hugging Face currently supports more than ten frameworks, including Ollama and vLLM.

Deploying Question Answering Models from Hugging Face

Hugging Face provides out-of-the-box support for running your question-answering models with vLLM and Ollama. These frameworks serve your model and let you invoke it via API requests, much like calling ChatGPT through the OpenAI API specification.

Note that for Ollama, your model needs to be in GGUF format. For this reason, we push models to two repositories on Hugging Face: one with the GGUF format and one with the safetensors format.

The following snippet shows how you can run the model using the transformers pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model="<USERNAME>/<MODEL_NAME>")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
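
Continuing from the snippet above, the call below is a minimal sketch of how to read the generated answer; max_new_tokens is an illustrative value and the exact output structure can vary between transformers versions:

# Generate a reply and print only the assistant's message content
outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"])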

Note that when using Ollama, you may need to upload your Ollama SSH key to Hugging Face to authenticate access to your private models. You can do this by following the instructions in the Hugging Face documentation.

Deploying Classification Models from Hugging Face

Once your classification model has been pushed to the model repository, you can use the following snippet to test the model:

from transformers import pipeline

pipe = pipeline("text-classification", model="<USERNAME>/<MODEL_NAME>")
pipe("<INPUT>")
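
Continuing from the snippet above, the pipeline also accepts a batch of inputs. The texts below are placeholders, and the returned labels depend on your training data:

# Classify several inputs at once; each result is a dict with a label and a score
results = pipe(["<FIRST_INPUT>", "<SECOND_INPUT>"])
for result in results:
    print(result["label"], result["score"])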

Local Deployment

Downloading Your Model

Alternatively, you can download the model using the download link provided by the API (you will need your AUTH_HEADER):

import requests
from pprint import pprint

# Get the model download information
slm_training_job_id = "YOUR_TRAINING_ID"
response = requests.get(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/model",
    headers=AUTH_HEADER,
)

# The response contains the download URL
pprint(response.json())

Use the link to download the model and then extract the tarball. After extraction, you will have a model directory containing your trained SLM with all necessary files for deployment.
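
Continuing from the snippet above, the following sketch downloads and extracts the tarball with requests and tarfile. The download_url field and the model.tar.gz filename are illustrative; use whichever field and name the response actually provides:

import requests
import tarfile

# Hypothetical field name: read the actual download URL from the response above
download_url = response.json()["download_url"]

# Stream the tarball to disk
with requests.get(download_url, stream=True) as r:
    r.raise_for_status()
    with open("model.tar.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)

# Extract the archive into the current directory
with tarfile.open("model.tar.gz") as tar:
    tar.extractall(".")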

You can deploy the downloaded model using the inference framework of your choice. Below, we provide instructions on how to deploy your models using vLLM and Ollama.

Note that the instructions may differ slightly depending on your operating system. For the most up-to-date information, please refer to the vLLM and Ollama documentation.

Local Deployment with vLLM

Extract the files from the model tarball in the same directory you are planning to work in. You should see the following files:

├── model/
├── model-adapters/
├── model_client.py
├── README.md

To get started, we recommend setting up a new virtual environment.

Run the following to create and activate a virtual environment:

$python -m venv serve
$source serve/bin/activate

Install vLLM and the OpenAI client:

$pip install vllm openai

vLLM Deployment of Classification Models

To start the server, run:

$vllm serve model --task classify --api-key EMPTY --port 11434

This runs the server on port 11434.

Note that model in the command above refers to the directory which contains our model weights.

When running the server using vllm serve, the process does not run in the background, so you will need to use another terminal session to invoke the model.
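
From that second terminal, you can optionally confirm that the server is reachable before invoking the model. This sketch assumes the server started with the command above is running locally on port 11434 with --api-key EMPTY:

import requests

# Query the OpenAI-compatible model list to confirm the server is up
response = requests.get(
    "http://localhost:11434/v1/models",
    headers={"Authorization": "Bearer EMPTY"},
)
print(response.json())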

You can use the Python script (model_client.py) to invoke the model, which allows you to pass in your own question:

$python model_client.py --question "QUESTION"

If you invoke this script without providing the --question argument, an example from your test data will be used, so you can familiarize yourself with the output.

vLLM Deployment of Question Answering Models

To start the server, run:

$vllm serve model --api-key EMPTY

Note that model in the command above refers to the directory which contains our model weights.

When running the server using vllm serve, the process does not run in the background, so you will need to use another terminal session to invoke the model.

You can use the Python script (model_client.py) to invoke the model, which allows you to pass in your own question and context:

$python model_client.py --question "QUESTION" --context "CONTEXT"

If you invoke this script without providing the --question and --context arguments, an example from your test data will be used, so you can familiarize yourself with the output.
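
If you prefer to call the server directly instead of using model_client.py, the sketch below uses the OpenAI Python client against vLLM's OpenAI-compatible endpoint. It assumes the server started above is running locally on the default port 8000, that the served model name is model (the directory passed to vllm serve), and that a plain user message roughly matches the prompt format your model was trained on; model_client.py remains the reference for the exact prompt:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the API key must match the one
# passed to `vllm serve` (EMPTY in this case)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="model",  # served model name defaults to the path given to `vllm serve`
    messages=[
        {"role": "user", "content": "CONTEXT: <CONTEXT>\n\nQUESTION: <QUESTION>"},
    ],
)
print(completion.choices[0].message.content)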

Local Deployment with Ollama

Extract the files from the model tarball in the same directory you are planning to work in. You should see the following files:

├── model/
├── model-adapters/
├── model_client.py
├── README.md

To get started, we recommend setting up a new virtual environment.

Run the following to create and activate a virtual environment:

$python -m venv serve
$source serve/bin/activate

Install the OpenAI client:

$pip install openai

Install Ollama by following the instructions in the Ollama documentation.

Change into the model/ directory; you should see the following files:

├── Modelfile
├── config.json
├── model.safetensors
├── ...

Create the model using the following command:

$ollama create model -f Modelfile

You can then use the Python script (model_client.py) to invoke the model, which allows you to pass in your own question and context:

$python model_client.py --question "QUESTION" --context "CONTEXT"

If you invoke this script without providing the --question and --context arguments, an example from your test data will be used, so you can familiarize yourself with the output.
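
As an alternative to model_client.py, you can also call Ollama's REST API directly. The sketch below assumes Ollama is running locally on its default port 11434, that the model was created under the name model as above, and that a simple concatenated prompt approximates the format your model expects; model_client.py remains the reference for the exact prompt:

import requests

# Call Ollama's local REST API; it listens on port 11434 by default
payload = {
    "model": "model",  # name used in `ollama create`
    "prompt": "CONTEXT: <CONTEXT>\n\nQUESTION: <QUESTION>",
    "stream": False,  # return a single JSON response instead of a stream
}
response = requests.post("http://localhost:11434/api/generate", json=payload)
print(response.json()["response"])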

Production Deployment Considerations

When deploying your model to production, consider:

  • Resource Requirements: Even small models benefit from GPU acceleration, especially for high-throughput applications.
  • Security: Apply appropriate access controls, especially if your model has access to sensitive information.
  • Container Deployment: Consider packaging your model with Docker for consistent deployment across environments.