Model deployment
After successfully training your small language model (SLM), the final step is to deploy it for inference. This page guides you through accessing and deploying your model.
Hugging Face Model Repository (recommended)
Once the model is trained, you can upload it directly to your private Hugging Face repository for easy deployment.
To upload your model, you will need:
- The training ID of the model you wish to upload (`YOUR_TRAINING_ID`).
- A Hugging Face user access token with sufficient write privileges (`YOUR_HF_TOKEN`).
- A name for the model, which will become the name of the Hugging Face repository (`NAME_OF_YOUR_MODEL`).
- Your distil labs `AUTH_HEADER`.
You can upload the model with the following API call:
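The exact route and payload are defined by the distil labs API reference; the sketch below uses a hypothetical endpoint and field names purely to illustrate the shape of the request.

```python
import requests

# Hypothetical endpoint and payload, shown for illustration only; consult the
# distil labs API reference for the actual route and field names.
YOUR_TRAINING_ID = "..."                                # your training ID
AUTH_HEADER = {"Authorization": "Bearer YOUR_API_KEY"}  # your distil labs auth header (placeholder)

response = requests.post(
    f"https://api.distillabs.ai/trainings/{YOUR_TRAINING_ID}/upload",
    headers=AUTH_HEADER,
    json={
        "hf_token": "YOUR_HF_TOKEN",         # Hugging Face write token
        "model_name": "NAME_OF_YOUR_MODEL",  # becomes the Hugging Face repo name
    },
)
response.raise_for_status()
print(response.json())
```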
Once your model has been pushed to Hugging Face, you can run it using your preferred inference framework. Hugging Face currently supports more than 10 frameworks, including Ollama and vLLM.
Deploying Question Answering Models from Hugging Face
Hugging Face provides out-of-the-box support for running your question-answering models using vLLM and Ollama. These frameworks run your models on a server and allow you to invoke the models using API requests, similar to how you may have used ChatGPT via the OpenAI API specification.
Note that for Ollama, your model needs to be in the GGUF format. For this reason, we push models to two repositories on Hugging Face: one for the GGUF format and one for the safetensors format.
The following snippets show how you can run the models, which can be invoked using API requests:
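For example, assuming the two repositories live under your Hugging Face account and are named `NAME_OF_YOUR_MODEL` (safetensors) and `NAME_OF_YOUR_MODEL-gguf` (GGUF); the username and repository names below are placeholders.

```bash
# Ollama: run the GGUF repository directly from Hugging Face
ollama run hf.co/YOUR_HF_USERNAME/NAME_OF_YOUR_MODEL-gguf

# vLLM: serve the safetensors repository as an OpenAI-compatible API
vllm serve YOUR_HF_USERNAME/NAME_OF_YOUR_MODEL
```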
Note that when using Ollama, you may need to upload your Ollama SSH key to Hugging Face to authenticate access to your private models. You can do this by following the instructions in the Hugging Face documentation.
Deploying Classification Models from Hugging Face
Once your classification model has been pushed to the model repository, you can use the following snippet to test the model:
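A minimal sketch, assuming the model is served through an OpenAI-compatible endpoint (for example via vLLM or Ollama as shown above); the base URL, model name, and input text are placeholders to adapt to your setup.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible server (e.g. vLLM or Ollama) is running locally;
# adjust the base URL and model name to match your deployment.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="NAME_OF_YOUR_MODEL",  # placeholder for your served model name
    messages=[{"role": "user", "content": "Text you want to classify"}],
)
print(response.choices[0].message.content)  # predicted label
```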
Local Deployment
Downloading Your Model
Alternatively, you can download the model using the download link provided by the API (you will need your distil labs `AUTH_HEADER`):
Use the link to download the model and then extract the tarball. After extraction, you will have a `model` directory containing your trained SLM with all necessary files for deployment.
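For example (the download URL and archive name below are placeholders):

```bash
# Substitute the download link returned by the API; the archive name is an assumption
curl -L -o model.tar.gz "YOUR_DOWNLOAD_URL"
tar -xzf model.tar.gz
```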
You can deploy the downloaded model using the inference framework of your choice. Below, we provide instructions on how to deploy your models using vLLM and Ollama.
Note that the instructions may differ slightly depending on your operating system. For the most up-to-date information, please refer to the vLLM and Ollama documentation.
Local Deployment with vLLM
Extract the files from the model tarball in the same directory you are planning to work in. You should see the following files:
To get started, we recommend setting up a new virtual environment.
Run the following to create and activate a virtual environment:
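For example, on Linux or macOS (the environment name is just an example):

```bash
python -m venv .venv
source .venv/bin/activate
```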
Install vLLM and OpenAI:
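For example, using pip:

```bash
pip install vllm openai
```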
vLLM Deployment of Classification Models
To start the server, run:
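Assuming the extracted `model` directory is in your current working directory, a typical command looks like this:

```bash
# Serve the local model directory on port 11434
vllm serve model --port 11434
```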
This runs the server on port 11434.
Note that `model` in the command above refers to the directory that contains your model weights. When running the server with `vllm serve`, the process does not run in the background, so you will need to use another terminal session to invoke the model.
You can use the Python script (`model_client.py`) to invoke the model and pass in your own question:
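For example (the question text is a placeholder):

```bash
python model_client.py --question "Your question here"
```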
If you invoke this script without providing the `--question` argument, an example from your test data will be used so you can familiarize yourself with the output.
vLLM Deployment of Question Answering Models
To start the server, run:
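As with the classification model, assuming the extracted `model` directory is in your current working directory:

```bash
# Serve the local model directory on port 11434
vllm serve model --port 11434
```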
Note that `model` in the command above refers to the directory that contains your model weights. When running the server with `vllm serve`, the process does not run in the background, so you will need to use another terminal session to invoke the model.
You can use the Python script (`model_client.py`) to invoke the model and pass in your own question and context:
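For example (the question and context are placeholders):

```bash
python model_client.py --question "Your question here" --context "Your context here"
```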
If you invoke this script without providing the `--question` and `--context` arguments, an example from your test data will be used so you can familiarize yourself with the output.
Local Deployment with Ollama
Extract the files from the model tarball in the same directory you are planning to work in. You should see the following files:
To get started, we recommend setting up a new virtual environment.
Run the following to create and activate a virtual environment:
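For example, on Linux or macOS:

```bash
python -m venv .venv
source .venv/bin/activate
```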
Install OpenAI:
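For example, using pip:

```bash
pip install openai
```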
Install Ollama by following the instructions here: Ollama
Change into the `model/` directory; you should see the following files:
Create the model using the following command:
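The exact command depends on the files shipped with your download; a typical sketch, assuming a `Modelfile` pointing at the GGUF weights is included in the directory, looks like this:

```bash
# Assumes the directory contains a Modelfile referencing the GGUF weights
ollama create NAME_OF_YOUR_MODEL -f Modelfile
```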
You can then use the Python script (`model_client.py`) to invoke the model and pass in your own question and context:
If you invoke this script without providing the `--question` and `--context` arguments, an example from your test data will be used so you can familiarize yourself with the output.
Production Deployment Considerations
When deploying your model to production, consider:
- Resource Requirements: Even small models benefit from GPU acceleration, especially for high-throughput applications.
- Security: Apply appropriate access controls, especially if your model has access to sensitive information.
- Container Deployment: Consider packaging your model with Docker for consistent deployment across environments.