Model deployment

Once your model is trained, you can download and deploy it locally using the inference framework of your choice. Alternatively, you can push it to Hugging Face Hub for easy sharing and deployment.

Downloading your model

Download your trained model using the CLI or API:

distil model download <model-id>

After downloading, extract the tarball (a sample command is shown after the listing below). You will have a directory containing your trained SLM with all necessary files for deployment:

├── model/
├── model-adapters/
├── Modelfile
├── model_client.py
└── README.md
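
The exact archive name depends on your model ID; assuming a gzip-compressed tarball named after it, a typical extraction command is:

tar -xzf <model-id>.tar.gz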

Deploying with vLLM

vLLM is a high-performance inference engine for LLMs. To get started, set up a virtual environment and install dependencies:

python -m venv serve
source serve/bin/activate
pip install vllm openai

Start the vLLM server:

vllm serve model --api-key EMPTY

Note that model refers to the directory containing your model weights. The server runs in the foreground, so start it in a separate terminal window or as a background process.
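
One common way to background it and keep the logs (sketched here with an arbitrary log file path) is:

nohup vllm serve model --api-key EMPTY > vllm.log 2>&1 &
tail -f vllm.log

Tailing the log lets you watch startup progress; interrupting tail does not stop the server.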

Query the model using the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your question here"},
    ],
)
print(response.choices[0].message.content)

Or use the provided client script:

python model_client.py --question "Your question here"

For question answering models that require context, use the --context flag:

python model_client.py --question "Your question here" --context "Your context here"
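
Because vLLM exposes an OpenAI-compatible REST endpoint, you can also query it without the Python client; a minimal curl request mirroring the example above looks like this:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
        "model": "model",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Your question here"}
        ]
      }'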

Deploying with Ollama

Ollama makes it easy to run LLMs locally.

Install Ollama following the instructions at ollama.com, then set up your environment:

python -m venv serve
source serve/bin/activate
pip install openai

Create and run the model:

ollama create my-model -f model/Modelfile
ollama run my-model
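
Before wiring up a client, you can sanity-check the deployment from the shell: ollama run accepts a one-off prompt as an argument, and Ollama's local API lists the models it currently serves:

ollama run my-model "Your question here"
curl http://localhost:11434/api/tags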

Query the model using the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your question here"},
    ],
)
print(response.choices[0].message.content)

Or use the provided client script:

python model_client.py --question "Your question here"

For question answering models that require context, use the --context flag:

python model_client.py --question "Your question here" --context "Your context here"
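
If your application is interactive, you may prefer to stream tokens as they are generated; the OpenAI client supports this against Ollama's endpoint as well. A minimal sketch, reusing the model name from above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# stream=True returns an iterator of chunks instead of a single response
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Your question here"}],
    stream=True,
)
for chunk in stream:
    # some chunks (e.g. the final one) carry no content, so guard against None
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()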

Pushing to Hugging Face Hub (API only)

You can upload your model directly to your private Hugging Face repository. Once pushed, you can deploy it directly from Hugging Face using various inference frameworks.

Requirements:

  • Training ID of the model (YOUR_TRAINING_ID)
  • Hugging Face user access token with write privileges (YOUR_HF_TOKEN)
  • Repository name for your model (YOUR_USERNAME/MODEL_NAME)

import json
import requests

# See Account and Authentication for distil_bearer_token() implementation
auth_header = {"Authorization": f"Bearer {distil_bearer_token()}"}

slm_training_job_id = "YOUR_TRAINING_ID"
hf_details = {
    "hf_token": "YOUR_HF_TOKEN",
    "repo_id": "YOUR_USERNAME/MODEL_NAME",
}

response = requests.post(
    f"https://api.distillabs.ai/trainings/{slm_training_job_id}/huggingface_models",
    data=json.dumps(hf_details),
    headers={"Content-Type": "application/json", **auth_header},
)
print(response.json())
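
Once the request returns, you can confirm from your side that the repository was populated, for example with the huggingface_hub client (a sketch; the push may also create a separate GGUF repository, as noted below):

from huggingface_hub import HfApi

api = HfApi(token="YOUR_HF_TOKEN")
# list the files now present in the (private) repository
for path in api.list_repo_files("YOUR_USERNAME/MODEL_NAME"):
    print(path)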

Note that Ollama requires the model in GGUF format. As such, we push models to two repositories on Hugging Face: one for the GGUF format and one for the safetensors format. Once your model is on Hugging Face, you can deploy it directly using vLLM or Ollama:

pip install vllm
vllm serve "YOUR_USERNAME/MODEL_NAME"
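
If the repository is private, vLLM needs your Hugging Face credentials, for example via the HF_TOKEN environment variable or huggingface-cli login. For Ollama, GGUF models hosted on Hugging Face can be run directly by prefixing the repository with hf.co; if the push created a dedicated GGUF repository, use that repository name here:

ollama run hf.co/YOUR_USERNAME/MODEL_NAME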

Note: When using Ollama with private models, you may need to upload your Ollama SSH key to Hugging Face. See the Hugging Face documentation for instructions.

Production considerations

When deploying your model to production, consider:

  • Resource Requirements: Even small models benefit from GPU acceleration for high-throughput applications.
  • Security: Apply appropriate access controls, especially if your model handles sensitive information.
  • Container Deployment: Consider packaging your model with Docker for consistent deployment across environments; a minimal sketch follows below.
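
As a starting point for the container route, a sketch using the public vllm/vllm-openai image might look like the following; it assumes a GPU host with the NVIDIA container toolkit installed and mounts the extracted model directory into the container:

docker run --gpus all --ipc=host \
  -p 8000:8000 \
  -v "$(pwd)/model:/model" \
  vllm/vllm-openai:latest \
  --model /model --api-key EMPTY

The arguments after the image name are passed to vLLM's OpenAI-compatible server, so the client code shown earlier works unchanged against port 8000.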