Teacher evaluation

Teacher evaluation is a critical step in the distil labs training pipeline that happens before actual SLM training begins. It serves several important purposes:

Feasibility Check: It validates whether a large language model (LLM) can accurately solve your task. If the teacher model can solve the task, this is a strong indication that the student model can learn it effectively. If the teacher model cannot solve the task, you have the opportunity to refine your inputs before investing time in full SLM training.

Performance Benchmark: It establishes a performance expectation for your SLM. The accuracy of the teacher LLM provides a first approximation of the performance you can expect from your trained SLM.

Initiating teacher evaluation

After uploading your data, you can start teacher evaluation using the API as follows (get your token):

import json
import requests
from pprint import pprint

# Start a teacher evaluation for the uploaded dataset
data = {"upload_id": upload_id}
response = requests.post(
    f"https://api.distillabs.ai/models/{model_id}/teacher-evaluations",
    data=json.dumps(data),
    headers={"Content-Type": "application/json", **auth_header},
)

pprint(response.json())
eval_job_id = response.json()["id"]
print(f"Started teacher evaluation with ID: {eval_job_id}")

Checking evaluation status and results

You can check the status of your teacher evaluation and retrieve results using:

import requests
import time

# Poll the evaluation status until the job has finished
while True:
    response = requests.get(
        f"https://api.distillabs.ai/teacher-evaluations/{eval_job_id}/status",
        headers=auth_header,
    )
    status = response.json()["status"]
    if status != "JOB_RUNNING":
        break
    print(f"Evaluation status: {status}, re-checking in 10s...")
    time.sleep(10)

# Check evaluation results
print(f"Status: {response.json()}")

In a Jupyter notebook, you can display the results with:

import pandas as pd
print(pd.DataFrame(response.json()["results"]).transpose())

High accuracy on the teacher evaluation indicates that your task is well defined and you can move on to training. When you train an SLM for this task, the teacher LLM's accuracy serves as the quality benchmark for the trained model.

However, if teacher performance is low, consider:

  1. Revising your task description to be more specific
  2. Improving the quality of your example data
  3. Checking for inconsistencies in your dataset (see the sketch after this list)
  4. Ensuring your task is well-defined and solvable
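
For point 3, a quick way to surface inconsistencies is to look for identical inputs that carry conflicting labels. The sketch below assumes a classification dataset stored as JSON Lines with hypothetical "input" and "label" fields; adjust the file name and column names to match your own upload:

import pandas as pd

# Hypothetical file and column names: adjust to your dataset
examples = pd.read_json("train.jsonl", lines=True)

# Count how many distinct labels each identical input receives
label_counts = examples.groupby("input")["label"].nunique()
conflicting = label_counts[label_counts > 1]
print(f"{len(conflicting)} inputs appear with more than one label")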

Retrieving predictions

For a more in-depth analysis, you can download the teacher's predictions for the individual data points of the test dataset. The download link points to a file containing the predictions along with additional information that depends on the task type you selected (classification, question answering, or tool calling).

The URL of this file can be found using:

print(response.json()["evaluation_predictions_download_url"])

You can then download this file from the terminal using:

$ curl -o teacher_evaluation_predictions.json "<DOWNLOAD_URL>"
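
Alternatively, you can fetch the file directly in Python; this sketch simply reuses the download URL printed above:

import requests

download_url = response.json()["evaluation_predictions_download_url"]
prediction_file = requests.get(download_url)
with open("teacher_evaluation_predictions.json", "wb") as f:
    f.write(prediction_file.content)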

Note that the file is in JSON Lines format and can be read using:

df = pd.read_json("teacher_evaluation_predictions.json", lines=True)
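
From here you can drill into individual predictions. The exact fields depend on the task type; the sketch below assumes hypothetical "prediction" and "label" columns, as you might see for a classification task:

# Hypothetical column names: inspect df.columns to see what your file actually contains
if {"prediction", "label"} <= set(df.columns):
    mismatches = df[df["prediction"] != df["label"]]
    print(f"{len(mismatches)} of {len(df)} test examples were mispredicted")
    print(mismatches.head())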