Teacher evaluation

Teacher evaluation is a critical step in the distil labs training pipeline that happens before actual SLM training begins. It serves several important purposes:

Feasibility Check: It validates whether a large language model (LLM) can accurately solve your task. If the “teacher” model can solve the task, the “student” model should be able to learn it effectively. If the “teacher” model cannot solve the task, you have an opportunity to refine your inputs before investing time in full SLM training.

Performance Benchmark: It establishes a performance ceiling for your SLM. The accuracy of the teacher LLM provides an approximation of the best performance you can expect from your trained SLM.

Initiating Teacher Evaluation

After uploading your data, you can start teacher evaluation through the API as follows (see Get your AUTH_HEADER for how to obtain the authentication header):

import requests
from pprint import pprint

# Start teacher evaluation using the upload_id from your data upload
response = requests.post(
    f"https://api.distillabs.ai/teacher-evaluations/{upload_id}",
    headers=AUTH_HEADER,
)

# Store the teacher evaluation ID for checking status later
teacher_evaluation_id = response.json().get("id")
pprint(response.json())
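The request above assumes the call succeeds. A slightly more defensive variant is sketched below; it is illustrative rather than an official client. The helper name start_teacher_evaluation, the API_BASE constant, and the timeout value are assumptions introduced here, and the only response field relied on is the "id" used above. The session argument can be the requests module itself or a requests.Session.

```python
API_BASE = "https://api.distillabs.ai"


def start_teacher_evaluation(upload_id, auth_header, session, timeout=30):
    """Start a teacher evaluation and return its ID (hypothetical helper).

    session is anything with a requests-style post() method, for example the
    requests module or a requests.Session instance.
    """
    response = session.post(
        f"{API_BASE}/teacher-evaluations/{upload_id}",
        headers=auth_header,
        timeout=timeout,
    )
    # Fail early on HTTP errors instead of parsing an error body.
    response.raise_for_status()
    evaluation_id = response.json().get("id")
    if evaluation_id is None:
        raise ValueError("response did not contain an 'id' field")
    return evaluation_id


# Example usage, assuming upload_id and AUTH_HEADER from the earlier steps:
#   import requests
#   teacher_evaluation_id = start_teacher_evaluation(upload_id, AUTH_HEADER, requests)
```

Passing the session in as an argument keeps the helper easy to exercise without network access.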

Checking Evaluation Status and Results

You can check the status of your teacher evaluation and retrieve results using:

import requests
from pprint import pprint

# Check status and get results
response = requests.get(
    f"https://api.distillabs.ai/teacher-evaluations/{teacher_evaluation_id}/status",
    headers=AUTH_HEADER,
)

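Display the results as a table. This requires pandas; display is available by default in Jupyter and IPython notebooks:

import pandas as pd
display(pd.DataFrame(response.json().get("results")).transpose())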
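Because an evaluation may still be running when you first check, the status request may need to be repeated. The following is a minimal polling sketch, not a documented client. It assumes that the "results" field is missing or empty until the evaluation has finished, which this page does not confirm, and the helper name wait_for_results, the polling interval, and the attempt limit are all hypothetical.

```python
import time


def wait_for_results(fetch_status, interval_seconds=10.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status() until its payload contains results, then return them.

    fetch_status is a zero-argument callable returning the parsed JSON of the
    status endpoint. ASSUMPTION: "results" is missing or empty while the
    evaluation is still running.
    """
    for attempt in range(max_attempts):
        payload = fetch_status()
        results = payload.get("results")
        if results:
            return results
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"no results after {max_attempts} attempts")


# Example usage with the status request shown above:
#   results = wait_for_results(
#       lambda: requests.get(
#           f"https://api.distillabs.ai/teacher-evaluations/{teacher_evaluation_id}/status",
#           headers=AUTH_HEADER,
#       ).json()
#   )
```

If the status response exposes an explicit state field, checking that field would be a more reliable stop condition than the presence of results.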

High teacher accuracy indicates that your task is well defined and that you can move on to training. The teacher's score then serves as the quality benchmark for the trained SLM.

However, if teacher performance is low, consider:

  1. Revising your task description to be more specific
  2. Improving the quality of your example data
  3. Checking for inconsistencies in your dataset
  4. Ensuring your task is well-defined and solvable
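One way to approach the inconsistency check in item 3 is to look for identical inputs that carry different answers. The sketch below illustrates the idea under stated assumptions: the examples are held as (input_text, answer) pairs, which is a layout chosen for illustration and not the distil labs upload format, and the function name find_conflicting_examples is hypothetical.

```python
from collections import defaultdict


def find_conflicting_examples(examples):
    """Return inputs that appear with more than one distinct answer.

    examples is an iterable of (input_text, answer) pairs. This layout is an
    assumption made for illustration only.
    """
    answers_by_input = defaultdict(set)
    for input_text, answer in examples:
        # Normalise whitespace and case so trivial variations still match.
        key = " ".join(input_text.split()).lower()
        answers_by_input[key].add(answer)
    return {
        key: sorted(answers)
        for key, answers in answers_by_input.items()
        if len(answers) > 1
    }


examples = [
    ("Reset my password", "account"),
    ("reset  my password", "billing"),
    ("Where is my order?", "shipping"),
]
print(find_conflicting_examples(examples))
```

Any input reported here has at least two different answers in the data, so it is worth reviewing before rerunning the evaluation.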