Evaluate LLM
LM-Eval provides a unified framework to test LLMs on a wide range of evaluation tasks. The service is built on EleutherAI's lm-evaluation-harness and Unitxt. The TrustyAI Operator implements it via the LMEvalJob CRD, so evaluation jobs can be created and managed on the cluster.
This document describes running an evaluation job against an LLM served as a Kubernetes InferenceService (OpenAI API–compatible).
TOC

- Prerequisites
- Run an evaluation job
- Resource status
- Getting results
- Optional: offline storage and PVC
  - Spec settings for offline mode
  - Environment variables for offline caches
  - Preparing the PVC dataset for offline runs

Prerequisites
- TrustyAI Operator installed (see Install TrustyAI).
- An LLM deployed as an InferenceService in the target namespace (e.g. vLLM or Hugging Face runtime).
- For tasks or tokenizers that must be downloaded from the internet (e.g. from Hugging Face): `allowOnline` must be enabled on the LMEvalJob, and the cluster must permit it (e.g. `permitOnline: allow` in the DataScienceCluster TrustyAI eval config). Enabling online access has security implications; see the Red Hat documentation.
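The InferenceService prerequisite can be sketched as a minimal KServe manifest. This is only an illustration: the name, model format, runtime, and storageUri are placeholders that must match your serving setup.

```yaml
# Hypothetical minimal InferenceService (KServe v1beta1); all values are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-llm
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM                 # placeholder: match your serving runtime's model format
      runtime: vllm-runtime        # placeholder runtime name
      storageUri: oci://registry.example.com/models/my-llm  # placeholder model location
```

The LMEvalJob will later reach this model through the predictor service URL of this InferenceService.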
Run an evaluation job
Create an LMEvalJob custom resource that points at the InferenceService and specifies the evaluation task(s). The operator runs the job in a pod; when the job finishes, results are written to status.results.
Example: evaluate an in-cluster LLM with the `arc_easy` task (an lm-evaluation-harness task name). The model is reached via the predictor service URL; the tokenizer is loaded from Hugging Face (requires `allowOnline: true` and cluster permission).
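A sketch of such a job follows. The model name, namespace, URL, and tokenizer ID are placeholders; the field layout follows the TrustyAI LMEvalJob CRD.

```yaml
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-arc-easy
spec:
  model: local-completions         # OpenAI-compatible /v1/completions client
  taskList:
    taskNames:
      - arc_easy
  logSamples: true
  allowOnline: true                # needed to download the tokenizer from Hugging Face
  modelArgs:
    - name: model
      value: my-llm                # usually the InferenceService name (placeholder)
    - name: base_url
      value: https://my-llm-predictor.my-ns.svc.cluster.local/v1/completions
    - name: num_concurrent
      value: "1"
    - name: max_retries
      value: "3"
    - name: tokenized_requests
      value: "True"
    - name: tokenizer
      value: org/model-id          # Hugging Face tokenizer ID (placeholder)
```

Note that `base_url` ends in `/v1/completions`, matching the `local-completions` model type; the fields are explained below.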
- Model type (`model`)
  - `local-completions` or `local-chat-completions` for an OpenAI API–compatible server (e.g. an InferenceService predictor).
  - They map to the OpenAI endpoints: `local-completions` to `/v1/completions`, `local-chat-completions` to `/v1/chat/completions`.
  - `modelArgs.base_url` must use the same path (e.g. `xxx/v1/completions` or `xxx/v1/chat/completions`).
- Model arguments (`modelArgs`)
  - `base_url`: predictor URL including the path: `/v1/completions` for `local-completions`, `/v1/chat/completions` for `local-chat-completions`.
  - `model`: usually matches the InferenceService name.
  - `tokenizer`: Hugging Face model ID used for tokenization when `tokenized_requests` is true.
  - Other parameters (e.g. `num_concurrent`, `max_retries`, `batch_size`) follow the lm-evaluation-harness documentation.
- Tasks (`taskList.taskNames`)
  - List of lm-evaluation-harness task names (e.g. `arc_easy`, `mmlu`).
  - The full set of supported tasks and wildcards is defined by lm-evaluation-harness (Task Guide / available tasks).
  - Alternatively, use `taskRecipes` with a Unitxt card/template for custom tasks.
- Online mode and code execution
  - `allowOnline`: when `true`, the job can download datasets and tokenizers from the internet (e.g. Hugging Face); requires cluster-level permission.
  - `allowCodeExecution`: when `true`, the job may run code from downloaded resources; default `false`. Enable only if required and permitted.
- Outputs and limits
  - `outputs.pvcManaged`: creates an operator-managed PVC to store job results (`size`, e.g. `100Mi`). If only `size` is set, the PVC uses the cluster's default StorageClass; if there is no default StorageClass, the PVC stays Pending and storage is not provisioned. Alternatively, use `outputs.pvcName` to bind an existing PVC.
  - `limit`: optional cap on the number of samples (e.g. `"2"` for a quick run).
  - `logSamples`: when `true`, per-prompt model inputs and outputs are saved for inspection.
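The outputs and limits above can be sketched as a spec fragment; the size and limit values are examples only.

```yaml
# Fragment of an LMEvalJob spec (illustrative values).
spec:
  outputs:
    pvcManaged:
      size: 100Mi   # operator-managed PVC; uses the cluster's default StorageClass
  limit: "2"        # quick run: cap at 2 samples per task
  logSamples: true  # keep per-prompt model inputs/outputs for inspection
```

To reuse an existing PVC instead, replace `pvcManaged` with `pvcName: <existing-pvc>`.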
Resource status
The LMEvalJob status subresource reports the job state and, when finished, the evaluation results.
- `status.state`: current state of the job: `New`, `Scheduled`, `Running`, `Complete`, `Cancelled`, or `Suspended`. Wait for `Complete` before reading results.
- `status.reason`: set when the job ends (e.g. `Succeeded`, `Failed`).
- `status.results`: when the state is `Complete`, this field contains the evaluation results as a JSON string (metrics per task/recipe).
- `status.message`: human-readable message; `status.podName` is the name of the job pod.
Automation that consumes results should key on `status.state == Complete` (and, if applicable, `status.reason == Succeeded`).
Getting results
When `status.state` is `Complete`, results are available in `status.results` (JSON string). Example:
Example result shape for the `arc_easy` task (key fields only; the full output also includes `configs`, `config`, `n-shot`, `n-samples`, and environment info):
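A minimal sketch of that status, assuming the `arc_easy` task. The metric values are zeroed placeholders, not real results; the key names follow the lm-evaluation-harness output format.

```yaml
# Illustrative LMEvalJob status (metric values are placeholders).
status:
  state: Complete
  reason: Succeeded
  results: |-
    {
      "results": {
        "arc_easy": {
          "alias": "arc_easy",
          "acc,none": 0.0,
          "acc_stderr,none": 0.0,
          "acc_norm,none": 0.0,
          "acc_norm_stderr,none": 0.0
        }
      }
    }
```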
Optional: offline storage and PVC
In offline mode the evaluation job does not access the internet; models and datasets must be read from a PVC (or from the image). Use this when the cluster disallows online access or for air-gapped environments.
Spec settings for offline mode
- Job fields
  - `allowOnline: false`: the job does not download from the internet.
  - `offline.storage.pvcName`: name of an existing PVC. The operator mounts this PVC into the job pod; the job loads models and datasets from paths under that mount.
- Paths in spec
  - Model / dataset loaders must point into the mounted PVC.
  - For Hugging Face models, configure `modelArgs` so the model path is under the PVC mount (for example `/opt/app-root/src/hf_home/<model-dir>`).
  - For `taskRecipes` or custom Unitxt cards that load from disk, set loader paths under the same mount.
Environment variables for offline caches
Set environment variables in `spec.pod.container.env` so loaders use the PVC as cache/storage. For reliability, set all of the following to the same directory under the PVC mount (for example `/opt/app-root/src/hf_home`):
- `HF_DATASETS_CACHE`: cache directory for Hugging Face `datasets`.
- `HF_HOME`: Hugging Face home, used by tokenizers and other assets.
- `TRANSFORMERS_CACHE`: cache directory for `transformers` models and tokenizers.
Example snippet for offline mode:
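The sketch below assumes a pre-populated PVC named `lmeval-data` and a model stored under the mount; the PVC name, model path, and task are placeholders.

```yaml
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-offline
spec:
  allowOnline: false               # no network access; everything comes from the PVC
  model: hf
  modelArgs:
    - name: pretrained
      value: /opt/app-root/src/hf_home/my-model   # placeholder path under the PVC mount
  taskList:
    taskNames:
      - arc_easy
  offline:
    storage:
      pvcName: lmeval-data         # placeholder: existing PVC with models/datasets
  pod:
    container:
      env:
        - name: HF_DATASETS_CACHE
          value: /opt/app-root/src/hf_home
        - name: HF_HOME
          value: /opt/app-root/src/hf_home
        - name: TRANSFORMERS_CACHE
          value: /opt/app-root/src/hf_home
```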
Use `outputs.pvcName` or `outputs.pvcManaged` only for storing evaluation results; `offline.storage.pvcName` is for inputs (models and datasets).
Preparing the PVC dataset for offline runs
In offline mode, the dataset (and tokenizer/model files if using HF) must already exist under the PVC. The job does not fetch them from the network.
One practical way to prepare the PVC is:
- Online warm-up job
  - Create an LMEvalJob with `allowOnline: true`.
  - Mount the target PVC (the one that will be used later in offline mode), for example via `offline.storage.pvcName` or an extra volume.
  - Let this job download the required datasets/tokenizers/models so that they are stored under the PVC paths used by `HF_DATASETS_CACHE`, `HF_HOME`, and `TRANSFORMERS_CACHE`, and by the configured `modelArgs` / task loaders.
- Offline evaluation job
  - Create the real evaluation job with `allowOnline: false` and `offline.storage.pvcName` pointing to the same PVC.
  - The job now reads all models and datasets from the PVC without any external network access.
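The warm-up step of this procedure can be sketched as follows; the PVC name, model ID, and task are placeholders.

```yaml
# Hypothetical online warm-up job: populates the PVC-backed caches for a later offline run.
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-warmup
spec:
  allowOnline: true                # downloads are permitted for this run only
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base   # placeholder model ID; cached into the PVC
  taskList:
    taskNames:
      - arc_easy
  offline:
    storage:
      pvcName: lmeval-data         # same PVC the later offline job will use
  pod:
    container:
      env:
        - name: HF_HOME
          value: /opt/app-root/src/hf_home
        - name: HF_DATASETS_CACHE
          value: /opt/app-root/src/hf_home
        - name: TRANSFORMERS_CACHE
          value: /opt/app-root/src/hf_home
```

The offline job then reuses the same `pvcName` with `allowOnline: false`, reading everything from the populated cache directories.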