You can extend Genkit to support custom evaluation, using either an LLM as a judge or programmatic (heuristic) evaluation.
Evaluator definition
Evaluators are functions that assess an LLM's response. There are two main approaches to automated evaluation: heuristic evaluation and LLM-based evaluation. In the heuristic approach, you define a deterministic function. By contrast, in an LLM-based assessment, the content is fed back to an LLM, and the LLM is asked to score the output according to criteria set in a prompt.
The `ai.defineEvaluator` method, which you use to define an evaluator action in Genkit, supports either approach. This document explores a couple of examples of how to use this method for heuristic and LLM-based evaluations.
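Before diving into the examples, the sketch below shows the general shape an evaluator definition takes, independent of the scoring approach. The names (`createExampleEvaluator`, `myPlugin/exampleEvaluator`) and the hard-coded score are placeholders for illustration, not part of Genkit.

```ts
import { Genkit } from 'genkit';
import { BaseEvalDataPoint } from 'genkit/evaluator';

// Minimal sketch with placeholder names; `ai` is your initialized Genkit instance.
export function createExampleEvaluator(ai: Genkit) {
  return ai.defineEvaluator(
    {
      name: 'myPlugin/exampleEvaluator',
      displayName: 'Example Evaluator',
      definition: 'Describes what this evaluator measures',
    },
    async (datapoint: BaseEvalDataPoint) => {
      // Score a single dataset entry and key the result by its test case ID.
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: { score: true, details: { reasoning: 'Placeholder reasoning' } },
      };
    }
  );
}
```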
LLM-based Evaluators
An LLM-based evaluator leverages an LLM to evaluate the `input`, `context`, and `output` of your generative AI feature.
LLM-based evaluators in Genkit are made up of 3 components:
- A prompt
- A scoring function
- An evaluator action
Define the prompt
For this example, the evaluator leverages an LLM to determine whether a food (the `output`) is delicious or not. First, provide context to the LLM, then describe what you want it to do, and finally, give it a few examples to base its response on.
Genkit’s `definePrompt` utility provides an easy way to define prompts with input and output validation. The following code is an example of setting up an evaluation prompt with `definePrompt`.
```ts
import { Genkit, z } from 'genkit';

const DELICIOUSNESS_VALUES = ['yes', 'no', 'maybe'] as const;

const DeliciousnessDetectionResponseSchema = z.object({
  reason: z.string(),
  verdict: z.enum(DELICIOUSNESS_VALUES),
});

function getDeliciousnessPrompt(ai: Genkit) {
  return ai.definePrompt({
    name: 'deliciousnessPrompt',
    input: {
      schema: z.object({
        responseToTest: z.string(),
      }),
    },
    output: {
      schema: DeliciousnessDetectionResponseSchema,
    },
    prompt: `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.

Examples:
Output: Chicken parm sandwich
Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }

Output: Boston Logan Airport tarmac
Response: { "reason": "Not edible.", "verdict": "no" }

Output: A juicy piece of gossip
Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }

New Output: {{ responseToTest }}
Response:
`,
  });
}
```
Define the scoring function
Define a function that takes an example that includes `output`, as required by the prompt, and scores the result. Genkit test cases include `input` as a required field, with `output` and `context` as optional fields. It is the responsibility of the evaluator to validate that all fields required for evaluation are present.
```ts
import { Genkit, ModelArgument, z } from 'genkit';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

/**
 * Score an individual test case for deliciousness.
 */
export async function deliciousnessScore<
  CustomModelOptions extends z.ZodTypeAny,
>(
  ai: Genkit,
  judgeLlm: ModelArgument<CustomModelOptions>,
  dataPoint: BaseEvalDataPoint,
  judgeConfig?: CustomModelOptions
): Promise<Score> {
  const d = dataPoint;
  // Validate the input has required fields
  if (!d.output) {
    throw new Error('Output is required for Deliciousness detection');
  }

  // Hydrate the prompt and generate an evaluation result
  const deliciousnessPrompt = getDeliciousnessPrompt(ai);
  const response = await deliciousnessPrompt(
    {
      responseToTest: d.output as string,
    },
    {
      model: judgeLlm,
      config: judgeConfig,
    }
  );

  // Parse the output
  const parsedResponse = response.output;
  if (!parsedResponse) {
    throw new Error(`Unable to parse evaluator response: ${response.text}`);
  }

  // Return a scored response
  return {
    score: parsedResponse.verdict,
    details: { reasoning: parsedResponse.reason },
  };
}
```
Define the evaluator action
The final step is to write a function that defines the `EvaluatorAction`.
```ts
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';

/**
 * Create the Deliciousness evaluator action.
 */
export function createDeliciousnessEvaluator<
  ModelCustomOptions extends z.ZodTypeAny,
>(
  ai: Genkit,
  judge: ModelArgument<ModelCustomOptions>,
  judgeConfig?: z.infer<ModelCustomOptions>
): EvaluatorAction {
  return ai.defineEvaluator(
    {
      name: `myCustomEvals/deliciousnessEvaluator`,
      displayName: 'Deliciousness',
      definition: 'Determines if output is considered delicious.',
      isBilled: true,
    },
    async (datapoint: BaseEvalDataPoint) => {
      const score = await deliciousnessScore(ai, judge, datapoint, judgeConfig);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}
```
The `defineEvaluator` method is similar to other Genkit constructors like `defineFlow` and `defineRetriever`. This method requires an `EvaluatorFn` to be provided as a callback. The `EvaluatorFn` callback accepts a `BaseEvalDataPoint` object, which corresponds to a single entry in a dataset under evaluation, along with an optional custom-options parameter if specified. The function processes the datapoint and returns an `EvalResponse` object.
The Zod schemas for `BaseEvalDataPoint` and `EvalResponse` are as follows.
BaseEvalDataPoint and EvalResponse
```ts
export const BaseEvalDataPoint = z.object({
  testCaseId: z.string(),
  input: z.unknown(),
  output: z.unknown().optional(),
  context: z.array(z.unknown()).optional(),
  reference: z.unknown().optional(),
  traceIds: z.array(z.string()).optional(),
});

export const EvalResponse = z.object({
  sampleIndex: z.number().optional(),
  testCaseId: z.string(),
  traceId: z.string().optional(),
  spanId: z.string().optional(),
  evaluation: z.union([ScoreSchema, z.array(ScoreSchema)]),
});
```
ScoreSchema
```ts
const ScoreSchema = z.object({
  id: z
    .string()
    .describe('Optional ID to differentiate multiple scores')
    .optional(),
  score: z.union([z.number(), z.string(), z.boolean()]).optional(),
  error: z.string().optional(),
  details: z
    .object({
      reasoning: z.string().optional(),
    })
    .passthrough()
    .optional(),
});
```
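For orientation, here is an illustrative (not normative) example of a value that conforms to these schemas; the test case ID and score values are made up for this sketch.

```ts
// Illustrative only: one EvalResponse carrying a single Score.
const exampleEvalResponse = {
  testCaseId: 'delicious_mango',
  evaluation: {
    score: 'yes', // ScoreSchema allows a number, string, or boolean
    details: { reasoning: 'A ripe mango is widely considered delicious.' },
  },
};
```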
The options object passed to `defineEvaluator` lets the user provide a name, a user-readable display name, and a definition for the evaluator. The display name and definition are displayed along with evaluation results in the Dev UI. It also has an optional `isBilled` field that marks whether this evaluator can result in billing (for example, if it uses a billed LLM or API). If an evaluator is billed, the user is prompted for confirmation in the CLI before they can run an evaluation. This step helps guard against unintended expenses.
Heuristic Evaluators
A heuristic evaluator can be any function used to evaluate the `input`, `context`, or `output` of your generative AI feature.
Heuristic evaluators in Genkit are made up of 2 components:
- A scoring function
- An evaluator action
Define the scoring function
As with the LLM-based evaluator, define the scoring function. In this case, the scoring function does not need a judge LLM.
```ts
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

const US_PHONE_REGEX =
  /[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}/i;

/**
 * Scores whether a datapoint output contains a US phone number.
 */
export async function usPhoneRegexScore(
  dataPoint: BaseEvalDataPoint
): Promise<Score> {
  const d = dataPoint;
  if (!d.output || typeof d.output !== 'string') {
    throw new Error('String output is required for regex matching');
  }
  const matches = US_PHONE_REGEX.test(d.output as string);
  const reasoning = matches
    ? `Output matched US_PHONE_REGEX`
    : `Output did not match US_PHONE_REGEX`;
  return {
    score: matches,
    details: { reasoning },
  };
}
```
Define the evaluator action
```ts
import { Genkit } from 'genkit';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';

/**
 * Configures a regex evaluator to match a US phone number.
 */
export function createUSPhoneRegexEvaluator(ai: Genkit): EvaluatorAction {
  return ai.defineEvaluator(
    {
      name: `myCustomEvals/usPhoneRegexEvaluator`,
      displayName: 'Regex Match for US Phone Number',
      definition: 'Uses regex to check if output matches a US phone number',
      isBilled: false,
    },
    async (datapoint: BaseEvalDataPoint) => {
      const score = await usPhoneRegexScore(datapoint);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}
```
Putting it together
Plugin definition
Plugins are registered with the framework by installing them at the time of initializing Genkit. To define a new plugin, use the `genkitPlugin` helper method to instantiate all Genkit actions within the plugin context.
This code sample shows two evaluators: the LLM-based deliciousness evaluator, and the regex-based US phone number evaluator. Instantiating these evaluators within the plugin context registers them with the plugin.
```ts
import { Genkit, ModelArgument, z } from 'genkit';
import { GenkitPlugin, genkitPlugin } from 'genkit/plugin';

export function myCustomEvals<ModelCustomOptions extends z.ZodTypeAny>(options: {
  judge: ModelArgument<ModelCustomOptions>;
  judgeConfig?: ModelCustomOptions;
}): GenkitPlugin {
  // Define the new plugin
  return genkitPlugin('myCustomEvals', async (ai: Genkit) => {
    const { judge, judgeConfig } = options;

    // The plugin instantiates our custom evaluators within the context
    // of the `ai` object, making them available
    // throughout our Genkit application.
    createDeliciousnessEvaluator(ai, judge, judgeConfig);
    createUSPhoneRegexEvaluator(ai);
  });
}

export default myCustomEvals;
```
Configure Genkit
Add the `myCustomEvals` plugin to your Genkit configuration.
For evaluation with Gemini, disable safety settings so that the evaluator can accept, detect, and score potentially harmful content.
```ts
import { gemini15Pro } from '@genkit-ai/googleai';

const ai = genkit({
  plugins: [
    vertexAI(),
    ...myCustomEvals({
      judge: gemini15Pro,
    }),
  ],
  // ...
});
```
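To apply the note above about safety settings, you can pass them to the judge model through `judgeConfig`. Treat the following as a sketch: the exact config shape accepted by the judge model (here, Gemini-style `safetySettings` with harm categories) is defined by the model plugin you use, so verify it against that plugin's documentation.

```ts
const ai = genkit({
  plugins: [
    vertexAI(),
    ...myCustomEvals({
      judge: gemini15Pro,
      // Assumed shape: Gemini-style safety settings accepted by the judge model's config.
      judgeConfig: {
        safetySettings: [
          { category: 'HARM_CATEGORY_HATE_SPEECH', threshold: 'BLOCK_NONE' },
          { category: 'HARM_CATEGORY_DANGEROUS_CONTENT', threshold: 'BLOCK_NONE' },
          { category: 'HARM_CATEGORY_HARASSMENT', threshold: 'BLOCK_NONE' },
          { category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT', threshold: 'BLOCK_NONE' },
        ],
      },
    }),
  ],
});
```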
Using your custom evaluators
Once you instantiate your custom evaluators within the Genkit app context (either through a plugin or directly), they are ready to be used. The following example illustrates how to try out the deliciousness evaluator with a few sample inputs and outputs.
1. Create a JSON file `deliciousness_dataset.json` with the following content:
[{"testCaseId":"delicous_mango","input":"What is a super delicious fruit","output":"A perfectly ripe mango – sweet, juicy, and with a hint of tropical sunshine."},{"testCaseId":"disgusting_soggy_cereal","input":"What is something that is tasty when fresh but less tasty after some time?","output":"Stale, flavorless cereal that's been sitting in the box too long."}]
2. Use the Genkit CLI to run the evaluator against these test cases.
```bash
# Start your genkit runtime
genkit start -- <command to start your app>

genkit eval:run deliciousness_dataset.json --evaluators=myCustomEvals/deliciousnessEvaluator
```
3. Navigate to `localhost:4000/evaluate` to view your results in the Genkit UI.
Confidence in custom evaluators increases as you benchmark them against standard datasets or approaches. Iterate on the results of such benchmarks to improve your evaluators' performance until it reaches the targeted level of quality.