feat: add Runner, RunnerResult, Judge, and Evaluator #180
Open
mattrmc1 wants to merge 25 commits into main from mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals
25 commits
7c4dbde [AIC-2664] Impl trackers (first pass) (mattrmc1)
a0c8784 fix: default tracker version to 1 and remove version clamp from token… (mattrmc1)
1a7e1f6 feat: add Runner, RunnerResult, Judge, and Evaluator (mattrmc1)
bed4ca2 guard against null AIMetrics (mattrmc1)
2b47c86 fix: guard against blank metricKey and infinite/invalid score (mattrmc1)
4ef3de2 fix: MAX_TOKEN_BYTES -> MAX_TOKEN_LENGTH (mattrmc1)
1be0a1e fix: guard against empty runId and configKey (mattrmc1)
8e81ea0 fix: Add warning comment to createTracker public call (mattrmc1)
e81e2f5 fix: use trim + isEmpty to support java 8 (mattrmc1)
c21fdd7 fix: stop trackMetricsOf clock before running metrics extractor (mattrmc1)
4c96dca fix: record operation duration when trackMetricsOf extractor throws (mattrmc1)
4da5478 fix: downgrade null-arg track logs from warn to debug per spec (mattrmc1)
a94b2bf Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of githu… (mattrmc1)
394a044 fix: remove unnecessary NoOpAIConfigTracker (mattrmc1)
6c80aed Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of githu… (mattrmc1)
5381bf4 fix: remove resumption-token length cap (mattrmc1)
1355033 Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of githu… (mattrmc1)
add48f9 fix: guard against NaN scores (mattrmc1)
1bd6777 fix: defensively copy judges map in Evaluator constructor (mattrmc1)
9a8143e fix: use Java 8-compatible map/list construction in Judge (mattrmc1)
3aa5d08 fix: Add security note to LDAIConfigTracker.getResumptionToken() (mattrmc1)
121b140 fix: Add security note to MetricSummary.getResumptionToken() (mattrmc1)
faa4981 Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of githu… (mattrmc1)
f42de0b fix: remove reasoning from Judge schema required fields (mattrmc1)
59835e3 Merge branch 'main' of github.com:launchdarkly/java-core into mmccart… (mattrmc1)
94 changes: 94 additions & 0 deletions
lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Evaluator.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,94 @@ | ||
| package com.launchdarkly.sdk.server.ai; | ||
|
|
||
| import com.launchdarkly.logging.LDLogger; | ||
| import com.launchdarkly.sdk.server.ai.datamodel.LDAIConfigTypes.JudgeConfiguration; | ||
| import com.launchdarkly.sdk.server.ai.datamodel.LDAITrackingTypes.JudgeResult; | ||
|
|
||
| import java.util.ArrayList; | ||
| import java.util.Collections; | ||
| import java.util.HashMap; | ||
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.Objects; | ||
| import java.util.concurrent.CompletableFuture; | ||
|
|
||
| /** | ||
| * Coordinates evaluation of an AI Config output by running a set of {@link Judge} instances. | ||
| * <p> | ||
| * An {@code Evaluator} is attached to an {@link AICompletionConfig} or {@link AIAgentConfig} and | ||
| * invoked by managed AI types (plan 4). In v1.0, the evaluator returned by the config retrieval | ||
| * methods is always a noop that returns an empty list immediately. | ||
| * <p> | ||
| * Instances are immutable and thread-safe. | ||
| */ | ||
| public final class Evaluator { | ||
| private static final Evaluator NOOP = new Evaluator(); | ||
|
|
||
| private final Map<String, Judge> judges; | ||
| private final JudgeConfiguration judgeConfiguration; | ||
| private final LDLogger logger; | ||
| private final boolean isNoop; | ||
|
|
||
| private Evaluator() { | ||
| this.judges = Collections.emptyMap(); | ||
| this.judgeConfiguration = null; | ||
| this.logger = null; | ||
| this.isNoop = true; | ||
| } | ||
|
|
||
| /** | ||
| * Constructs an evaluator with the given judges and configuration. | ||
| * | ||
| * @param judges a map from judge config key to {@link Judge} instance; must not be {@code null} | ||
| * @param judgeConfiguration the judge configuration listing which judges to run and their sampling | ||
| * rates; must not be {@code null} | ||
| * @param logger the logger; must not be {@code null} | ||
| */ | ||
| public Evaluator(Map<String, Judge> judges, JudgeConfiguration judgeConfiguration, LDLogger logger) { | ||
| this.judges = Collections.unmodifiableMap(new HashMap<>(Objects.requireNonNull(judges, "judges"))); | ||
| this.judgeConfiguration = Objects.requireNonNull(judgeConfiguration, "judgeConfiguration"); | ||
| this.logger = Objects.requireNonNull(logger, "logger"); | ||
| this.isNoop = false; | ||
| } | ||
|
|
||
| /** | ||
| * Returns the shared noop evaluator, which immediately returns an empty result list without | ||
| * logging any warnings. | ||
| * | ||
| * @return the noop singleton, never {@code null} | ||
| */ | ||
| public static Evaluator noop() { | ||
| return NOOP; | ||
| } | ||
|
|
||
| /** | ||
| * Runs all configured judges against the given input/output pair and returns their results. | ||
| * <p> | ||
| * When this is the noop evaluator, returns a completed future holding an empty list immediately. | ||
| * Otherwise, judges are run sequentially in the order specified by the {@link JudgeConfiguration}. | ||
| * Judges referenced in the configuration but absent from the judges map are skipped with a | ||
| * warning; this is not an error. | ||
| * <p> | ||
| * This method does NOT call {@code trackJudgeResult} — that is the caller's responsibility. | ||
| * | ||
| * @param input the message history or prompt that was sent to the model | ||
| * @param output the model's response to evaluate | ||
| * @return a completed future holding the list of judge results; never {@code null} | ||
| */ | ||
| public CompletableFuture<List<JudgeResult>> evaluate(String input, String output) { | ||
| if (isNoop) { | ||
| return CompletableFuture.completedFuture(Collections.emptyList()); | ||
| } | ||
|
|
||
| List<JudgeResult> results = new ArrayList<>(); | ||
| for (JudgeConfiguration.Judge entry : judgeConfiguration.getJudges()) { | ||
| Judge judge = judges.get(entry.getKey()); | ||
| if (judge == null) { | ||
| logger.warn("Evaluator: no judge found for key '{}', skipping", entry.getKey()); | ||
| continue; | ||
| } | ||
| results.add(judge.evaluate(input, output, entry.getSamplingRate())); | ||
| } | ||
| return CompletableFuture.completedFuture(results); | ||
| } | ||
| } | ||
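
The lookup-and-skip loop in `evaluate`, together with the defensive copy made in the constructor, can be illustrated with a small self-contained sketch. Everything named below (`EvaluatorSketch`, `SketchJudge`, the `warnings` list, the string results) is a hypothetical stand-in for illustration and is not part of the SDK; the real class works with `Judge`, `JudgeConfiguration`, and `LDLogger`, and unlike this sketch it is documented as immutable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not the SDK's types.
final class EvaluatorSketch {
  /** Stand-in for a judge: maps an (input, output) pair to a result string. */
  interface SketchJudge {
    String evaluate(String input, String output);
  }

  private final Map<String, SketchJudge> judges;
  private final List<String> configuredKeys;
  // Assumption: warnings are collected in a list here instead of being sent to a logger.
  final List<String> warnings = new ArrayList<>();

  EvaluatorSketch(Map<String, SketchJudge> judges, List<String> configuredKeys) {
    // Defensive copies: later changes to the caller's collections do not reach this instance.
    this.judges = Collections.unmodifiableMap(new HashMap<>(judges));
    this.configuredKeys = Collections.unmodifiableList(new ArrayList<>(configuredKeys));
  }

  /** Runs judges in configuration order; keys with no registered judge are skipped. */
  List<String> evaluate(String input, String output) {
    List<String> results = new ArrayList<>();
    for (String key : configuredKeys) {
      SketchJudge judge = judges.get(key);
      if (judge == null) {
        warnings.add("no judge found for key '" + key + "', skipping");
        continue;
      }
      results.add(judge.evaluate(input, output));
    }
    return results;
  }

  public static void main(String[] args) {
    Map<String, SketchJudge> judges = new HashMap<>();
    judges.put("relevance", (in, out) -> "relevance:" + out.length());

    EvaluatorSketch evaluator =
        new EvaluatorSketch(judges, Arrays.asList("relevance", "toxicity"));
    judges.clear(); // has no effect on the evaluator's private copy

    System.out.println(evaluator.evaluate("prompt", "answer"));
    System.out.println(evaluator.warnings);
  }
}
```

In this sketch the `toxicity` key is configured but has no registered judge, so it contributes a warning and no result, while `relevance` still runs even though the caller cleared its own map afterwards.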
210 changes: 210 additions & 0 deletions
lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Judge.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,210 @@ | ||
| package com.launchdarkly.sdk.server.ai; | ||
|
|
||
| import com.launchdarkly.logging.LDLogger; | ||
| import com.launchdarkly.sdk.server.ai.datamodel.LDAIConfigTypes.Message; | ||
| import com.launchdarkly.sdk.server.ai.datamodel.LDAITrackingTypes.JudgeResult; | ||
|
|
||
| import java.util.Collections; | ||
| import java.util.HashMap; | ||
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.Objects; | ||
| import java.util.concurrent.ThreadLocalRandom; | ||
| import java.util.stream.Collectors; | ||
|
|
||
| /** | ||
| * Evaluates an AI model output against a judge prompt, returning a scored {@link JudgeResult}. | ||
| * <p> | ||
| * A {@code Judge} wraps an {@link AIJudgeConfig} and a {@link Runner}. Each call to | ||
| * {@link #evaluate} or {@link #evaluateMessages} invokes the runner with a formatted evaluation | ||
| * prompt and parses the structured {@code {score, reasoning}} response. Evaluation can be sampled | ||
| * to reduce cost: pass a {@code samplingRate} of {@code 0.0} to always skip, or {@code 1.0} to | ||
| * always run. | ||
| * <p> | ||
| * Instances are immutable and thread-safe. | ||
| */ | ||
| public final class Judge { | ||
| /** | ||
| * JSON-Schema fragment sent to the runner as the {@code outputType}, requesting structured | ||
| * {@code {score, reasoning}} output. | ||
| */ | ||
| private static final Map<String, Object> EVALUATION_SCHEMA; | ||
| static { | ||
| Map<String, Object> scoreSchema = new HashMap<>(); | ||
| scoreSchema.put("type", "number"); | ||
|
|
||
| Map<String, Object> reasoningSchema = new HashMap<>(); | ||
| reasoningSchema.put("type", "string"); | ||
|
|
||
| Map<String, Object> properties = new HashMap<>(); | ||
| properties.put("score", Collections.unmodifiableMap(scoreSchema)); | ||
| properties.put("reasoning", Collections.unmodifiableMap(reasoningSchema)); | ||
|
|
||
| Map<String, Object> schema = new HashMap<>(); | ||
| schema.put("type", "object"); | ||
| schema.put("properties", Collections.unmodifiableMap(properties)); | ||
| schema.put("required", Collections.singletonList("score")); | ||
|
|
||
| EVALUATION_SCHEMA = Collections.unmodifiableMap(schema); | ||
| } | ||
|
|
||
| private final AIJudgeConfig config; | ||
| private final Runner runner; | ||
| private final LDLogger logger; | ||
|
|
||
| /** | ||
| * Constructs a judge. | ||
| * | ||
| * @param config the judge AI Config; must not be {@code null} | ||
| * @param runner the runner to invoke; must not be {@code null} | ||
| * @param logger the logger; must not be {@code null} | ||
| */ | ||
| public Judge(AIJudgeConfig config, Runner runner, LDLogger logger) { | ||
| this.config = Objects.requireNonNull(config, "config"); | ||
| this.runner = Objects.requireNonNull(runner, "runner"); | ||
| this.logger = Objects.requireNonNull(logger, "logger"); | ||
| } | ||
|
|
||
| /** | ||
| * Evaluates the given input/output pair, always running (sampling rate {@code 1.0}). | ||
| * | ||
| * @param input the message history or prompt that was sent to the model | ||
| * @param output the model's response to evaluate | ||
| * @return the evaluation result; never {@code null} | ||
| */ | ||
| public JudgeResult evaluate(String input, String output) { | ||
| return evaluate(input, output, 1.0); | ||
| } | ||
|
|
||
| /** | ||
| * Evaluates the given input/output pair, subject to the given sampling rate. | ||
| * | ||
| * @param input the message history or prompt that was sent to the model | ||
| * @param output the model's response to evaluate | ||
| * @param samplingRate the fraction of evaluations to actually run; {@code 0.0} always skips, | ||
| * {@code 1.0} always runs | ||
| * @return the evaluation result; never {@code null} | ||
| */ | ||
| public JudgeResult evaluate(String input, String output, double samplingRate) { | ||
| if (ThreadLocalRandom.current().nextDouble() >= samplingRate) { | ||
| return JudgeResult.builder() | ||
| .sampled(false) | ||
| .success(false) | ||
| .judgeConfigKey(config.getKey()) | ||
| .metricKey(config.getEvaluationMetricKey()) | ||
| .build(); | ||
| } | ||
|
|
||
| String formatted = "MESSAGE HISTORY:\n" + input + "\n\nRESPONSE TO EVALUATE:\n" + output; | ||
| LDAIConfigTracker tracker = config.createTracker(); | ||
|
|
||
| RunnerResult result; | ||
| try { | ||
| result = tracker.trackMetricsOf(RunnerResult::getMetrics, () -> runner.run(formatted, EVALUATION_SCHEMA)); | ||
| } catch (Exception ex) { | ||
| return JudgeResult.builder() | ||
| .sampled(true) | ||
| .success(false) | ||
| .judgeConfigKey(config.getKey()) | ||
| .metricKey(config.getEvaluationMetricKey()) | ||
| .errorMessage(ex.getMessage()) | ||
| .build(); | ||
| } | ||
|
|
||
| Map<String, Object> parsed = result.getParsed(); | ||
| if (parsed == null) { | ||
| logger.warn("Judge {}: runner returned null parsed output", config.getKey()); | ||
| return JudgeResult.builder() | ||
| .sampled(true) | ||
| .success(false) | ||
| .judgeConfigKey(config.getKey()) | ||
| .metricKey(config.getEvaluationMetricKey()) | ||
| .build(); | ||
| } | ||
|
|
||
| Object scoreRaw = parsed.get("score"); | ||
| if (!(scoreRaw instanceof Number)) { | ||
| logger.warn("Judge {}: parsed output missing numeric score", config.getKey()); | ||
| return JudgeResult.builder() | ||
| .sampled(true) | ||
| .success(false) | ||
| .judgeConfigKey(config.getKey()) | ||
| .metricKey(config.getEvaluationMetricKey()) | ||
| .build(); | ||
| } | ||
| double score = ((Number) scoreRaw).doubleValue(); | ||
| if (!Double.isFinite(score) || score < 0.0 || score > 1.0) { | ||
| logger.warn("Judge {}: score {} is outside [0.0, 1.0]", config.getKey(), score); | ||
| return JudgeResult.builder() | ||
| .sampled(true) | ||
| .success(false) | ||
| .judgeConfigKey(config.getKey()) | ||
| .metricKey(config.getEvaluationMetricKey()) | ||
| .build(); | ||
| } | ||
|
|
||
| JudgeResult.Builder resultBuilder = JudgeResult.builder() | ||
| .sampled(true) | ||
| .success(true) | ||
| .judgeConfigKey(config.getKey()) | ||
| .metricKey(config.getEvaluationMetricKey()) | ||
| .score(score); | ||
|
|
||
| Object reasoningRaw = parsed.get("reasoning"); | ||
| if (reasoningRaw instanceof String) { | ||
| resultBuilder.reasoning((String) reasoningRaw); | ||
| } else if (reasoningRaw != null) { | ||
| logger.warn("Judge {}: reasoning is not a string, ignoring", config.getKey()); | ||
| } | ||
|
|
||
| return resultBuilder.build(); | ||
| } | ||
|
|
||
| /** | ||
| * Evaluates a message list and runner response, always running (sampling rate {@code 1.0}). | ||
| * <p> | ||
| * Messages are formatted as {@code role: content} lines, joined by newlines. | ||
| * | ||
| * @param messages the messages that were sent to the model | ||
| * @param response the runner result whose {@link RunnerResult#getContent() content} is evaluated | ||
| * @return the evaluation result; never {@code null} | ||
| */ | ||
| public JudgeResult evaluateMessages(List<Message> messages, RunnerResult response) { | ||
| return evaluateMessages(messages, response, 1.0); | ||
| } | ||
|
|
||
| /** | ||
| * Evaluates a message list and runner response, subject to the given sampling rate. | ||
| * <p> | ||
| * Messages are formatted as {@code role: content} lines, joined by newlines. | ||
| * | ||
| * @param messages the messages that were sent to the model | ||
| * @param response the runner result whose {@link RunnerResult#getContent() content} is evaluated | ||
| * @param samplingRate the fraction of evaluations to actually run; {@code 0.0} always skips, | ||
| * {@code 1.0} always runs | ||
| * @return the evaluation result; never {@code null} | ||
| */ | ||
| public JudgeResult evaluateMessages(List<Message> messages, RunnerResult response, double samplingRate) { | ||
| String formattedMessages = messages == null ? "" : messages.stream() | ||
| .map(m -> m.getRole().getWireValue() + ": " + m.getContent()) | ||
| .collect(Collectors.joining("\n")); | ||
| return evaluate(formattedMessages, response == null ? "" : response.getContent(), samplingRate); | ||
| } | ||
|
|
||
| /** | ||
| * Returns the judge AI Config this instance was constructed with. | ||
| * | ||
| * @return the judge config, never {@code null} | ||
| */ | ||
| public AIJudgeConfig getConfig() { | ||
| return config; | ||
| } | ||
|
|
||
| /** | ||
| * Returns the runner this instance was constructed with. | ||
| * | ||
| * @return the runner, never {@code null} | ||
| */ | ||
| public Runner getRunner() { | ||
| return runner; | ||
| } | ||
| } | ||
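
Three small checks carry most of the logic in `Judge`: the sampling decision, the score validation, and the `role: content` message formatting. The sketch below isolates them so they can be exercised without a runner or tracker. `JudgeChecksSketch` and its method names are hypothetical, and messages are modelled as two-element string arrays rather than the SDK's `Message` type, which is an assumption made purely for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;

// Illustrative sketch only; JudgeChecksSketch and its methods are hypothetical names.
final class JudgeChecksSketch {
  private JudgeChecksSketch() {}

  /**
   * Mirrors the sampling test in the diff: the evaluation is skipped when a draw from
   * [0.0, 1.0) is greater than or equal to the rate. A rate of 0.0 therefore never runs
   * and a rate of 1.0 always runs. A NaN rate makes the comparison false, so it runs.
   */
  static boolean shouldRun(double samplingRate) {
    return !(ThreadLocalRandom.current().nextDouble() >= samplingRate);
  }

  /** Mirrors the score check in the diff: a Number that is finite and within [0.0, 1.0]. */
  static boolean isValidScore(Object raw) {
    if (!(raw instanceof Number)) {
      return false;
    }
    double score = ((Number) raw).doubleValue();
    return Double.isFinite(score) && score >= 0.0 && score <= 1.0;
  }

  /** Formats {role, content} pairs as "role: content" lines joined by newlines. */
  static String formatMessages(List<String[]> messages) {
    if (messages == null) {
      return "";
    }
    return messages.stream()
        .map(m -> m[0] + ": " + m[1])
        .collect(Collectors.joining("\n"));
  }

  public static void main(String[] args) {
    System.out.println("rate 0.0 runs: " + shouldRun(0.0));
    System.out.println("rate 1.0 runs: " + shouldRun(1.0));
    System.out.println("0.5 valid: " + isValidScore(0.5));
    System.out.println("NaN valid: " + isValidScore(Double.NaN));
    System.out.println(formatMessages(Arrays.asList(
        new String[] {"system", "Be brief."},
        new String[] {"user", "Hi"})));
  }
}
```

Because the comparison is written as a skip condition, a `NaN` sampling rate falls through to the run path; whether that is the intended behaviour is a question for the PR author rather than something this sketch settles.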