feat: add Runner, RunnerResult, Judge, and Evaluator by mattrmc1 · Pull Request #180 · launchdarkly/java-core

mattrmc1 · 2026-06-23T20:27:16Z

Summary

Adds the AIEVALS types and wires up the tracker factory. Callers can now implement a Runner to wrap any model provider, construct a Judge to evaluate AI outputs against a judge prompt with structured {score, reasoning} output, and coordinate multiple judges through an Evaluator. Config retrieval methods now produce real trackers (instead of no-ops) and expose the evaluator on every config type.

Four new public types and two wiring changes on LDAIClientImpl:

`Runner`

public interface Runner {
  RunnerResult run(String input, Map<String, Object> outputType) throws Exception;
  // Convenience overload: no structured output requested.
  default RunnerResult run(String input) throws Exception {
    return run(input, null);
  }
}

Wraps a model provider SDK. The outputType parameter carries a JSON-Schema-like map when structured output is needed (e.g., a judge requesting {score, reasoning}). The single-arg overload delegates with outputType = null. Implementations should be thread-safe.

`RunnerResult`

public static Builder builder(String content, AIMetrics metrics);

Immutable result of a Runner invocation. Holds content (response text), metrics (AIMetrics from the tracker layer), optional raw (unmodified provider response), and optional parsed (structured output as Map<String, Object>). The parsed map is defensively copied and returned as unmodifiable.

`Judge`

public Judge(AIJudgeConfig config, Runner runner, LDLogger logger);

public JudgeResult evaluate(String input, String output);
public JudgeResult evaluate(String input, String output, double samplingRate);
public JudgeResult evaluateMessages(List<Message> messages, RunnerResult response);
public JudgeResult evaluateMessages(List<Message> messages, RunnerResult response, double samplingRate);

Evaluates an AI model output by invoking a runner with a formatted evaluation prompt and parsing the structured response. The sampling gate runs first: a call that is not sampled returns sampled=false immediately, without invoking the runner. Otherwise the judge formats the input/output, creates a fresh tracker via config.createTracker(), and delegates to tracker.trackMetricsOf with EVALUATION_SCHEMA. It then parses score (Number, must be in [0.0, 1.0]) and reasoning (String, optional) from the structured response.
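A minimal sketch of a sampling gate, assuming the decision is a uniform random draw compared against samplingRate. The PR text does not show how the draw is made, so `SamplingGate` and `shouldSample` are hypothetical names, not SDK API. The sketch is consistent with the two behaviours named in the test plan: rate 0 always skips and rate 1 always runs.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;

/** Hypothetical helper for illustration; not part of the SDK. */
final class SamplingGate {
  private final DoubleSupplier random; // expected to yield values in [0.0, 1.0)

  SamplingGate() {
    this(() -> ThreadLocalRandom.current().nextDouble());
  }

  /** Allows a deterministic source to be injected in tests. */
  SamplingGate(DoubleSupplier random) {
    this.random = random;
  }

  /** Returns true when the evaluation should run for this call. */
  boolean shouldSample(double samplingRate) {
    // A draw in [0, 1) is never below 0.0 and always below 1.0,
    // so rate 0 always skips and rate 1 always runs.
    return random.getAsDouble() < samplingRate;
  }
}
```

Injecting the random source keeps the gate testable without relying on statistical assertions.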

Runner exceptions are caught and returned as JudgeResult(success=false, errorMessage=...) — judge failures are results, not exceptions. The judge does not call trackJudgeResult; that is the caller's responsibility.
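The "failures are results" convention can be sketched as follows. `SketchResult` is a simplified stand-in for JudgeResult (the real type carries more fields such as judgeConfigKey and metricKey), and `SafeEvaluation` is a hypothetical helper, so treat this as an illustration of the pattern rather than the PR's code.

```java
import java.util.concurrent.Callable;

/** Simplified stand-in for JudgeResult; illustration only. */
final class SketchResult {
  final boolean success;
  final Double score;        // null on failure
  final String errorMessage; // null on success

  private SketchResult(boolean success, Double score, String errorMessage) {
    this.success = success;
    this.score = score;
    this.errorMessage = errorMessage;
  }

  static SketchResult ok(double score) {
    return new SketchResult(true, score, null);
  }

  static SketchResult failed(String message) {
    return new SketchResult(false, null, message);
  }
}

/** Hypothetical helper showing exceptions converted into results. */
final class SafeEvaluation {
  /** Runs the scoring call; any exception becomes a failed result instead of propagating. */
  static SketchResult evaluate(Callable<Double> scoringCall) {
    try {
      return SketchResult.ok(scoringCall.call());
    } catch (Exception e) {
      return SketchResult.failed(String.valueOf(e.getMessage()));
    }
  }
}
```

With this shape, callers branch on `success` rather than wrapping every evaluation in their own try/catch.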

evaluateMessages formats a List<Message> as role: content lines joined by newlines, then delegates to evaluate.
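The message formatting described above can be sketched like this. `ChatMessage` is a simplified stand-in for the SDK's Message type (whose role may well be an enum rather than a String), and `TranscriptFormatter` is a hypothetical name.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Simplified stand-in for the SDK's Message type; illustration only. */
final class ChatMessage {
  final String role;
  final String content;

  ChatMessage(String role, String content) {
    this.role = role;
    this.content = content;
  }
}

/** Hypothetical helper; not part of the SDK. */
final class TranscriptFormatter {
  /** Renders each message as "role: content" and joins the lines with a newline. */
  static String format(List<ChatMessage> messages) {
    return messages.stream()
        .map(m -> m.role + ": " + m.content)
        .collect(Collectors.joining("\n"));
  }
}
```

For a user message "Hi" followed by an assistant message "Hello", this yields two lines: `user: Hi` and `assistant: Hello`.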

`Evaluator`

public static Evaluator noop();
public Evaluator(Map<String, Judge> judges, JudgeConfiguration judgeConfiguration, LDLogger logger);

public CompletableFuture<List<JudgeResult>> evaluate(String input, String output);

Coordinates execution of a set of judges against a single input/output pair. Judges run sequentially in the order specified by the JudgeConfiguration. Missing judges are skipped with a warning. The noop singleton returns a completed future holding an empty list immediately.
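A rough sketch of the coordination logic, under simplifying assumptions: each judge is modelled as a `BiFunction` returning a score, the run order is a plain list of keys standing in for JudgeConfiguration, and warnings are collected in a list instead of going to an LDLogger. `SequentialEvaluatorSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

/** Illustration only; the real Evaluator works with Judge and JudgeResult. */
final class SequentialEvaluatorSketch {
  private final Map<String, BiFunction<String, String, Double>> judges;
  private final List<String> order;
  final List<String> warnings = new ArrayList<>(); // stands in for logger.warn(...)

  SequentialEvaluatorSketch(
      Map<String, BiFunction<String, String, Double>> judges, List<String> order) {
    this.judges = new LinkedHashMap<>(judges);
    this.order = new ArrayList<>(order);
  }

  /** Runs judges one after another in the configured order and returns a completed future. */
  CompletableFuture<List<Double>> evaluate(String input, String output) {
    List<Double> results = new ArrayList<>();
    for (String key : order) {
      BiFunction<String, String, Double> judge = judges.get(key);
      if (judge == null) {
        warnings.add("judge not found, skipping: " + key);
        continue;
      }
      results.add(judge.apply(input, output));
    }
    // All work happened synchronously above, so the future is already done.
    return CompletableFuture.completedFuture(results);
  }
}
```

Returning an already-completed future keeps the public signature asynchronous while the current behaviour stays sequential, which matches the returnedFutureIsAlreadyDone test in the test plan.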

The evaluator does not call trackJudgeResult — that is the managed type's responsibility (Plan 4).

For v1.0, all configs receive Evaluator.noop() since there is no RunnerFactory to auto-create runners for judge configs.

Tracker factory wiring

LDAIClientImpl now creates real LDAIConfigTrackerImpl instances instead of no-ops. A private trackerFactory method captures the config identity (configKey, variationKey, version, modelName, providerName, context) and returns a Supplier<LDAIConfigTracker> that produces a fresh tracker with a new runId on each call.
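The factory pattern described here can be sketched as a closure over the config identity. `SketchTracker` is a simplified stand-in that keeps only two identity fields, and generating runId with `UUID.randomUUID()` is an assumption; the PR text only says that each call gets a new runId.

```java
import java.util.UUID;
import java.util.function.Supplier;

/** Simplified stand-in for a tracker; the real one carries more identity fields. */
final class SketchTracker {
  final String configKey;
  final String variationKey; // may be null, e.g. for default configs
  final String runId;

  SketchTracker(String configKey, String variationKey, String runId) {
    this.configKey = configKey;
    this.variationKey = variationKey;
    this.runId = runId;
  }
}

/** Hypothetical helper; not the SDK's trackerFactory. */
final class TrackerFactorySketch {
  /** Captures the identity once and mints a tracker with a fresh runId on every get(). */
  static Supplier<SketchTracker> trackerFactory(String configKey, String variationKey) {
    return () -> new SketchTracker(configKey, variationKey, UUID.randomUUID().toString());
  }
}
```

Handing out a Supplier rather than a single tracker lets each invocation be tracked under its own run while sharing the same config identity.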

Default configs also receive real trackers: the requested configKey is still recorded even though no flag was found, and variationKey is null.

`createTracker` on `LDAIClient`

LDAIConfigTracker createTracker(String resumptionToken, LDContext context);

Reconstructs a tracker from a resumption token for multi-turn or streaming scenarios that span multiple requests. Delegates to LDAIConfigTrackerImpl.fromResumptionToken.

Tracker hardening

trackMetricsOf now stops the clock before running the metrics extractor, so a slow extractor doesn't inflate the duration. If the extractor throws, operation duration is still recorded before the exception propagates.
Null-argument track calls (trackDuration(null), trackFeedback(null), etc.) are now logged at debug rather than warn, per spec.
trackJudgeResult now guards against blank metricKey and infinite/NaN score.
Resumption token decode validates non-empty runId and configKey.
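The timing change in the first item above can be sketched as follows. `TimedCallSketch.timed` is a hypothetical helper with a simplified signature, not the SDK's trackMetricsOf. How a failure of the operation itself is handled is not described in the PR text, so the sketch leaves that case untracked.

```java
import java.util.function.Consumer;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Illustration only; names and signature are assumptions. */
final class TimedCallSketch {
  static <T> T timed(
      Supplier<T> operation, Consumer<T> metricsExtractor, LongConsumer recordDurationMs) {
    long start = System.nanoTime();
    T result = operation.get();
    // Stop the clock here, before the extractor runs, so a slow extractor is not counted.
    long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
    try {
      metricsExtractor.accept(result);
    } finally {
      // Runs even if the extractor throws, so the duration is recorded
      // before the exception propagates to the caller.
      recordDurationMs.accept(elapsedMs);
    }
    return result;
  }
}
```

The try/finally is what guarantees the duration is recorded on the extractor-failure path.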

Config type changes

AIConfig base class gains an Evaluator field and getEvaluator() accessor. AICompletionConfig and AIAgentConfig constructors accept an Evaluator. AIJudgeConfig always wires Evaluator.noop() internally — judges do not evaluate themselves.

NoOpAIConfigTracker removed — all configs now get real LDAIConfigTrackerImpl instances.

Test plan

./gradlew :lib:sdk:server-ai:test passes
JudgeTest covers the judge surface:
- successfulEvaluationReturnsCorrectScore / scoreBoundaryZeroIsValid / scoreBoundaryOneIsValid
- reasoningIsOptional
- runnerExceptionResultsInFailure — caught, not rethrown
- nullParsedOutputResultsInFailure / missingScoreResultsInFailure
- scoreAboveOneResultsInFailure / scoreBelowZeroResultsInFailure
- samplingRateZeroAlwaysSkips / samplingRateOneAlwaysRuns
- evaluateMessagesFormatsCorrectly
- getConfigReturnsOriginal / getRunnerReturnsOriginal
EvaluatorTest covers the evaluator surface:
- noopReturnsEmptyList / noopReturnsSameInstance / noopFutureIsAlreadyDone
- singleJudgeIsRun / multipleJudgesAreAllRun
- missingJudgeIsSkipped
- evaluatorDoesNotCallTrackJudgeResult
- returnedFutureIsAlreadyDone
RunnerResultTest covers builder, immutability, and defensive copy

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 1a7e1f6.

cursor · 2026-06-23T21:14:58Z

+        .success(true)
+        .judgeConfigKey(config.getKey())
+        .metricKey(config.getEvaluationMetricKey())
+        .score(score);


NaN scores pass validation

High Severity

When parsed score is a non-finite number such as NaN, the range check against [0.0, 1.0] does not reject it, so the judge returns success=true with that score. Downstream trackJudgeResult can then emit a metric with a non-finite value.
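To illustrate how a range check can admit NaN: in Java every ordered comparison involving NaN evaluates to false, so a check written as "reject if below 0 or above 1" never rejects it. The exact form of the PR's check is not shown in this excerpt, so `naiveIsValid` is an assumed shape, and `strictIsValid` is one possible fix rather than the change actually made.

```java
/** Illustration only; not the Judge's actual validation code. */
final class ScoreValidation {
  /** Rejection-style check: NaN slips through because both comparisons are false. */
  static boolean naiveIsValid(double score) {
    return !(score < 0.0 || score > 1.0);
  }

  /** Acceptance-style check with an explicit finiteness guard: NaN and infinities fail. */
  static boolean strictIsValid(double score) {
    return Double.isFinite(score) && score >= 0.0 && score <= 1.0;
  }
}
```

Positive and negative infinity are rejected by both forms, since they compare as outside the range; NaN is the case that separates them.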

cursor · 2026-06-23T21:14:58Z

+    this.judges = Objects.requireNonNull(judges, "judges");
+    this.judgeConfiguration = Objects.requireNonNull(judgeConfiguration, "judgeConfiguration");
+    this.logger = Objects.requireNonNull(logger, "logger");
+    this.isNoop = false;


Evaluator retains mutable judges map

Medium Severity

The Evaluator constructor stores the supplied judges map by reference without a defensive or unmodifiable copy, while the type is documented as immutable. Callers can mutate or replace entries after construction, changing which judges run or causing concurrent modification during evaluation.
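One way to address this, sketched with a hypothetical `JudgeRegistrySketch` class rather than the real Evaluator: copy the supplied map in the constructor and expose only an unmodifiable view.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Illustration only; J stands in for the Judge type. */
final class JudgeRegistrySketch<J> {
  private final Map<String, J> judges;

  JudgeRegistrySketch(Map<String, J> judges) {
    Objects.requireNonNull(judges, "judges");
    // Copy first, then wrap: later changes to the caller's map are not
    // visible here, and the stored map cannot be modified through the accessor.
    this.judges = Collections.unmodifiableMap(new LinkedHashMap<>(judges));
  }

  Map<String, J> judges() {
    return judges;
  }
}
```

Wrapping without copying would not be enough, because an unmodifiable view still reflects changes made to the caller's original map.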

feat: add Runner, RunnerResult, Judge, and Evaluator

1a7e1f6

mattrmc1 changed the base branch from main to mmccarthy/AIC-2664/ai-config-tracker-overhaul June 23, 2026 21:13

mattrmc1 marked this pull request as ready for review June 23, 2026 21:13

mattrmc1 requested a review from a team as a code owner June 23, 2026 21:13

mattrmc1 marked this pull request as draft June 23, 2026 21:14

cursor Bot reviewed Jun 23, 2026

mattrmc1 added 3 commits June 23, 2026 17:33

Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of githu…

a94b2bf

…b.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of githu…

6c80aed

…b.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of githu…

1355033

…b.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

mattrmc1 wants to merge 4 commits into mmccarthy/AIC-2664/ai-config-tracker-overhaul from mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals
