feat: add Runner, RunnerResult, Judge, and Evaluator #180

Draft
mattrmc1 wants to merge 4 commits into mmccarthy/AIC-2664/ai-config-tracker-overhaul from mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

Conversation

@mattrmc1 mattrmc1 commented Jun 23, 2026

Summary

Adds the AIEVALS types and wires up the tracker factory. Callers can now implement a Runner to wrap any model provider, construct a Judge to evaluate AI outputs against a judge prompt with structured {score, reasoning} output, and coordinate multiple judges through an Evaluator. Config retrieval methods now produce real trackers (instead of no-ops) and expose the evaluator on every config type.

Four new public types and two wiring changes on LDAIClientImpl:

Runner

public interface Runner {
  RunnerResult run(String input, Map<String, Object> outputType) throws Exception;
  default RunnerResult run(String input) throws Exception { return run(input, null); }
}

Wraps a model provider SDK. The outputType parameter carries a JSON-Schema-like map when structured output is needed (e.g., a judge requesting {score, reasoning}). The single-arg overload delegates with outputType = null. Implementations should be thread-safe.
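
To make the contract concrete, here is a minimal sketch of an implementation. It does not use the SDK types: `SketchResult` and `SketchRunner` are local stand-ins for `RunnerResult` and `Runner`, and `EchoRunner` is a hypothetical runner that returns canned data instead of calling a provider.

```java
import java.util.Map;

final class RunnerSketch {
    /** Local stand-in for RunnerResult; only content and parsed are modeled. */
    static final class SketchResult {
        final String content;
        final Map<String, Object> parsed; // null when no structured output was requested

        SketchResult(String content, Map<String, Object> parsed) {
            this.content = content;
            this.parsed = parsed;
        }
    }

    /** Same shape as the Runner interface in this PR, using the stand-in result type. */
    interface SketchRunner {
        SketchResult run(String input, Map<String, Object> outputType) throws Exception;

        default SketchResult run(String input) throws Exception {
            return run(input, null);
        }
    }

    /** Hypothetical runner with no mutable state, so one instance can be shared across threads. */
    static final class EchoRunner implements SketchRunner {
        @Override
        public SketchResult run(String input, Map<String, Object> outputType) {
            if (outputType == null) {
                return new SketchResult("echo: " + input, null);
            }
            // A real implementation would pass the schema to the provider; this reply is canned.
            return new SketchResult("canned", Map.of("score", 1.0, "reasoning", "canned"));
        }
    }

    public static void main(String[] args) throws Exception {
        SketchRunner runner = new EchoRunner();
        System.out.println(runner.run("hello").content);
        System.out.println(runner.run("hello", Map.of("type", "object")).parsed);
    }
}
```

Keeping the runner free of mutable fields is one simple way to meet the thread-safety expectation; a runner that wraps a stateful provider client would need its own synchronization.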

RunnerResult

public static Builder builder(String content, AIMetrics metrics);

Immutable result of a Runner invocation. Holds content (response text), metrics (AIMetrics from the tracker layer), optional raw (unmodified provider response), and optional parsed (structured output as Map<String, Object>). The parsed map is defensively copied and returned as unmodifiable.

Judge

public Judge(AIJudgeConfig config, Runner runner, LDLogger logger);

public JudgeResult evaluate(String input, String output);
public JudgeResult evaluate(String input, String output, double samplingRate);
public JudgeResult evaluateMessages(List<Message> messages, RunnerResult response);
public JudgeResult evaluateMessages(List<Message> messages, RunnerResult response, double samplingRate);

Evaluates an AI model output by invoking a runner with a formatted evaluation prompt and parsing the structured response. The sampling gate runs first: a call that is not selected at the given samplingRate returns sampled=false immediately. Otherwise the judge formats the input/output, creates a fresh tracker via config.createTracker(), and delegates to tracker.trackMetricsOf with EVALUATION_SCHEMA. It then parses score (Number, must be in [0.0, 1.0]) and reasoning (String, optional) from the structured response.

Runner exceptions are caught and returned as JudgeResult(success=false, errorMessage=...) — judge failures are results, not exceptions. The judge does not call trackJudgeResult; that is the caller's responsibility.
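
The control flow described above (sampling gate first, then runner failures converted into results) could look roughly like the sketch below. `Outcome` is a local stand-in for `JudgeResult`, the random source is injected only to make the gate testable, and the field values returned on the skip path other than `sampled=false` are assumptions.

```java
import java.util.concurrent.Callable;
import java.util.function.DoubleSupplier;

final class JudgeFlowSketch {
    /** Local stand-in for JudgeResult with only the fields discussed above. */
    static final class Outcome {
        final boolean sampled;
        final boolean success;
        final String errorMessage;

        Outcome(boolean sampled, boolean success, String errorMessage) {
            this.sampled = sampled;
            this.success = success;
            this.errorMessage = errorMessage;
        }
    }

    /**
     * @param samplingRate fraction of calls to evaluate, in [0.0, 1.0]
     * @param random       source of values in [0.0, 1.0)
     * @param runnerCall   stand-in for the runner invocation
     */
    static Outcome evaluate(double samplingRate, DoubleSupplier random, Callable<?> runnerCall) {
        // Sampling gate: rate 0.0 always skips, rate 1.0 always runs.
        if (random.getAsDouble() >= samplingRate) {
            return new Outcome(false, false, null);
        }
        try {
            runnerCall.call();
            return new Outcome(true, true, null);
        } catch (Exception e) {
            // Failures become results; nothing is rethrown to the caller.
            return new Outcome(true, false, e.getMessage());
        }
    }
}
```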

evaluateMessages formats a List<Message> as role: content lines joined by newlines, then delegates to evaluate.
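
As an illustration of that formatting step, the sketch below joins messages as role: content lines. `Msg` is a local stand-in for the SDK's `Message` type, whose accessors are not shown in this PR description.

```java
import java.util.List;
import java.util.stream.Collectors;

final class MessageFormatSketch {
    /** Local stand-in for Message. */
    static final class Msg {
        final String role;
        final String content;

        Msg(String role, String content) {
            this.role = role;
            this.content = content;
        }
    }

    /** Renders each message as "role: content" and joins the lines with newlines. */
    static String format(List<Msg> messages) {
        return messages.stream()
                .map(m -> m.role + ": " + m.content)
                .collect(Collectors.joining("\n"));
    }
}
```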

Evaluator

public static Evaluator noop();
public Evaluator(Map<String, Judge> judges, JudgeConfiguration judgeConfiguration, LDLogger logger);

public CompletableFuture<List<JudgeResult>> evaluate(String input, String output);

Coordinates execution of a set of judges against a single input/output pair. Judges run sequentially in the order specified by the JudgeConfiguration. Missing judges are skipped with a warning. The noop singleton returns a completed future holding an empty list immediately.
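
A rough sketch of that coordination, under some assumptions: each judge is modeled as a `BiFunction` from (input, output) to a result string, the configured order is a plain list of keys standing in for `JudgeConfiguration`, and the work happens synchronously on the calling thread so the returned future is already complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

final class EvaluatorSketch {
    /**
     * @param judges judge key to judge function (stand-in for Map<String, Judge>)
     * @param order  judge keys in configured order (stand-in for JudgeConfiguration)
     */
    static CompletableFuture<List<String>> evaluate(
            Map<String, BiFunction<String, String, String>> judges,
            List<String> order,
            String input,
            String output) {
        List<String> results = new ArrayList<>();
        for (String key : order) {
            BiFunction<String, String, String> judge = judges.get(key);
            if (judge == null) {
                // Missing judges are skipped with a warning, not treated as errors.
                System.err.println("warn: no judge registered for key " + key);
                continue;
            }
            results.add(judge.apply(input, output));
        }
        return CompletableFuture.completedFuture(results);
    }
}
```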

The evaluator does not call trackJudgeResult — that is the managed type's responsibility (Plan 4).

For v1.0, all configs receive Evaluator.noop() since there is no RunnerFactory to auto-create runners for judge configs.

Tracker factory wiring

LDAIClientImpl now creates real LDAIConfigTrackerImpl instances instead of no-ops. A private trackerFactory method captures the config identity (configKey, variationKey, version, modelName, providerName, context) and returns a Supplier<LDAIConfigTracker> that produces a fresh tracker with a new runId on each call.

Default configs also receive real trackers: the tracker carries the requested configKey even when no flag was found, and variationKey is null.
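
The factory pattern described in this section can be sketched as follows. `SketchTracker` is a local stand-in that models only three of the captured fields, and generating the runId with `UUID.randomUUID()` is an assumption; the PR description does not say how run IDs are produced.

```java
import java.util.UUID;
import java.util.function.Supplier;

final class TrackerFactorySketch {
    /** Local stand-in for LDAIConfigTrackerImpl. */
    static final class SketchTracker {
        final String configKey;
        final String variationKey; // null for default configs
        final String runId;

        SketchTracker(String configKey, String variationKey, String runId) {
            this.configKey = configKey;
            this.variationKey = variationKey;
            this.runId = runId;
        }
    }

    /** Captures the config identity once; every get() yields a tracker with a new runId. */
    static Supplier<SketchTracker> trackerFactory(String configKey, String variationKey) {
        return () -> new SketchTracker(configKey, variationKey, UUID.randomUUID().toString());
    }
}
```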

createTracker on LDAIClient

LDAIConfigTracker createTracker(String resumptionToken, LDContext context);

Reconstructs a tracker from a resumption token for multi-turn or streaming scenarios that span multiple requests. Delegates to LDAIConfigTrackerImpl.fromResumptionToken.

Tracker hardening

  • trackMetricsOf now stops the clock before running the metrics extractor, so a slow extractor doesn't inflate the duration. If the extractor throws, operation duration is still recorded before the exception propagates.
  • Null-argument track calls (trackDuration(null), trackFeedback(null), etc.) downgraded from warn to debug per spec.
  • trackJudgeResult now guards against blank metricKey and infinite/NaN score.
  • Resumption token decode validates non-empty runId and configKey.
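
The first bullet's ordering can be sketched as below. This is an illustration, not the tracker's code: the clock is injected so the behavior is deterministic, the parameter names are hypothetical, and recording the duration before calling the extractor is one way (assumed here) to guarantee it survives an extractor failure.

```java
import java.util.function.Function;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class StopClockSketch {
    static <T, M> M trackMetricsOf(
            LongSupplier clock,
            Supplier<T> operation,
            Function<T, M> extractor,
            LongConsumer recordDuration) {
        long start = clock.getAsLong();
        T result = operation.get();
        // Stop the clock here, before the extractor runs, so extractor time is excluded.
        long elapsed = clock.getAsLong() - start;
        // Record first; if the extractor throws below, the duration is already tracked.
        recordDuration.accept(elapsed);
        return extractor.apply(result);
    }
}
```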

Config type changes

AIConfig base class gains an Evaluator field and getEvaluator() accessor. AICompletionConfig and AIAgentConfig constructors accept an Evaluator. AIJudgeConfig always wires Evaluator.noop() internally — judges do not evaluate themselves.

NoOpAIConfigTracker removed — all configs now get real LDAIConfigTrackerImpl instances.

Test plan

  • ./gradlew :lib:sdk:server-ai:test passes
  • JudgeTest covers the judge surface:
    • successfulEvaluationReturnsCorrectScore / scoreBoundaryZeroIsValid / scoreBoundaryOneIsValid
    • reasoningIsOptional
    • runnerExceptionResultsInFailure — caught, not rethrown
    • nullParsedOutputResultsInFailure / missingScoreResultsInFailure
    • scoreAboveOneResultsInFailure / scoreBelowZeroResultsInFailure
    • samplingRateZeroAlwaysSkips / samplingRateOneAlwaysRuns
    • evaluateMessagesFormatsCorrectly
    • getConfigReturnsOriginal / getRunnerReturnsOriginal
  • EvaluatorTest covers the evaluator surface:
    • noopReturnsEmptyList / noopReturnsSameInstance / noopFutureIsAlreadyDone
    • singleJudgeIsRun / multipleJudgesAreAllRun
    • missingJudgeIsSkipped
    • evaluatorDoesNotCallTrackJudgeResult
    • returnedFutureIsAlreadyDone
  • RunnerResultTest covers builder, immutability, and defensive copy

@mattrmc1 mattrmc1 changed the base branch from main to mmccarthy/AIC-2664/ai-config-tracker-overhaul June 23, 2026 21:13
@mattrmc1 mattrmc1 marked this pull request as ready for review June 23, 2026 21:13
@mattrmc1 mattrmc1 requested a review from a team as a code owner June 23, 2026 21:13
@mattrmc1 mattrmc1 marked this pull request as draft June 23, 2026 21:14

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 1a7e1f6.

.success(true)
.judgeConfigKey(config.getKey())
.metricKey(config.getEvaluationMetricKey())
.score(score);

NaN scores pass validation

High Severity

When parsed score is a non-finite number such as NaN, the range check against [0.0, 1.0] does not reject it, so the judge returns success=true with that score. Downstream trackJudgeResult can then emit a metric with a non-finite value.
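
For context on why NaN can slip through: every ordered comparison involving NaN evaluates to false in Java, so a check written as a pair of rejecting comparisons never rejects it. The sketch below assumes the judge's check has that shape (the actual code is not shown here) and contrasts it with a check that requires a finite value.

```java
final class ScoreValidationSketch {
    /** Assumed shape of the existing check: rejects only when a comparison is true. */
    static boolean naiveInRange(double score) {
        return !(score < 0.0 || score > 1.0);
    }

    /** Requires a finite value inside [0.0, 1.0]; NaN and infinities are rejected. */
    static boolean isValidScore(double score) {
        return Double.isFinite(score) && score >= 0.0 && score <= 1.0;
    }

    public static void main(String[] args) {
        System.out.println(naiveInRange(Double.NaN)); // true: NaN passes the naive check
        System.out.println(isValidScore(Double.NaN)); // false
    }
}
```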

this.judges = Objects.requireNonNull(judges, "judges");
this.judgeConfiguration = Objects.requireNonNull(judgeConfiguration, "judgeConfiguration");
this.logger = Objects.requireNonNull(logger, "logger");
this.isNoop = false;

Evaluator retains mutable judges map

Medium Severity

The Evaluator constructor stores the supplied judges map by reference without a defensive or unmodifiable copy, while the type is documented as immutable. Callers can mutate or replace entries after construction, changing which judges run or causing concurrent modification during evaluation.
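
One possible shape for the fix, sketched with a stand-in class rather than the real `Evaluator`: copy the map at construction time and expose only an unmodifiable view. `Object` stands in for `Judge`, and the use of `LinkedHashMap` (to preserve the caller's iteration order) is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

final class JudgeMapCopySketch {
    private final Map<String, Object> judges;

    JudgeMapCopySketch(Map<String, Object> judges) {
        Objects.requireNonNull(judges, "judges");
        // Copy first, then wrap: later changes to the caller's map are not visible here.
        this.judges = Collections.unmodifiableMap(new LinkedHashMap<>(judges));
    }

    Map<String, Object> judges() {
        return judges;
    }
}
```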

mattrmc1 added 3 commits June 23, 2026 17:33
…b.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals
…b.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals
…b.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals