feat: add Runner, RunnerResult, Judge, and Evaluator #180
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 1a7e1f6.
    .success(true)
    .judgeConfigKey(config.getKey())
    .metricKey(config.getEvaluationMetricKey())
    .score(score);
NaN scores pass validation
High Severity
When the parsed score is a non-finite number such as NaN, the range check against [0.0, 1.0] does not reject it, so the judge returns success=true with that score. Downstream, trackJudgeResult can then emit a metric with a non-finite value.
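A minimal sketch of how this can happen, assuming the range check is written as "reject if below 0.0 or above 1.0" (the actual check in the PR is not shown in this thread). `ScoreValidation`, `naiveInRange`, and `isValidScore` are hypothetical names used only for illustration. Every ordered comparison involving NaN evaluates to false, so a check built only from `<` and `>` never rejects it.

```java
// Hypothetical helper, not code from the PR: illustrates why NaN can slip
// through a range check that is built only from ordered comparisons.
final class ScoreValidation {
    private ScoreValidation() {}

    /**
     * Naive check: rejects a score only if it compares below 0.0 or above 1.0.
     * NaN compares false against everything, so it is never rejected here.
     */
    static boolean naiveInRange(double score) {
        return !(score < 0.0 || score > 1.0);
    }

    /**
     * Stricter check: the score must be finite (not NaN, not infinite)
     * and must lie within [0.0, 1.0].
     */
    static boolean isValidScore(double score) {
        return Double.isFinite(score) && score >= 0.0 && score <= 1.0;
    }
}
```

With a check shaped like `isValidScore`, a NaN score would produce a failed judge result rather than `success=true`.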
    this.judges = Objects.requireNonNull(judges, "judges");
    this.judgeConfiguration = Objects.requireNonNull(judgeConfiguration, "judgeConfiguration");
    this.logger = Objects.requireNonNull(logger, "logger");
    this.isNoop = false;
Evaluator retains mutable judges map
Medium Severity
The Evaluator constructor stores the supplied judges map by reference without a defensive or unmodifiable copy, while the type is documented as immutable. Callers can mutate or replace entries after construction, changing which judges run or causing concurrent modification during evaluation.
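One common way to address this is to copy the map in the constructor and expose only an unmodifiable view. The sketch below is an assumption about the shape of a fix, not the PR's code: `JudgeRegistry` is a hypothetical stand-in for `Evaluator`, and the map's value type is simplified to `String` so the example is self-contained.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the Evaluator constructor; the real value type
// would be the judge type, simplified to String here.
final class JudgeRegistry {
    private final Map<String, String> judges;

    JudgeRegistry(Map<String, String> judges) {
        Objects.requireNonNull(judges, "judges");
        // Copy first, so later mutations of the caller's map are not visible here;
        // then wrap, so nobody holding the returned view can mutate the copy.
        this.judges = Collections.unmodifiableMap(new LinkedHashMap<>(judges));
    }

    Map<String, String> getJudges() {
        return judges;
    }
}
```

`LinkedHashMap` keeps the caller's iteration order. On Java 10 and later, `Map.copyOf` is a shorter alternative when the map holds no null keys or values and iteration order does not matter.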
…b.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals


Summary
Adds the AIEVALS types and wires up the tracker factory. Callers can now implement a `Runner` to wrap any model provider, construct a `Judge` to evaluate AI outputs against a judge prompt with structured `{score, reasoning}` output, and coordinate multiple judges through an `Evaluator`. Config retrieval methods now produce real trackers (instead of no-ops) and expose the evaluator on every config type.

Four new public types and two wiring changes on `LDAIClientImpl`:

Runner

Wraps a model provider SDK. The `outputType` parameter carries a JSON-Schema-like map when structured output is needed (e.g., a judge requesting `{score, reasoning}`). The single-arg overload delegates with `outputType = null`. Implementations should be thread-safe.

RunnerResult

Immutable result of a `Runner` invocation. Holds `content` (response text), `metrics` (`AIMetrics` from the tracker layer), optional `raw` (unmodified provider response), and optional `parsed` (structured output as `Map<String, Object>`). The `parsed` map is defensively copied and returned as unmodifiable.

Judge

Evaluates an AI model output by invoking a runner with a formatted evaluation prompt and parsing the structured response. The sampling gate runs first: below the rate, it returns `sampled=false` immediately. Otherwise it formats the input/output, creates a fresh tracker via `config.createTracker()`, and delegates to `tracker.trackMetricsOf` with `EVALUATION_SCHEMA`. It parses `score` (Number, must be in [0.0, 1.0]) and `reasoning` (String, optional) from the structured response.

Runner exceptions are caught and returned as `JudgeResult(success=false, errorMessage=...)`; judge failures are results, not exceptions. The judge does not call `trackJudgeResult`; that is the caller's responsibility.

`evaluateMessages` formats a `List<Message>` as `role: content` lines joined by newlines, then delegates to `evaluate`.

Evaluator

Coordinates execution of a set of judges against a single input/output pair. Judges run sequentially in the order specified by the `JudgeConfiguration`. Missing judges are skipped with a warning. The noop singleton immediately returns a completed future holding an empty list.

The evaluator does not call `trackJudgeResult`; that is the managed type's responsibility (Plan 4).

For v1.0, all configs receive `Evaluator.noop()` since there is no `RunnerFactory` to auto-create runners for judge configs.

Tracker factory wiring

`LDAIClientImpl` now creates real `LDAIConfigTrackerImpl` instances instead of no-ops. A private `trackerFactory` method captures the config identity (`configKey`, `variationKey`, `version`, `modelName`, `providerName`, `context`) and returns a `Supplier<LDAIConfigTracker>` that produces a fresh tracker with a new `runId` on each call.

Default configs also receive real trackers: the `configKey` was requested even if no flag was found; `variationKey` is `null`.

`createTracker` on `LDAIClient`

Reconstructs a tracker from a resumption token for multi-turn or streaming scenarios that span multiple requests. Delegates to `LDAIConfigTrackerImpl.fromResumptionToken`.

Tracker hardening

- `trackMetricsOf` now stops the clock before running the metrics extractor, so a slow extractor doesn't inflate the duration. If the extractor throws, operation duration is still recorded before the exception propagates.
- Log messages for null arguments (`trackDuration(null)`, `trackFeedback(null)`, etc.) downgraded from `warn` to `debug` per spec.
- `trackJudgeResult` now guards against blank `metricKey` and infinite/NaN `score`.
- … `runId` and `configKey`.

Config type changes

- `AIConfig` base class gains an `Evaluator` field and `getEvaluator()` accessor.
- `AICompletionConfig` and `AIAgentConfig` constructors accept an `Evaluator`.
- `AIJudgeConfig` always wires `Evaluator.noop()` internally; judges do not evaluate themselves.
- `NoOpAIConfigTracker` removed; all configs now get real `LDAIConfigTrackerImpl` instances.

Test plan

- `./gradlew :lib:sdk:server-ai:test` passes
- `JudgeTest` covers the judge surface:
  - `successfulEvaluationReturnsCorrectScore` / `scoreBoundaryZeroIsValid` / `scoreBoundaryOneIsValid`
  - `reasoningIsOptional`
  - `runnerExceptionResultsInFailure`: caught, not rethrown
  - `nullParsedOutputResultsInFailure` / `missingScoreResultsInFailure`
  - `scoreAboveOneResultsInFailure` / `scoreBelowZeroResultsInFailure`
  - `samplingRateZeroAlwaysSkips` / `samplingRateOneAlwaysRuns`
  - `evaluateMessagesFormatsCorrectly`
  - `getConfigReturnsOriginal` / `getRunnerReturnsOriginal`
- `EvaluatorTest` covers the evaluator surface:
  - `noopReturnsEmptyList` / `noopReturnsSameInstance` / `noopFutureIsAlreadyDone`
  - `singleJudgeIsRun` / `multipleJudgesAreAllRun`
  - `missingJudgeIsSkipped`
  - `evaluatorDoesNotCallTrackJudgeResult`
  - `returnedFutureIsAlreadyDone`
- `RunnerResultTest` covers builder, immutability, and defensive copy
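
The evaluator behaviour described in the summary (sequential execution in configured order, missing judges skipped with a warning, an already-completed future returned) can be sketched as follows. This is an illustration under simplifying assumptions, not the PR's `Evaluator`: `SequentialEvaluatorSketch` is a hypothetical name, a judge is modelled as a plain `BiFunction` from input and output to a result string, and `java.util.logging` stands in for whatever logger the SDK uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;
import java.util.logging.Logger;

// Hypothetical sketch: judges are simplified to functions returning a String result.
final class SequentialEvaluatorSketch {
    private static final Logger LOG = Logger.getLogger("evaluator-sketch");

    private SequentialEvaluatorSketch() {}

    static CompletableFuture<List<String>> evaluate(
            List<String> judgeKeysInOrder,
            Map<String, BiFunction<String, String, String>> judges,
            String input,
            String output) {
        List<String> results = new ArrayList<>();
        // Run judges one at a time, in the order the configuration lists them.
        for (String key : judgeKeysInOrder) {
            BiFunction<String, String, String> judge = judges.get(key);
            if (judge == null) {
                // A configured key with no registered judge is skipped, not treated as an error.
                LOG.warning("No judge registered for key '" + key + "'; skipping");
                continue;
            }
            results.add(judge.apply(input, output));
        }
        // All work happened synchronously, so the future is already complete.
        return CompletableFuture.completedFuture(results);
    }
}
```

Because the work is done before the future is created, callers that check `isDone()` immediately see `true`, which matches the `returnedFutureIsAlreadyDone` test name in the plan above.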
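
The `trackMetricsOf` hardening described in the summary (stop the clock before the metrics extractor runs, and keep the duration even if the extractor throws) can be illustrated with the sketch below. It is one possible arrangement under stated assumptions, not the SDK's implementation: `TimedOperationSketch`, `track`, and `lastDurationNanos` are hypothetical names, the extractor is simplified to a `Consumer`, and the clock is injected so the behaviour can be checked deterministically.

```java
import java.util.function.Consumer;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of "stop the clock before extracting metrics".
final class TimedOperationSketch {
    private final LongSupplier nanoClock;
    private long lastDurationNanos = -1L;

    TimedOperationSketch(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    <T> T track(Supplier<T> operation, Consumer<T> metricsExtractor) {
        long start = nanoClock.getAsLong();
        T result = operation.get();
        // Stop the clock and record the duration before the extractor runs,
        // so extractor time is excluded and a throwing extractor cannot lose it.
        lastDurationNanos = nanoClock.getAsLong() - start;
        metricsExtractor.accept(result);
        return result;
    }

    long lastDurationNanos() {
        return lastDurationNanos;
    }
}
```

In real use the clock would typically be `System::nanoTime`; a fake clock is used in tests so that time spent in the extractor can be simulated exactly.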