Add latent-objective recognition eval to multi-objective skill #1442
cafzal wants to merge 2 commits into
Signed-off-by: cafzal <cameron.afzal@gmail.com>
…002/004 house style) Signed-off-by: cafzal <cameron.afzal@gmail.com>
Walkthrough: a single new evaluation case (Changes: Latent Objective Eval Case).
Estimated code review effort: 1 (Trivial), ~3 minutes
Pre-merge checks: 5 passed
@rgsl888prabhu small skill eval: adds the latent-objective case (the under-trigger boundary the other four miss) to `cuopt-multi-objective-exploration`, mirroring the existing four. Ready when you/Miles have a cycle.
Description
Adds a fifth eval to the `cuopt-multi-objective-exploration` skill, `multiobj-explore-eval-005-latent-objective`, covering the boundary the existing four don't: a problem stated with a single objective while a second objective sits latent in the data, unstated.

The current evals either hand the agent both objectives (001 interpret, 002 explore, 004 dual-as-slope) or are explicitly single-objective (003 decoy). None test recognizing a latent objective. This one grades whether the skill makes the agent surface the latent cost objective and trace the supply-vs-cost frontier, rather than optimizing the stated objective alone or silently folding cost into a self-chosen weighted blend (`maximize supply − λ·cost`). It brackets the skill's activation boundary opposite the 003 decoy.

Behavioral eval (`expected_script: null`, LLM-graded on the behavior list), same house style as 001/002/004; `validate_skills.sh` picks up the new array entry, and the signature / `BENCHMARK.md` / skill-card regenerate via NVSkills-Eval. The latent-objective shape is the max-supply supply-vs-cost case validated on cuOpt (Tesla T4) in NVIDIA/cuopt-examples#157.

Checklist
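For reviewers, the difference between tracing the supply-vs-cost frontier and collapsing it into a self-chosen blend (as discussed in the description) can be sketched on a toy problem. This is an illustration only: the `Plan` type, the candidate numbers, and both helper functions are hypothetical, and nothing here uses cuOpt or reflects the eval's actual content.

```python
from typing import List, NamedTuple


class Plan(NamedTuple):
    name: str
    supply: float
    cost: float


# Hypothetical candidate plans; the numbers are made up for illustration.
PLANS = [
    Plan("A", 100.0, 10.0),
    Plan("B", 140.0, 18.0),
    Plan("C", 150.0, 30.0),
    Plan("D", 120.0, 25.0),  # dominated by B (less supply, more cost)
    Plan("E", 160.0, 55.0),
]


def weighted_blend(plans: List[Plan], lam: float) -> Plan:
    """Collapse both objectives into one score: supply - lam * cost.

    Returns a single plan; the trade-off is hidden inside the choice of lam.
    """
    return max(plans, key=lambda p: p.supply - lam * p.cost)


def trace_frontier(plans: List[Plan]) -> List[Plan]:
    """Epsilon-constraint sweep: maximize supply subject to cost <= budget.

    Each distinct cost level is used as a budget; a plan is kept only if it
    improves supply over the previous frontier point.
    """
    frontier: List[Plan] = []
    for budget in sorted({p.cost for p in plans}):
        feasible = [p for p in plans if p.cost <= budget]
        # Prefer higher supply; break ties by lower cost.
        best = max(feasible, key=lambda p: (p.supply, -p.cost))
        if not frontier or best.supply > frontier[-1].supply:
            frontier.append(best)
    return frontier


if __name__ == "__main__":
    print("blend (lam=1.0):", weighted_blend(PLANS, 1.0).name)
    for p in trace_frontier(PLANS):
        print(f"frontier: {p.name} supply={p.supply} cost={p.cost}")
```

The blend returns one plan whose identity depends entirely on an unstated `lam`, while the sweep exposes every non-dominated supply/cost pair so the trade-off can be chosen explicitly.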