eval.framework
CALIBRATED
Domain-grounded evaluation
Ground-truth datasets engineered against your compliance perimeter — not public leaderboards.
Token-level scoring, cohort segmentation, and counterfactual diffs run inside your VPC. Every metric is auditable and traceable to a decision flow your domain experts signed off on.
token_level_scoring(corpus, rubric)
ground_truth_calibration(SME_review)
cohort_segmented_eval(by=jurisdiction)
counterfactual_diff(baseline, candidate)
eval-report-v3.json feeds [02]
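To make the cohort segmentation and counterfactual diff steps concrete, here is a minimal Python sketch. It reuses the names `cohort_segmented_eval` and `counterfactual_diff` from the card, but the signatures, the record layout (a dict with a `score` and a `jurisdiction` field), and the sample numbers are assumptions made for illustration, not the framework's actual API or real results.

```python
from collections import defaultdict
from statistics import mean


def cohort_segmented_eval(records, by):
    """Return the mean score per cohort.

    Hypothetical signature: each record is a dict holding a numeric
    'score' plus the field named by `by` (e.g. 'jurisdiction').
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[by]].append(rec["score"])
    return {cohort: mean(scores) for cohort, scores in buckets.items()}


def counterfactual_diff(baseline, candidate):
    """Per-cohort change in mean score (candidate minus baseline).

    Only cohorts present in both runs are compared.
    """
    shared = baseline.keys() & candidate.keys()
    return {cohort: candidate[cohort] - baseline[cohort] for cohort in shared}


# Made-up scores for two model runs over the same evaluation items.
baseline_records = [
    {"jurisdiction": "EU", "score": 0.5},
    {"jurisdiction": "EU", "score": 1.0},
    {"jurisdiction": "US", "score": 0.5},
]
candidate_records = [
    {"jurisdiction": "EU", "score": 1.0},
    {"jurisdiction": "EU", "score": 1.0},
    {"jurisdiction": "US", "score": 0.25},
]

base = cohort_segmented_eval(baseline_records, by="jurisdiction")
cand = cohort_segmented_eval(candidate_records, by="jurisdiction")
diff = counterfactual_diff(base, cand)
print(diff)
```

Segmenting before diffing matters because an aggregate score can hide a regression: in this toy data the candidate improves on the EU cohort while getting worse on the US cohort.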