llm-evaluation-lab
A config-driven workbench for evaluating LLM responses: pairwise response scoring, RLHF-style rubrics, JSONL validation, multi-annotator agreement, disagreement cases, failure tags, and Markdown/JSON reports.
- Computed-vs-human preference checks
- Pairwise annotator agreement metrics
- Unit-tested CLI workflow with CI
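
To make "pairwise annotator agreement" concrete, here is a minimal sketch of one common metric, Cohen's kappa, computed for every pair of annotators. This is an illustration, not the lab's actual implementation: the function names `cohen_kappa` and `pairwise_agreement`, the annotator IDs, and the `"A"`/`"B"` preference labels are all hypothetical, and the project may use a different metric or data layout.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_agreement(annotations):
    """Map each annotator pair to its kappa. `annotations` is {annotator_id: [labels]}."""
    return {
        (a, b): cohen_kappa(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }


if __name__ == "__main__":
    # Hypothetical preference labels: which response ("A" or "B") each annotator preferred.
    annotations = {
        "ann1": ["A", "B", "A", "A"],
        "ann2": ["A", "B", "B", "A"],
        "ann3": ["A", "B", "A", "A"],
    }
    for pair, kappa in pairwise_agreement(annotations).items():
        print(pair, round(kappa, 3))
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one response is preferred far more often than the other. Items where a pair disagrees are natural candidates for the disagreement cases mentioned above.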