Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, LLM-as-judge approaches, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
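As a rough illustration of how these pieces can fit together, the sketch below combines a simple automated metric (exact match) with an LLM-as-judge score over a small evaluation set. The `judge` callable, the `EvalCase` structure, the rubric wording, and the 1-5 score scale are all assumptions for the example, not part of this skill; swap in your own model client and metrics.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class EvalCase:
    prompt: str    # input sent to the application under test
    expected: str  # reference answer used by automated checks
    actual: str    # output produced by the LLM application

def exact_match(case: EvalCase) -> float:
    """Automated metric: 1.0 if the output matches the reference exactly."""
    return 1.0 if case.actual.strip() == case.expected.strip() else 0.0

def llm_judge_score(case: EvalCase, judge: Callable[[str], str]) -> float:
    """LLM-as-judge: ask a judge model (hypothetical callable) to rate the answer 1-5."""
    rubric = (
        "Rate the ANSWER for correctness against the REFERENCE on a 1-5 scale. "
        "Reply with a single integer.\n"
        f"QUESTION: {case.prompt}\nREFERENCE: {case.expected}\nANSWER: {case.actual}"
    )
    reply = judge(rubric)
    try:
        return int(reply.strip()) / 5.0  # normalize to 0-1
    except ValueError:
        return 0.0  # unparseable judge output is treated as a failure

def run_eval(cases: list[EvalCase], judge: Callable[[str], str]) -> dict[str, float]:
    """Aggregate both metrics across the evaluation set."""
    return {
        "exact_match": mean(exact_match(c) for c in cases),
        "judge_score": mean(llm_judge_score(c, judge) for c in cases),
    }

if __name__ == "__main__":
    # Stub judge so the sketch runs without any API; replace with a real model call.
    fake_judge = lambda prompt: "4"
    cases = [
        EvalCase("What is 2+2?", "4", "4"),
        EvalCase("Capital of France?", "Paris", "paris"),
    ]
    print(run_eval(cases, fake_judge))
```

In practice the judge callable would wrap whatever model client the application already uses, and human feedback would be logged alongside these scores rather than replaced by them.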
To install, download the skill repository as a ZIP file, or add it via the plugin marketplace:

/plugin marketplace add wshobson/agents