Omnibin - Comprehensive ML Metrics Report Generator
Generate detailed evaluation reports for classification, regression, segmentation, and detection tasks.
Binary Classification Report
Upload a CSV with 'y_true' (0/1) and 'y_pred' (0-1 probability) columns.
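A quick local check can catch malformed input before uploading. The sketch below is not part of Omnibin; `validate_binary_csv` is a hypothetical helper that uses only the Python standard library and enforces the two constraints stated above (labels in {0, 1}, probabilities in [0, 1]).

```python
import csv
import io


def validate_binary_csv(text):
    """Parse CSV text and return a list of (y_true, y_pred) tuples.

    Hypothetical helper (not an Omnibin function). Raises ValueError
    if a required column is missing or a value is out of range.
    """
    reader = csv.DictReader(io.StringIO(text))
    missing = {"y_true", "y_pred"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        y_true = int(row["y_true"])      # raises ValueError on non-integers
        y_pred = float(row["y_pred"])    # raises ValueError on non-numbers
        if y_true not in (0, 1):
            raise ValueError(f"line {line_no}: y_true must be 0 or 1")
        if not 0.0 <= y_pred <= 1.0:
            raise ValueError(f"line {line_no}: y_pred must be in [0, 1]")
        rows.append((y_true, y_pred))
    return rows


# Made-up example data in the expected layout.
example = "y_true,y_pred\n1,0.92\n0,0.13\n1,0.58\n"
rows = validate_binary_csv(example)
```

The same pattern should carry over to the regression upload by dropping the two range checks, since that format only requires continuous values in both columns.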
Regression Report
Upload a CSV with 'y_true' and 'y_pred' columns (continuous values).
Object Detection Report
Upload a JSON file with the following structure:
{
  "predictions": [
    [{"box": [x1, y1, x2, y2], "score": 0.9}, ...],
    ...
  ],
  "ground_truths": [
    [[x1, y1, x2, y2], ...],
    ...
  ]
}
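One way to produce a file in this shape is sketched below. It assumes (the page does not say so explicitly) that each inner list corresponds to one image, that `predictions` and `ground_truths` are index-aligned, and that an image with no detections is represented by an empty list. `build_detection_payload` is a hypothetical helper, not an Omnibin function, and the boxes are made-up values.

```python
import json


def build_detection_payload(predictions, ground_truths):
    """Check the basic shape and return a dict ready for json.dump.

    Hypothetical helper. Assumes one inner list per image and
    boxes given as [x1, y1, x2, y2].
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths differ in length")
    for image_preds in predictions:
        for pred in image_preds:
            if len(pred["box"]) != 4 or "score" not in pred:
                raise ValueError(f"malformed prediction: {pred!r}")
    for image_boxes in ground_truths:
        for box in image_boxes:
            if len(box) != 4:
                raise ValueError(f"malformed ground-truth box: {box!r}")
    return {"predictions": predictions, "ground_truths": ground_truths}


payload = build_detection_payload(
    predictions=[
        [{"box": [10, 20, 110, 220], "score": 0.9}],  # image 0
        [],                                           # image 1: no detections
    ],
    ground_truths=[
        [[12, 18, 108, 215]],                         # image 0
        [[50, 60, 150, 160]],                         # image 1
    ],
)

# Serialize; write this string to a .json file for upload.
json_text = json.dumps(payload, indent=2)
```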
Text Generation Report (Radiology Reports)
Upload a CSV with reference and candidate columns. Lexical metrics (BLEU / ROUGE / METEOR / BERTScore) run locally. LLM-judge metrics (GREEN, RadFact, CRIMSON) require an API key.
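Report text usually contains commas, so the CSV needs proper quoting. The sketch below writes a two-column file with the standard library. The header names `reference` and `candidate` are an assumption taken from the description above; the downloadable example file is the authority on the exact names. The report sentences are invented for illustration.

```python
import csv
import io

# Invented (reference, candidate) report pairs, for illustration only.
pairs = [
    ("No focal consolidation, pleural effusion, or pneumothorax.",
     "No consolidation. No pleural effusion or pneumothorax."),
    ("Small right pleural effusion, unchanged.",
     "Small right pleural effusion, stable from prior."),
]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)  # quote every field
writer.writerow(["reference", "candidate"])         # assumed header names
writer.writerows(pairs)
csv_text = buffer.getvalue()

# Round-trip check: embedded commas survive because fields are quoted.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Writing `csv_text` to a file (opened with `newline=""`) gives an upload candidate.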
Paper-default judge models: GREEN → StanfordAIMI/GREEN-RadLlama2-7b (Ostmeier et al., 2024); RadFact → Llama-3-70B-Instruct (Bannur et al., 2024); CRIMSON → MedGemmaCRIMSON-4B (Baharoon et al., 2026).
⚠️ DISCLAIMER — API-mode judges are not paper defaults.
The hosted Space runs GREEN, RadFact, and CRIMSON in API mode: each metric keeps its paper's verbatim prompt (GREEN, CRIMSON) or system messages (RadFact), but the request is routed to a generic provider you select (OpenAI / Anthropic / Google / OpenRouter / Groq). Scores can differ from published numbers because those were produced with the paper-default judge models, some of which were fine-tuned for the task.
- GREEN API mode: verbatim prompt + parser, only the judge model changes.
- RadFact API mode: verbatim two-stage system prompts but with zero-shot JSON output (the upstream radfact package teaches output format with 10-shot YAML). Only logical precision/recall/F1; no grounding or spatial scores.
- CRIMSON API mode: verbatim prompt + JSON parser + scoring formula (vendored from rajpurkarlab/CRIMSON), only the judge model changes. We use API mode on this Space because crimson-score pins pandas>=3.0.1 (conflicts with Gradio's pandas<3.0); tracked upstream as rajpurkarlab/CRIMSON#3.
For exact paper reproduction: pip install omnibin[green] (needs GPU), pip install omnibin[radfact] (needs a separate env due to pydantic 1.x / Gradio conflict), or pip install omnibin[crimson] (separate env until the pandas pin is relaxed).
Download example file to try:
text_generation_example.csv (4.2 KB)
Aggregate Scores
Submetric Breakdown
Omnibin v0.3.1 - Comprehensive ML evaluation metrics with healthcare focus