Binary Classification Report

Upload a CSV with 'y_true' (0/1) and 'y_pred' (0-1 probability) columns.
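
To try the report with your own data, a file in this shape can be produced with the standard library alone. The sketch below is illustrative: the function name `make_classification_csv`, the row count, and the synthetic score distribution are assumptions made for the example, not part of Omnibin. Only the two column names and their value ranges come from the description above.

```python
import csv
import io
import random


def make_classification_csv(n_rows: int = 100, seed: int = 0) -> str:
    """Return CSV text with 'y_true' (0/1) and 'y_pred' (probability) columns."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["y_true", "y_pred"])
    for _ in range(n_rows):
        y_true = rng.randint(0, 1)
        # Synthetic score (assumption): positives centre on 0.7, negatives on 0.3,
        # then the value is clipped into the [0, 1] probability range.
        center = 0.7 if y_true == 1 else 0.3
        y_pred = min(1.0, max(0.0, rng.gauss(center, 0.2)))
        writer.writerow([y_true, f"{y_pred:.4f}"])
    return buf.getvalue()
```

Writing the returned text to a `.csv` file gives something that can be passed to the Upload CSV field.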

Download example file to try:

Example: classification_example.csv (10.4 KB)

Upload CSV

Bootstrap Iterations

DPI

Color Scheme

Report PDF

ROC/PR Curves

Metrics vs Threshold

Confusion Matrix

Calibration

Prediction Distribution

Metrics Summary

Regression Report

Upload a CSV with 'y_true' and 'y_pred' columns (continuous values).
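
Before uploading, it can help to check that a file really has both columns and that every value is a finite number. The following is a minimal sketch of such a check. `parse_regression_csv` is a hypothetical helper written for this example, and the rules it enforces (no empty cells, no NaN or infinite values) are assumptions about what counts as a clean input, not documented Omnibin behaviour.

```python
import csv
import io
import math


def parse_regression_csv(text: str) -> tuple[list[float], list[float]]:
    """Parse CSV text into (y_true, y_pred) lists, raising ValueError on bad input."""
    reader = csv.DictReader(io.StringIO(text))
    missing = {"y_true", "y_pred"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    y_true, y_pred = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            t, p = float(row["y_true"]), float(row["y_pred"])
        except (TypeError, ValueError) as exc:
            raise ValueError(f"line {line_no}: non-numeric value") from exc
        if not (math.isfinite(t) and math.isfinite(p)):
            raise ValueError(f"line {line_no}: NaN or infinite value")
        y_true.append(t)
        y_pred.append(p)
    return y_true, y_pred
```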

Download example file to try:

Example: regression_example.csv (3.3 KB)

Upload CSV

Bootstrap Iterations

DPI

Color Scheme

Report PDF

Scatter Plot

Residuals

Q-Q Plot

Bland-Altman

Error Distribution

Metrics Summary

Segmentation Report

Upload two NumPy (.npy) files: ground truth mask and prediction mask.
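
A pair of mask files can be produced with `numpy.save`. The sketch below builds two overlapping synthetic disks and writes them to disk. The helper names, the 128x128 shape, the `uint8` dtype, and the 0/1 label encoding are all assumptions chosen for illustration; the page does not state which shapes or dtypes the app expects.

```python
from pathlib import Path

import numpy as np


def disk_mask(shape, center, radius):
    """Binary mask (uint8, values 0/1) containing one filled disk."""
    rows, cols = np.ogrid[: shape[0], : shape[1]]
    inside = (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius**2
    return inside.astype(np.uint8)


def save_mask_pair(out_dir):
    """Write ground_truth.npy and prediction.npy into out_dir; return both paths."""
    out_dir = Path(out_dir)
    ground_truth = disk_mask((128, 128), center=(64, 64), radius=30)
    # The "prediction" is a slightly shifted, slightly smaller disk.
    prediction = disk_mask((128, 128), center=(68, 60), radius=28)
    # Masks are compared pixel by pixel, so their shapes have to match.
    if ground_truth.shape != prediction.shape:
        raise ValueError("ground truth and prediction shapes differ")
    gt_path = out_dir / "ground_truth.npy"
    pred_path = out_dir / "prediction.npy"
    np.save(gt_path, ground_truth)
    np.save(pred_path, prediction)
    return gt_path, pred_path
```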

Download example files to try:

Example: Ground Truth (2D): segmentation_2d_ground_truth.npy (64.1 KB)

Example: Prediction (2D): segmentation_2d_prediction.npy (64.1 KB)

Ground Truth Mask (.npy)

Prediction Mask (.npy)

Bootstrap Iterations

DPI

Color Scheme

Report PDF

Segmentation Comparison

Confusion Matrix

Metrics Bar Chart

Surface Distance

Metrics Summary

Object Detection Report

Upload a JSON file with the following structure:

{
    "predictions": [
        [{"box": [x1,y1,x2,y2], "score": 0.9}, ...],
        ...
    ],
    "ground_truths": [
        [[x1,y1,x2,y2], ...],
        ...
    ]
}
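
The structure above can be assembled and serialized with the `json` module. This is a sketch under two assumptions that the page does not spell out: that each entry of the outer lists corresponds to one image, and that both lists are therefore the same length. `build_detection_payload` is a hypothetical helper name.

```python
import json


def build_detection_payload(predictions, ground_truths):
    """Build the upload dict.

    predictions:   per image, a list of (box, score) pairs
    ground_truths: per image, a list of boxes
    Each box is [x1, y1, x2, y2].
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths need the same number of entries")
    payload = {"predictions": [], "ground_truths": []}
    for dets, gts in zip(predictions, ground_truths):
        for box in [b for b, _ in dets] + list(gts):
            if len(box) != 4:
                raise ValueError(f"box needs 4 coordinates, got {len(box)}")
        payload["predictions"].append(
            [{"box": [float(v) for v in box], "score": float(score)} for box, score in dets]
        )
        payload["ground_truths"].append([[float(v) for v in box] for box in gts])
    return payload


def to_json_text(payload) -> str:
    """Serialize the payload as indented JSON text, ready to write to a file."""
    return json.dumps(payload, indent=4)
```

An image with no detections is represented here as an empty inner list, which keeps the two outer lists aligned.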

Download example file to try:

Example: detection_example.json (15.4 KB)

Upload JSON

Bootstrap Iterations

DPI

Color Scheme

Report PDF

Precision-Recall Curve

FROC Curve

IoU Distribution

Confidence Distribution

Metrics Summary

Text Generation Report (Radiology Reports)

Upload a CSV with reference and candidate columns. Lexical metrics (BLEU / ROUGE / METEOR / BERTScore) run locally. LLM-judge metrics (GREEN, RadFact, CRIMSON) require an API key.
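
Report text usually contains commas and line breaks, so it is safer to let the `csv` module handle quoting than to join strings by hand. The sketch below assumes the header names are literally `reference` and `candidate`; the page says only that the file has "reference and candidate columns", so check the example file for the exact spelling. `reports_to_csv` is a hypothetical helper.

```python
import csv
import io


def reports_to_csv(pairs) -> str:
    """Return CSV text from an iterable of (reference, candidate) report strings."""
    buf = io.StringIO()
    # csv.writer quotes any field that contains commas, quotes, or newlines.
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["reference", "candidate"])  # header names are an assumption
    for reference, candidate in pairs:
        writer.writerow([reference.strip(), candidate.strip()])
    return buf.getvalue()


# Made-up sentences for illustration only, not real patient data.
example_pairs = [
    ("No focal consolidation, effusion, or pneumothorax.", "Lungs are clear."),
    ("Small left pleural effusion.", "Small left pleural effusion, unchanged."),
]
csv_text = reports_to_csv(example_pairs)
```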

Paper-default judge models: GREEN → StanfordAIMI/GREEN-RadLlama2-7b (Ostmeier et al., 2024); RadFact → Llama-3-70B-Instruct (Bannur et al., 2024); CRIMSON → MedGemmaCRIMSON-4B (Baharoon et al., 2026).

⚠️ DISCLAIMER — API-mode judges are not paper defaults.
The hosted Space runs GREEN, RadFact, and CRIMSON in API mode: it keeps each paper's verbatim prompt (GREEN, CRIMSON) or system messages (RadFact) but routes them to a generic provider you select (OpenAI / Anthropic / Google / OpenRouter / Groq). Scores can differ from the published numbers because the original papers fine-tuned their own judge models.

GREEN API mode: verbatim prompt and parser; only the judge model changes.

RadFact API mode: verbatim two-stage system prompts, but with zero-shot JSON output (the upstream radfact package teaches the output format with 10-shot YAML). It reports only logical precision, recall, and F1, with no grounding or spatial scores.

CRIMSON API mode: verbatim prompt, JSON parser, and scoring formula (vendored from rajpurkarlab/CRIMSON); only the judge model changes. This Space uses API mode because crimson-score pins pandas>=3.0.1, which conflicts with Gradio's pandas<3.0; the issue is tracked upstream as rajpurkarlab/CRIMSON#3.

For exact paper reproduction, install the matching extra: pip install omnibin[green] (needs a GPU), pip install omnibin[radfact] (needs a separate environment because of a pydantic 1.x / Gradio conflict), or pip install omnibin[crimson] (needs a separate environment until the pandas pin is relaxed).

Download example file to try:

Example: text_generation_example.csv (10 fabricated CXR pairs, 4.2 KB)

Upload CSV

Metrics

Pick which metrics to compute. GREEN / RadFact / CRIMSON require an API key because they run in API mode with your chosen provider (see disclaimer above).

LLM Provider (for GREEN / RadFact / CRIMSON)

Model Name (optional — defaults per provider)

API Key

Bootstrap Iterations

DPI

Color Scheme

Report PDF

Metrics Summary

Per-sample Distribution

Metric Correlation

Per-sample Heatmap

Aggregate Scores

Submetric Breakdown

Omnibin - Comprehensive ML Metrics Report Generator