
Add model evaluations SDK and CLI#475

Merged
leeclemnet merged 5 commits into main from lee/model-evals-cli-sdk
May 7, 2026

Conversation

Contributor

@leeclemnet leeclemnet commented May 7, 2026

Summary

Adds Python SDK + CLI bindings for the new public Model Evaluations REST API. Mirrors every UI panel in the app's evaluation page as a method on ModelEval and a subcommand under roboflow eval.

This is PR 4 of a 4-PR stack:

#     PR                               Repo                    Status
----  -------------------------------  ----------------------  --------------------------
1     DNA edits                        roboflow                ✅ merged (roboflow#11602)
2     REST API implementation + tests  roboflow                open (roboflow#11636)
3     MCP tools                        roboflow-mcp            open (roboflow-mcp#36)
4     CLI + SDK + tests (this PR)      roboflow-python         open
docs  Public-facing reference          roboflow-dev-reference  open (dev-reference#18)

Important

Blocked on roboflow#11636 deploying. The CLI/SDK calls REST endpoints that ship in that PR; until 11636 merges and deploys to a given environment, calls return "Unsupported request. GET /:workspace/model-evals does not exist...". For pre-deploy testing, set API_URL=https://localapi.roboflow.one (the local dev server hosts 11636).

SDK

import roboflow
ws = roboflow.Roboflow(api_key=...).workspace("my-workspace")

# List + get
evals = ws.evals(status="done", limit=10)        # → list[ModelEval]
e = ws.eval("huUF720inUcymARwqAGK")              # → ModelEval

# Headline metrics on done evals (no extra round-trip)
e.summary  # → {"mAP": 0.92, "precision": 0.85, "recall": 0.85}

# Panel data — one method per UI panel
e.map_results()
e.confidence_sweep()
e.performance_by_class(split="test")
e.confusion_matrix(split="test", confidence=20)
e.vector_analysis(confidence=20)
e.image_predictions(split="test", limit=200)
e.recommendations()

Typed errors so callers don't parse strings: ModelEvalNotFoundError, ModelEvalNotDoneError, InvalidSplitError, InvalidConfidenceError (all subclass RoboflowError).
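A caller-side sketch of branching on these typed errors rather than message strings. The stub classes below stand in for the real ones so the snippet is self-contained; FakeEval and summary_or_none are hypothetical names for illustration, not part of the SDK:

```python
# Stubs mirroring the exception hierarchy described above -- in real code
# these are imported from the roboflow package.
class RoboflowError(Exception): ...
class ModelEvalNotDoneError(RoboflowError): ...


class FakeEval:
    """Stand-in for ModelEval: a non-done eval has no summary yet."""

    def __init__(self, status):
        self.status = status

    @property
    def summary(self):
        if self.status != "done":
            raise ModelEvalNotDoneError(self.status)
        return {"mAP": 0.92, "precision": 0.85, "recall": 0.85}


def summary_or_none(e):
    # Branch on the exception type, not on the error message text.
    try:
        return e.summary
    except ModelEvalNotDoneError:
        return None
```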

CLI

roboflow eval list [-p PROJECT] [-v VERSION] [-m MODEL] [-s STATUS] [-n LIMIT]
roboflow eval get <eval_id>
roboflow eval map-results <eval_id>
roboflow eval confidence-sweep <eval_id>
roboflow eval performance-by-class <eval_id> [-s SPLIT]
roboflow eval confusion-matrix <eval_id> [-s SPLIT] [-c CONFIDENCE]
roboflow eval vector-analysis <eval_id> [-c CONFIDENCE]
roboflow eval image-predictions <eval_id> [-s SPLIT] [-c CONF] [-n LIMIT] [-o OFFSET]
roboflow eval recommendations <eval_id>

Honors the standard global flags: --json/-j, --workspace/-w, --api-key/-k, --quiet/-q. Exit codes match error categories: 3 = not found, 4 = not done, 5 = invalid input, 2 = missing workspace/auth.
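The error-category → exit-code mapping above can be sketched as a small table from exception type to code. The class names match the SDK section; the helper itself (and the fallback code of 1 for unmapped errors) is an illustrative assumption, not the actual CLI implementation:

```python
# Stubs mirroring the SDK's typed error hierarchy.
class RoboflowError(Exception): ...
class ModelEvalNotFoundError(RoboflowError): ...
class ModelEvalNotDoneError(RoboflowError): ...
class InvalidSplitError(RoboflowError): ...
class InvalidConfidenceError(RoboflowError): ...

# Exit codes as documented: 3 = not found, 4 = not done, 5 = invalid input.
EXIT_CODES = {
    ModelEvalNotFoundError: 3,
    ModelEvalNotDoneError: 4,
    InvalidSplitError: 5,
    InvalidConfidenceError: 5,
}


def exit_code_for(exc: Exception) -> int:
    # Fall back to 1 for anything unmapped (assumption, not from the PR).
    return EXIT_CODES.get(type(exc), 1)
```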

list and performance-by-class render ASCII tables for human output; the dense panels (map-results, confusion-matrix, vector-analysis, image-predictions, recommendations, confidence-sweep) pretty-print JSON since their nested shapes don't tabulate cleanly. Full structured access via --json.

Tests

  • 57 new tests (full suite: 610 passed / 1 skipped).
  • Adapter tests (20): URL/param plumbing for each endpoint, all four typed errors, status-code fallback for unknown 404s, non-JSON body fallback.
  • SDK tests (18): construction with/without info, refresh chainability, each panel method's adapter delegation, error propagation, all Workspace.evals/eval filter forwarding.
  • CLI tests (19): registration / --help for every subcommand, list filters, JSON vs text output, exit-code mapping for 404/409/400 across handlers, panel-arg forwarding for all 7 panels.

Live verification

Verified against localapi.roboflow.one with the test API key against workspace lee-sandbox. All 9 endpoints return correct data through both SDK and CLI. Error paths verified: eval get NOT_A_REAL_ID → exit 3 with model_eval_not_found; eval map-results <running-id> → exit 4 with model_eval_not_done; eval performance-by-class <id> --split all → exit 5 with invalid_split; eval confusion-matrix <id> --confidence 999 → exit 5 with invalid_confidence. SDK round-trip through Workspace.evals() / Workspace.eval().panel() confirmed end-to-end including the typed ModelEvalNotDoneError raise.

🤖 Generated with Claude Code

Live verification — local (localapi.roboflow.one)

Transcript: all 9 subcommands + 4 error paths against local dev server
══════════════════════════════════════════════════════════════════════════
 roboflow eval — CLI transcript (workspace: lee-sandbox)
   $ export API_URL=https://localapi.roboflow.one
   $ export ROBOFLOW_API_KEY=…
══════════════════════════════════════════════════════════════════════════

── $ roboflow eval list --workspace lee-sandbox --limit 3
ID                    STATUS  PROJECT               VERSION  MODEL                 CREATED
--------------------  ------  --------------------  -------  --------------------  ------------------------
huUF720inUcymARwqAGK  done    kBRjxrPAMU4f9xmR16ws  4                              2026-04-27T20:04:10.904Z
ViZ1tozq4XBDt0yjYPnF  done    7vSK9kc3i6eFm700KWFt  63                             2026-04-24T21:39:11.238Z
342EYOewHnu8J6HIzvNy  done    7vSK9kc3i6eFm700KWFt  60       Z9HwYHqrpMMoCtmAo4pO  2026-04-24T21:34:08.958Z

── $ roboflow eval list --workspace lee-sandbox --status done --limit 2
ID                    STATUS  PROJECT               VERSION  MODEL  CREATED
--------------------  ------  --------------------  -------  -----  ------------------------
huUF720inUcymARwqAGK  done    kBRjxrPAMU4f9xmR16ws  4               2026-04-27T20:04:10.904Z
ViZ1tozq4XBDt0yjYPnF  done    7vSK9kc3i6eFm700KWFt  63              2026-04-24T21:39:11.238Z

── $ roboflow eval get huUF720inUcymARwqAGK --workspace lee-sandbox
Eval: huUF720inUcymARwqAGK
  Status:  done
  Project: kBRjxrPAMU4f9xmR16ws
  Version: 4
  Model:   (none)
  Created: 2026-04-27T20:04:10.904Z
  Config:  overlap=30 iouThreshold=50
  Summary: mAP=0.9239650566041828 precision=0.85 recall=0.85

── $ roboflow eval get huUF720inUcymARwqAGK --workspace lee-sandbox --json | jq '.summary'
{
  "mAP": 0.9239650566041828,
  "precision": 0.85,
  "recall": 0.85
}

── $ roboflow eval performance-by-class huUF720inUcymARwqAGK --workspace lee-sandbox
Split: test
CLASS       mAP50   mAP50-95  mAP75   P       R       F1      OPT_THR
----------  ------  --------  ------  ------  ------  ------  -------
Car-rims    0.9240  0.7555    0.9240  0.8500  0.8500  0.8500  0.3700
music-note                            0.0000  0.0000  0.0000  0.5000

── $ roboflow eval map-results huUF720inUcymARwqAGK -w lee-sandbox --json | jq '.splits.test | {map50,map50_95,map75}'
{
  "map50": 0.9239650566041828,
  "map50_95": 0.7555258345429926,
  "map75": 0.9239650566041828
}

── $ roboflow eval confidence-sweep huUF720inUcymARwqAGK -w lee-sandbox --json | jq '.splits.test.optimalThreshold,.splits.test.optimalMetrics'
0.37
{
  "precision": 0.85,
  "recall": 0.85,
  "f1": 0.85
}

── $ roboflow eval confusion-matrix huUF720inUcymARwqAGK -w lee-sandbox --json | jq '{split,confidenceThreshold,classes,matrix}'
{
  "split": "all",
  "confidenceThreshold": 0.2,
  "classes": ["Car-rims", "music-note", "background"],
  "matrix": [
    [ 382,   0,   2],
    [   0,   0,   5],
    [1336,   1,   0]
  ]
}

── $ roboflow eval vector-analysis huUF720inUcymARwqAGK -w lee-sandbox --json | jq '.clustering'
{
  "method": "hdbscan",
  "nClusters": 54,
  "metrics": { "noiseRatio": 0.078, "silhouetteScore": 0.489 },
  "parameters": {
    "min_cluster_size": 2,
    "min_samples": 1,
    "cluster_selection_method": "eom",
    "metric": "euclidean"
  },
  "processingTimeSeconds": 8.36
}
# .clusters has 55 entries (54 + the noise bucket id=-1)
# each cluster: { id, numImages, splitDistribution, metrics: {f1Mean, f1Std, f1Min, f1Max, precisionMean, recallMean}, sampleImages: [...] }

── $ roboflow eval image-predictions huUF720inUcymARwqAGK -w lee-sandbox --split test --limit 1 --json | jq '.images[0]'
{
  "imageId": "1QKLCUsfAzFiCIb6YCJj",
  "imageName": "-B59BC424-…-png_jpg.rf.b1444301….jpg",
  "split": "test",
  "augmentations": 2,
  "cluster": { "id": 4, "embedding2D": [7.495, -5.144] },
  "stats": {
    "truePositives": 2, "falsePositives": 7, "falseNegatives": 0,
    "precision": 0.222, "recall": 1.0, "f1": 0.364
  },
  "confusion": [[0, 0, 2], [2, 0, 7]]
}

── $ roboflow eval recommendations huUF720inUcymARwqAGK -w lee-sandbox --json | jq '.recommendations.summary'
{
  "confidenceThreshold": 37,
  "split": "test",
  "generatedAt": "2026-04-27T20:05:37.512Z",
  "count": 3,
  "f1": 0.85,
  "precision": 0.85,
  "recall": 0.85
}
# .recommendations.items has 3 entries:
#   { type: "missed_detection",  analysis: { affected_class: "Car-rims", count: 3 } }
#   { type: "class_imbalance",   analysis: { affected_class: "Car-rims", current_count: 20, … } }
#   …

═══════════════════════════════════════════════════════════════════════════
                              ERROR PATHS
═══════════════════════════════════════════════════════════════════════════

── $ roboflow eval get NOT_REAL --workspace lee-sandbox
Error: Model evaluation not found
  Hint: Run 'roboflow eval list' to see eval ids in this workspace.
$? = 3

── $ roboflow eval map-results fNyWx6PC74rCc18IuZ3M --workspace lee-sandbox   # running eval
Error: Model evaluation has not completed
  Hint: Wait for the eval to finish (status='done') before reading panel data.
$? = 4

── $ roboflow eval performance-by-class huUF720inUcymARwqAGK -w lee-sandbox --split all
Error: Invalid split: must be one of train, valid, test
  Hint: Use one of: train, valid, test (or 'all' where supported).
$? = 5

── $ roboflow eval confusion-matrix huUF720inUcymARwqAGK -w lee-sandbox --confidence 999
Error: Invalid confidence threshold: must be an integer percentage in [0, 100]
  Hint: Pass an integer between 0 and 100.
$? = 5

Exit codes: 0 success · 2 missing workspace/auth · 3 not found · 4 not done · 5 invalid input.

Live verification — staging (api.roboflow.one)

All 9 roboflow eval subcommands exercised against the staging API on commit 6eea8aa. Workspace lee-sandbox, eval huUF720inUcymARwqAGK. Verbose outputs (confidence-sweep, vector-analysis) are truncated for readability — full payloads available via --json from the same command.

$ roboflow --workspace lee-sandbox eval list --limit 5
ID                    STATUS  PROJECT               VERSION  MODEL                 CREATED                 
--------------------  ------  --------------------  -------  --------------------  ------------------------
huUF720inUcymARwqAGK  done    kBRjxrPAMU4f9xmR16ws  4                              2026-04-27T20:04:10.904Z
ViZ1tozq4XBDt0yjYPnF  done    7vSK9kc3i6eFm700KWFt  63                             2026-04-24T21:39:11.238Z
342EYOewHnu8J6HIzvNy  done    7vSK9kc3i6eFm700KWFt  60       Z9HwYHqrpMMoCtmAo4pO  2026-04-24T21:34:08.958Z
pX6mnyL185mgbKM6V49G  done    7vSK9kc3i6eFm700KWFt  62                             2026-04-24T21:33:19.025Z
UcVhAMxwiV9yctz5QBNM  done    7vSK9kc3i6eFm700KWFt  61                             2026-04-24T20:58:07.978Z
$ roboflow --workspace lee-sandbox eval get huUF720inUcymARwqAGK
Eval: huUF720inUcymARwqAGK
  Status:  done
  Project: kBRjxrPAMU4f9xmR16ws
  Version: 4
  Model:   (none)
  Created: 2026-04-27T20:04:10.904Z
  Summary: mAP=0.9239650566041828 precision=0.85 recall=0.85
$ roboflow --workspace lee-sandbox eval map-results huUF720inUcymARwqAGK
{
  "evalId": "huUF720inUcymARwqAGK",
  "projectId": "kBRjxrPAMU4f9xmR16ws",
  "versionId": "4",
  "modelId": null,
  "splits": {
    "train": {
      "map50": 0.42346596111761736,
      "map50_95": 0.3144300021480774,
      "map75": 0.3746226767482993,
      "byObjectSize": {
        "small": {
          "map50": 0.43721019833036057,
          "map50_95": 0.28794722150177027,
          "map75": 0.3551395358136276
        },
        "medium": {
          "map50": 0.4586843319371667,
          "map50_95": 0.35248557062640906,
          "map75": 0.4150174709843547
        },
        "large": null
      },
      "perClass": {
        "Car-rims": {
          "map50": 0.8469319222352347,
          "map50_95": 0.6288600042961547,
          "map75": 0.7492453534965983,
          "byObjectSize": {
            "small": {
              "map50": 0.8744203966607211,
              "map50_95": 0.5758944430035405,
              "map75": 0.7102790716272552
            },
            "medium": {
              "map50": 0.9173686638743335,
              "map50_95": 0.7049711412528181,
              "map75": 0.8300349419687095
            },
            "large": null
          }
        },
        "music-note": {
          "map50": 0,
          "map50_95": 0,
          "map75": 0,
          "byObjectSize": {
            "small": {
              "map50": 0,
              "map50_95": 0,
              "map75": 0
            },
            "medium": {
              "map50": 0,
              "map50_95": 0,
              "map75": 0
            },
            "large": null
          }
        }

[... 100 more lines truncated ...]
$ roboflow --workspace lee-sandbox eval confidence-sweep huUF720inUcymARwqAGK
{
  "evalId": "huUF720inUcymARwqAGK",
  "projectId": "kBRjxrPAMU4f9xmR16ws",
  "versionId": "4",
  "modelId": null,
  "splits": {
    "train": {
      "perThreshold": {
        "0.00": {
          "precision": 0.01,
          "recall": 0.5,
          "f1": 0.02
        },
        "0.01": {
          "precision": 0.01,
          "recall": 0.5,
          "f1": 0.02
        },
        "0.02": {
          "precision": 0.01,
          "recall": 0.5,
          "f1": 0.02
        },
        "0.03": {
          "precision": 0.01,
          "recall": 0.5,
          "f1": 0.02
        },
        "0.04": {
          "precision": 0.01,
          "recall": 0.5,
          "f1": 0.02
        },
        "0.05": {
          "precision": 0.011,
          "recall": 0.5,
          "f1": 0.021
        },
        "0.06": {
          "precision": 0.011,

[... 4609 more lines truncated ...]
$ roboflow --workspace lee-sandbox eval performance-by-class huUF720inUcymARwqAGK --split test
Split: test
CLASS       mAP50   mAP50-95  mAP75   P       R       F1      OPT_THR
----------  ------  --------  ------  ------  ------  ------  -------
Car-rims    0.9240  0.7555    0.9240  0.8500  0.8500  0.8500  0.3700 
music-note                            0.0000  0.0000  0.0000  0.5000 
$ roboflow --workspace lee-sandbox eval confusion-matrix huUF720inUcymARwqAGK --split test
Split: test  Confidence: 0.2
{
  "evalId": "huUF720inUcymARwqAGK",
  "projectId": "kBRjxrPAMU4f9xmR16ws",
  "versionId": "4",
  "modelId": null,
  "split": "test",
  "confidenceThreshold": 0.2,
  "classes": [
    "Car-rims",
    "music-note",
    "background"
  ],
  "matrix": [
    [
      20,
      0,
      0
    ],
    [
      0,
      0,
      0
    ],
    [
      80,
      0,
      0
    ]
  ]
}
$ roboflow --workspace lee-sandbox eval vector-analysis huUF720inUcymARwqAGK
{
  "evalId": "huUF720inUcymARwqAGK",
  "projectId": "kBRjxrPAMU4f9xmR16ws",
  "versionId": "4",
  "modelId": null,
  "clustering": {
    "method": "hdbscan",
    "nClusters": 54,
    "metrics": {
      "noiseRatio": 0.078125,
      "silhouetteScore": 0.48925095796585083
    },
    "parameters": {
      "min_cluster_size": 2,
      "min_samples": 1,
      "cluster_selection_method": "eom",
      "metric": "euclidean"
    },
    "processingTimeSeconds": 8.360810041427612
  },
  "preprocessing": {
    "method": "umap",
    "originalDimensions": 768,
    "targetDimensions": 10,
    "nNeighbors": 30,
    "minDistance": 0.05
  },
  "clusters": [
    {
      "id": -1,
      "numImages": 15,
      "splitDistribution": {
        "train": 12,
        "valid": 2,
        "test": 1
      },
      "metrics": {
        "f1Mean": 0.46213333333333334,
        "f1Std": 0.2193821222332293,
        "f1Min": 0.129,
        "f1Max": 0.8,
        "precisionMean": 0.3301333333333334,
        "recallMean": 0.9524
      },
      "sampleImages": [
        "-3372F3CB-7432-4FE7-B9BB-DFB5D1FC3CD4-png_jpg.rf.161520bef3c23c613d2cb40845aadfc4.jpg",
        "13_jpeg_jpg.rf.03b2557177aea66e7c6dd0460ff2b4be.jpg",
        "eyJidWNrZXQiOiJkb25lZGVhbC5pZS1waG90b3MiLCJlZGl0cyI6eyJ0b0Zvcm1hdCI6ImpwZWciLCJyZXNpemUiOnsiZml0IjoiY292ZXIiLCJ3aWR0aCI6NjAwLCJoZWlnaHQiOjQ1MH19LCJrZXkiOiJwaG90b18yNTczNTQ4ODUifQ-_jpeg_jpg.rf.b4d397020d5343286735e37ca13d4176.jpg",
        "eyJidWNrZXQiOiJkb25lZGVhbC5pZS1waG90b3MiLCJlZGl0cyI6eyJ0b0Zvcm1hdCI6ImpwZWciLCJyZXNpemUiOnsiZml0IjoiY292ZXIiLCJ3aWR0aCI6NjAwLCJoZWlnaHQiOjQ1MH19LCJrZXkiOiJwaG90b18yNTczNTQ4ODUifQ-_jpeg_jpg.rf.e37e70360ac16b4f686171de52a82162.jpg",
        "11_jpeg_jpg.rf.188e9f7f9deb715cb6a180bb091babaf.jpg"
      ]
    },
    {
      "id": 0,
      "numImages": 3,
      "splitDistribution": {
        "train": 2,
        "valid": 1
      },
      "metrics": {

[... 1097 more lines truncated ...]
$ roboflow --workspace lee-sandbox eval image-predictions huUF720inUcymARwqAGK --split test --limit 2
Split: test  Confidence: 0.2  Total: 10  Offset: 0  Limit: 2
IMAGE                              SPLIT  TP  FP  FN  CLUSTER                                                           
---------------------------------  -----  --  --  --  ------------------------------------------------------------------
-B59BC424-0AB7-4880-82A8-54AC24C…  test               {'id': 4, 'embedding2D': [7.494518280029297, -5.143994331359863]} 
anne-nygard-6t7wZAU3nuY-unsplash…  test               {'id': 10, 'embedding2D': [8.974885940551758, 10.444429397583008]}
$ roboflow --workspace lee-sandbox eval recommendations huUF720inUcymARwqAGK
{
  "evalId": "huUF720inUcymARwqAGK",
  "projectId": "kBRjxrPAMU4f9xmR16ws",
  "versionId": "4",
  "modelId": null,
  "generated": true,
  "generatedAt": "2026-04-27T20:05:37.512Z",
  "recommendations": {
    "summary": {
      "confidenceThreshold": 37,
      "split": "test",
      "generatedAt": "2026-04-27T20:05:37.512Z",
      "count": 3,
      "f1": 0.85,
      "precision": 0.85,
      "recall": 0.85
    },
    "items": [
      {
        "id": "56bcd423-38ff-45f9-b3e0-662a71ce44e6",
        "type": "missed_detection",
        "analysis": {
          "affected_class": "Car-rims",
          "count": 3
        }
      },
      {
        "id": "150e49a8-3a61-479a-9e18-3eb751494a70",
        "type": "class_imbalance",
        "analysis": {
          "affected_class": "Car-rims",
          "current_count": 20,
          "total_gt_instances": 20,
          "median_count": 10,
          "min_count_threshold": 30,
          "relative_ratio_threshold": 3,
          "violates_min_count": true,
          "violates_relative_ratio": false,
          "all_imbalanced_classes": [
            {
              "class_name": "Car-rims",
              "count": 20
            }
          ]
        }
      },
      {
        "id": "f8964b2c-a5fa-47c2-9775-24c72de954b7",
        "type": "dataset_health",
        "analysis": {
          "split": "test",
          "current_count": 10,
          "total_images": 192,
          "current_percentage": 5,
          "min_absolute_size": 50,
          "min_percentage": 5
        }
      }
    ]
  }

[... 1 more lines truncated ...]

@leeclemnet
Contributor Author

@claude review this PR focusing on 1) alignment of implementation to conceptual API, 2) usability of tools, 3) security


# -- internal -----------------------------------------------------------

def _apply(self, info: Dict[str, Any]) -> None:
Contributor

H2. ModelEval.config is documented but never assigned

  • File: roboflow/core/model_eval.py (around the _apply method, ~line 30)
  • Reported by: Claude (HIGH); also noted by OpenAI as Low ("docstring says .config is populated, but ModelEval never exposes config"). Consensus: two of three reviewers caught it; they disagree on severity.
  • Detail: The class docstring promises .config is populated by refresh(), but _apply() never sets it. Any caller doing ev.config raises AttributeError. The PR description even shows Config: overlap=30 iouThreshold=50 rendered by eval get, suggesting the field exists in the API payload — so the gap is in the SDK plumbing, not the API.
  • Fix: Add self.config = info.get("config") in _apply() (and surface it in to_dict()), or remove it from the docstring.

Contributor Author

Removed .config from the class + refresh() docstrings. config was previously stripped from the public API response (per earlier review item B), so the SDK has nothing to populate. CLI's matching dead Config: overlap=… iouThreshold=… line in eval get removed too.

rows.append(
{
"image": img.get("imageName", img.get("imageId", "")),
"split": img.get("split", ""),
Contributor

Blocker: this comment from automated review seems right. Here is the API signature:
https://github.com/roboflow/roboflow/pull/11636/changes#diff-59f36a589c8b8f6567999ddf5a655fc2fe2a4dfe75c630178f86460b994aeb52R921


M1. eval image-predictions table drops TP/FP/FN

  • File: roboflow/cli/handlers/eval.py (image-predictions human renderer)
  • Reported by: OpenAI (Medium). Not flagged by Claude or Gemini.
  • Detail: The human-readable table reads tp/fp/fn keys, but the live API payload (per the PR description's transcript) uses truePositives/falsePositives/falseNegatives nested under stats. Result: the TP/FP/FN columns render blank for every row. JSON output via --json is unaffected.
  • Fix: Read from row["stats"]["truePositives"] etc., consistent with the documented payload shape, and add a CLI test that asserts non-empty TP/FP/FN cells.
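A minimal sketch of the suggested fix. The helper name image_prediction_rows is hypothetical; the key names (imageName, split, stats.truePositives, cluster.id) come from the payload shape documented in the transcript:

```python
def image_prediction_rows(images):
    """Build human-table rows: read TP/FP/FN from the nested camelCase
    `stats` object and show only the cluster id (sketch of the M1 fix)."""
    rows = []
    for img in images:
        stats = img.get("stats") or {}
        cluster = img.get("cluster") or {}
        rows.append({
            "image": img.get("imageName", img.get("imageId", "")),
            "split": img.get("split", ""),
            "tp": stats.get("truePositives", ""),
            "fp": stats.get("falsePositives", ""),
            "fn": stats.get("falseNegatives", ""),
            # Only the id -- don't stringify the whole {id, embedding2D} dict.
            "cluster": cluster.get("id", ""),
        })
    return rows
```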

Contributor Author

Renderer now reads stats.truePositives / falsePositives / falseNegatives (the actual server shape) instead of stats.tp/fp/fn. Cluster column now shows just cluster.id instead of stringifying the whole {id, embedding2D} dict. New regression test asserts non-empty TP/FP/FN cells and that embedding2D doesn't leak into the table.


# -- helpers ------------------------------------------------------------

def to_dict(self) -> Dict[str, Any]:
Contributor

Two comments that I found relevant


M4 · model_eval.py:to_dict() — no test coverage, non-trivial logic

The fallback branch in to_dict() (lines 139–152) that handles the "constructed with
no info" case uses a dict lookup to map JSON keys to Python attribute names and is
never exercised by any test. Given that this method will be the primary serialisation
path for programmatic users, it should have at minimum:

  • A test where ModelEval is constructed with info=None then to_dict() is called
  • A test where info is supplied and to_dict() is called (verifies _raw path)

The logic itself is also overcomplicated — a simple list of (json_key, attr_name)
tuples would be clearer (see Nit N1).


N1 · to_dict() fallback branch is needlessly complex

# model_eval.py lines 139–152
for key in ("status", "project", "versionId", "modelId", "createdAt", "summary"):
    attr = (
        key
        if key in {"status", "project", "summary"}
        else {
            "versionId": "version_id",
            "modelId": "model_id",
            "createdAt": "created_at",
        }[key]
    )

A flat list of (json_key, attr_name) pairs is cleaner:

_FIELD_MAP = [
    ("status", "status"),
    ("project", "project"),
    ("versionId", "version_id"),
    ("modelId", "model_id"),
    ("createdAt", "created_at"),
    ("summary", "summary"),
]
for json_key, attr_name in _FIELD_MAP:
    value = getattr(self, attr_name, None)
    if value is not None:
        data[json_key] = value

Contributor Author

Refactored to _PUBLIC_FIELDS = ((json_key, attr_name), ...) tuple list. Added 4 tests: round-trip with payload + evalId overlay; legacy id overlay; constructor-only path serialises attrs only with None omitted; constructor-only path translates Python version_id→JSON versionId correctly.

Contributor

@digaobarbosa digaobarbosa left a comment

I think the important one is the image predictions properties that are different than the API.
to_dict seems like a quick win too.

@leeclemnet leeclemnet force-pushed the lee/model-evals-cli-sdk branch from 98ca16c to aed353e on May 7, 2026 at 18:12
leeclemnet and others added 5 commits May 7, 2026 14:23
Wraps the public /{workspace}/model-evals REST surface (roboflow/roboflow#11636)
so users can read evaluation results — mAP, confidence sweep, per-class
performance, confusion matrix, vector clusters, per-image stats, and
recommendations — from Python and from the CLI without hitting the API directly.

SDK:
- Workspace.evals(...) and Workspace.eval(eval_id) accessors return ModelEval
  instances; ModelEval has one method per panel returning the raw JSON dict.
- Typed exceptions (ModelEvalNotFoundError, ModelEvalNotDoneError,
  InvalidSplitError, InvalidConfidenceError) so callers can distinguish "doesn't
  exist" from "still running" from "bad argument" without parsing strings.

CLI: roboflow eval {list, get, map-results, confidence-sweep,
performance-by-class, confusion-matrix, vector-analysis, image-predictions,
recommendations} — every command honors --json. Exit codes are stable per
error class (3=not found, 4=not done, 5=invalid arg).

Tests cover the adapter URL/param plumbing and error mapping (both flat and
nested error envelopes), the ModelEval class, the Workspace accessors, and
each CLI handler's adapter call + error path.

Companion docs in roboflow/roboflow-dev-reference#18.
The REST API returns a single flat shape {"error": "code", "message": "..."}
— the agent's original adapter accepted both flat and nested shapes for
forward-compat, but the nested shape never shipped. Drop the dead branch and
the corresponding test; replace with a status-code-fallback test that exercises
the existing 404/409 fallback paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR review on roboflow#11636 affecting the SDK/CLI:
- ModelEval._apply reads evalId (legacy id fallback for forward-compat)
- to_dict emits evalId
- Workspace.evals resolves either field when constructing ModelEval
- CLI list/get handlers prefer evalId, fall back to id
- Drop the undocumented `config` attribute (not part of public DNA shape)
- Tests updated for evalId; 57 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pairs with roboflow#11636 dropping `projectId` from the public response.

The SDK previously read `info["projectId"]` (the Firestore doc id) into
`ModelEval.project_id`. That field was a doc-id leak — the API now only
returns `project` (the URL slug) on the principle that public APIs
should not expose storage-layer ids.

Rename: `ModelEval.project_id` → `ModelEval.project`. Accept legacy
`projectId` from cached older-server responses for forward-compat.
CLI list/get handlers also pull from `project` first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues raised in review:

M1 (BLOCKER) — `eval image-predictions` table dropped TP/FP/FN. The
public API nests those counts under `stats` with camelCase keys
(`truePositives`/`falsePositives`/`falseNegatives`); the renderer was
reading `stats.tp`/`stats.fp`/`stats.fn` and silently rendering blanks.
Same line was also rendering the whole `cluster` object (including the
`embedding2D` array) in the cluster column; now renders `cluster.id`
only. Live transcript regression case added.

H2 — `ModelEval`'s class docstring promised `.config` was populated by
`refresh()`, but `_apply()` never set it. Drop the reference. The
`config` field was previously stripped from the public API response (per
earlier review item B — `overlap`/`iouThreshold` weren't documented in
DNA), so the SDK never has anything to populate. CLI's matching dead
"Config: overlap=… iouThreshold=…" line in `eval get` also removed.

M4/N1 — `to_dict()` had an untested fallback branch + an awkward
inline-conditional dict-lookup mapping json keys to attr names.
Refactor to a flat `_PUBLIC_FIELDS = ((json_key, attr_name), ...)`
tuple list. Add four tests:
  - round-trips a server payload with `evalId` overlay
  - overlays `evalId` when payload used legacy `id`
  - constructor-only path serialises attrs only, omitting None fields
  - constructor-only path translates Python attr names back to JSON keys

62 tests pass (was 57).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leeclemnet leeclemnet force-pushed the lee/model-evals-cli-sdk branch from aed353e to 361d993 on May 7, 2026 at 18:24
@leeclemnet
Contributor Author

leeclemnet commented May 7, 2026

@digaobarbosa 361d993

I think the important one is the image predictions properties that are different than the API. to_dict seems like a quick win too.

  • M1 (BLOCKER) — eval image-predictions table dropped TP/FP/FN: Renderer now reads stats.truePositives / falsePositives / falseNegatives (the actual server shape) instead of stats.tp/fp/fn. Cluster column now shows just cluster.id instead of stringifying the whole {id, embedding2D} dict. New regression test asserts non-empty TP/FP/FN cells and that embedding2D doesn't leak into the table.
  • H2 — ModelEval.config documented but never assigned: Removed .config from the class + refresh() docstrings. config was previously stripped from the public API response (per earlier review item B), so the SDK has nothing to populate. CLI's matching dead Config: overlap=… iouThreshold=… line in eval get removed too.
  • M4 / N1 — to_dict() fallback was untested + complex: Refactored to _PUBLIC_FIELDS = ((json_key, attr_name), ...) tuple list. Added 4 tests: round-trip with payload + evalId overlay; legacy id overlay; constructor-only path serialises attrs only with None omitted; constructor-only path translates Python version_id→JSON versionId correctly.

Test count: 57 → 62 (+5 new cases).

Contributor

@digaobarbosa digaobarbosa left a comment

LGTM

@leeclemnet leeclemnet merged commit 8071572 into main May 7, 2026
13 checks passed