Add model evaluations SDK and CLI #475
Conversation
@claude review this PR focusing on 1) alignment of implementation to conceptual API, 2) usability of tools, 3) security
```python
# -- internal -----------------------------------------------------------


def _apply(self, info: Dict[str, Any]) -> None:
```
H2. ModelEval.config is documented but never assigned
- File: `roboflow/core/model_eval.py` (around the `_apply` method, ~line 30)
- Reported by: Claude (HIGH); also noted by OpenAI as Low ("docstring says `.config` is populated, but `ModelEval` never exposes `config`"). Consensus: two of three reviewers caught it; they disagree on severity.
- Detail: The class docstring promises `.config` is populated by `refresh()`, but `_apply()` never sets it. Any caller doing `ev.config` raises `AttributeError`. The PR description even shows `Config: overlap=30 iouThreshold=50` rendered by `eval get`, suggesting the field exists in the API payload — so the gap is in the SDK plumbing, not the API.
- Fix: Add `self.config = info.get("config")` in `_apply()` (and surface it in `to_dict()`), or remove it from the docstring.
Removed .config from the class + refresh() docstrings. config was previously stripped from the public API response (per earlier review item B), so the SDK has nothing to populate. CLI's matching dead Config: overlap=… iouThreshold=… line in eval get removed too.
```python
rows.append(
    {
        "image": img.get("imageName", img.get("imageId", "")),
        "split": img.get("split", ""),
```
Blocker: This comment from automated review seems right.
Here is the API signature.
M1. eval image-predictions table drops TP/FP/FN
- File: `roboflow/cli/handlers/eval.py` (image-predictions human renderer)
- Reported by: OpenAI (Medium). Not flagged by Claude or Gemini.
- Detail: The human-readable table reads `tp`/`fp`/`fn` keys, but the live API payload (per the PR description's transcript) uses `truePositives`/`falsePositives`/`falseNegatives` nested under `stats`. Result: the TP/FP/FN columns render blank for every row. JSON output via `--json` is unaffected.
- Fix: Read from `row["stats"]["truePositives"]` etc., consistent with the documented payload shape, and add a CLI test that asserts non-empty TP/FP/FN cells.
Renderer now reads stats.truePositives / falsePositives / falseNegatives (the actual server shape) instead of stats.tp/fp/fn. Cluster column now shows just cluster.id instead of stringifying the whole {id, embedding2D} dict. New regression test asserts non-empty TP/FP/FN cells and that embedding2D doesn't leak into the table.
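For readers following along, the corrected row-building presumably ends up looking roughly like this (a sketch only — the function name and everything beyond the `stats.truePositives`/`falsePositives`/`falseNegatives` and `cluster.id` keys quoted above are assumptions, not the actual handler code):

```python
from typing import Any, Dict, List


def build_image_prediction_rows(images: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sketch of the fixed renderer: camelCase stats keys, cluster.id only."""
    rows = []
    for img in images:
        stats = img.get("stats") or {}
        cluster = img.get("cluster") or {}
        rows.append(
            {
                "image": img.get("imageName", img.get("imageId", "")),
                "split": img.get("split", ""),
                "tp": stats.get("truePositives", ""),
                "fp": stats.get("falsePositives", ""),
                "fn": stats.get("falseNegatives", ""),
                # Only the cluster id; embedding2D stays out of the human table.
                "cluster": cluster.get("id", ""),
            }
        )
    return rows
```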
```python
# -- helpers ------------------------------------------------------------


def to_dict(self) -> Dict[str, Any]:
```
Two comments that I found relevant:
M4 · model_eval.py:to_dict() — no test coverage, non-trivial logic
The fallback branch in to_dict() (lines 139–152) that handles the "constructed with
no info" case uses a dict lookup to map JSON keys to Python attribute names and is
never exercised by any test. Given that this method will be the primary serialisation
path for programmatic users, it should have at minimum:
- A test where `ModelEval` is constructed with `info=None`, then `to_dict()` is called
- A test where `info` is supplied and `to_dict()` is called (verifies the `_raw` path)
The logic itself is also overcomplicated — a simple list of (json_key, attr_name)
tuples would be clearer (see Nit N1).
N1 · to_dict() fallback branch is needlessly complex
```python
# model_eval.py lines 139–152
for key in ("status", "project", "versionId", "modelId", "createdAt", "summary"):
    attr = (
        key
        if key in {"status", "project", "summary"}
        else {
            "versionId": "version_id",
            "modelId": "model_id",
            "createdAt": "created_at",
        }[key]
    )
```
A flat list of (json_key, attr_name) pairs is cleaner:
```python
_FIELD_MAP = [
    ("status", "status"),
    ("project", "project"),
    ("versionId", "version_id"),
    ("modelId", "model_id"),
    ("createdAt", "created_at"),
    ("summary", "summary"),
]
for json_key, attr_name in _FIELD_MAP:
    value = getattr(self, attr_name, None)
    if value is not None:
        data[json_key] = value
```
Refactored to a `_PUBLIC_FIELDS = ((json_key, attr_name), ...)` tuple list. Added 4 tests: round-trip with payload + `evalId` overlay; legacy `id` overlay; constructor-only path serialises attrs only with None omitted; constructor-only path translates Python `version_id` → JSON `versionId` correctly.
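For illustration, the constructor-only serialisation case could be asserted along these lines (a sketch under assumptions: the `ModelEval` keyword-argument constructor and import path are inferred from the discussion above, not copied from the test file):

```python
from roboflow.core.model_eval import ModelEval  # import path assumed


def test_to_dict_constructor_only_translates_attr_names():
    # Constructed without a server payload; attribute names are Python-style.
    ev = ModelEval(version_id="3", model_id="demo-project/3", info=None)
    data = ev.to_dict()
    # Python attr names are emitted as their JSON keys...
    assert data["versionId"] == "3"
    assert data["modelId"] == "demo-project/3"
    # ...and unset fields are omitted rather than serialised as None.
    assert "createdAt" not in data
```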
digaobarbosa left a comment
I think the important one is the image-predictions properties that are different from the API.
to_dict seems like a quick win too.
Force-pushed 98ca16c to aed353e
Wraps the public /{workspace}/model-evals REST surface (roboflow/roboflow#11636)
so users can read evaluation results — mAP, confidence sweep, per-class
performance, confusion matrix, vector clusters, per-image stats, and
recommendations — from Python and from the CLI without hitting the API directly.
SDK:
- Workspace.evals(...) and Workspace.eval(eval_id) accessors return ModelEval
instances; ModelEval has one method per panel returning the raw JSON dict.
- Typed exceptions (ModelEvalNotFoundError, ModelEvalNotDoneError,
InvalidSplitError, InvalidConfidenceError) so callers can distinguish "doesn't
exist" from "still running" from "bad argument" without parsing strings.
CLI: roboflow eval {list, get, map-results, confidence-sweep,
performance-by-class, confusion-matrix, vector-analysis, image-predictions,
recommendations} — every command honors --json. Exit codes are stable per
error class (3=not found, 4=not done, 5=invalid arg).
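One plausible shape for that stable error-class-to-exit-code mapping (a sketch, assuming the handlers catch the typed exceptions above; the real handler code may structure this differently, and the import path is assumed):

```python
import sys

from roboflow.core.model_eval import (  # assumed location of the typed errors
    InvalidConfidenceError,
    InvalidSplitError,
    ModelEvalNotDoneError,
    ModelEvalNotFoundError,
)

EXIT_CODES = {
    ModelEvalNotFoundError: 3,   # not found
    ModelEvalNotDoneError: 4,    # not done
    InvalidSplitError: 5,        # invalid arg
    InvalidConfidenceError: 5,   # invalid arg
}


def run_handler(handler, *args, **kwargs) -> int:
    """Run a CLI handler and translate typed errors into stable exit codes."""
    try:
        handler(*args, **kwargs)
        return 0
    except tuple(EXIT_CODES) as exc:
        print(exc, file=sys.stderr)
        return EXIT_CODES[type(exc)]
```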
Tests cover the adapter URL/param plumbing and error mapping (both flat and
nested error envelopes), the ModelEval class, the Workspace accessors, and
each CLI handler's adapter call + error path.
Companion docs in roboflow/roboflow-dev-reference#18.
The REST API returns a single flat shape {"error": "code", "message": "..."}
— the agent's original adapter accepted both flat and nested shapes for
forward-compat, but the nested shape never shipped. Drop the dead branch and
the corresponding test; replace with a status-code-fallback test that exercises
the existing 404/409 fallback paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
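For reference, parsing that flat envelope needs nothing more than a pair of lookups (a sketch; the adapter's real helper and its exception-mapping table aren't shown in this excerpt):

```python
from typing import Any, Dict, Optional, Tuple


def parse_error_envelope(payload: Dict[str, Any]) -> Tuple[Optional[str], str]:
    """Read the flat shape {"error": "code", "message": "..."} the API returns.

    The nested {"error": {"code": ..., "message": ...}} variant never shipped,
    so no branching on the envelope shape is needed.
    """
    return payload.get("error"), payload.get("message", "")
```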
Address PR review on roboflow#11636 affecting the SDK/CLI:
- ModelEval._apply reads evalId (legacy id fallback for forward-compat)
- to_dict emits evalId
- Workspace.evals resolves either field when constructing ModelEval
- CLI list/get handlers prefer evalId, fall back to id
- Drop the undocumented `config` attribute (not part of public DNA shape)
- Tests updated for evalId; 57 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pairs with roboflow#11636 dropping `projectId` from the public response. The SDK previously read `info["projectId"]` (the Firestore doc id) into `ModelEval.project_id`. That field was a doc-id leak — the API now only returns `project` (the URL slug) on the principle that public APIs should not expose storage-layer ids.

Rename: `ModelEval.project_id` → `ModelEval.project`. Accept legacy `projectId` from cached older-server responses for forward-compat. CLI list/get handlers also pull from `project` first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues raised in review:

M1 (BLOCKER) — `eval image-predictions` table dropped TP/FP/FN. The public API nests those counts under `stats` with camelCase keys (`truePositives`/`falsePositives`/`falseNegatives`); the renderer was reading `stats.tp`/`stats.fp`/`stats.fn` and silently rendering blanks. Same line was also rendering the whole `cluster` object (including the `embedding2D` array) in the cluster column; now renders `cluster.id` only. Live transcript regression case added.

H2 — `ModelEval`'s class docstring promised `.config` was populated by `refresh()`, but `_apply()` never set it. Drop the reference. The `config` field was previously stripped from the public API response (per earlier review item B — `overlap`/`iouThreshold` weren't documented in DNA), so the SDK never has anything to populate. CLI's matching dead "Config: overlap=… iouThreshold=…" line in `eval get` also removed.

M4/N1 — `to_dict()` had an untested fallback branch + an awkward inline-conditional dict-lookup mapping json keys to attr names. Refactor to a flat `_PUBLIC_FIELDS = ((json_key, attr_name), ...)` tuple list. Add four tests:
- round-trips a server payload with `evalId` overlay
- overlays `evalId` when payload used legacy `id`
- constructor-only path serialises attrs only, omitting None fields
- constructor-only path translates Python attr names back to JSON keys

62 tests pass (was 57).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed aed353e to 361d993
Test count: 57 → 62 (+5 new cases).
Summary
Adds Python SDK + CLI bindings for the new public Model Evaluations REST API. Mirrors every UI panel in the app's evaluation page as a method on `ModelEval` and a subcommand under `roboflow eval`. This is PR 4 of a 4-PR stack:
Important
Blocked on roboflow#11636 deploying. The CLI/SDK calls REST endpoints that ship in that PR — until 11636 merges and deploys to a given environment, calls return `Unsupported request. GET /:workspace/model-evals does not exist...`. For pre-deploy testing, set `API_URL=https://localapi.roboflow.one` (the local dev server hosts 11636).
SDK

Typed errors so callers don't parse strings: `ModelEvalNotFoundError`, `ModelEvalNotDoneError`, `InvalidSplitError`, `InvalidConfidenceError` (all subclass `RoboflowError`).
CLI

Honors the standard global flags: `--json`/`-j`, `--workspace`/`-w`, `--api-key`/`-k`, `--quiet`/`-q`. Exit codes match error categories: `3` = not found, `4` = not done, `5` = invalid input, `2` = missing workspace/auth. `list` and `performance-by-class` render ASCII tables for human output; the dense panels (map-results, confusion-matrix, vector-analysis, image-predictions, recommendations, confidence-sweep) pretty-print JSON since their nested shapes don't tabulate cleanly. Full structured access via `--json`.
Tests

`Workspace.evals`/`eval` filter forwarding. `--help` for every subcommand, list filters, JSON vs text output, exit-code mapping for 404/409/400 across handlers, panel-arg forwarding for all 7 panels.
Live verification

Verified against `localapi.roboflow.one` with the test API key against workspace `lee-sandbox`. All 9 endpoints return correct data through both SDK and CLI. Error paths verified: `eval get NOT_A_REAL_ID` → exit 3 with `model_eval_not_found`; `eval map-results <running-id>` → exit 4 with `model_eval_not_done`; `eval performance-by-class <id> --split all` → exit 5 with `invalid_split`; `eval confusion-matrix <id> --confidence 999` → exit 5 with `invalid_confidence`. SDK round-trip through `Workspace.evals()` / `Workspace.eval().panel()` confirmed end-to-end including the typed `ModelEvalNotDoneError` raise.

🤖 Generated with Claude Code
Live verification
Transcript: all 9 subcommands + 4 error paths against local dev server
Exit codes: `0` success · `2` missing workspace/auth · `3` not found · `4` not done · `5` invalid input.

Live verification — staging (`api.roboflow.one`)

All 9 `roboflow eval` subcommands exercised against the staging API on commit `6eea8aa`. Workspace `lee-sandbox`, eval `huUF720inUcymARwqAGK`. Verbose outputs (confidence-sweep, vector-analysis) are truncated for readability — full payloads available via `--json` from the same command.

```shell
$ roboflow --workspace lee-sandbox eval list --limit 5
$ roboflow --workspace lee-sandbox eval get huUF720inUcymARwqAGK
$ roboflow --workspace lee-sandbox eval map-results huUF720inUcymARwqAGK
$ roboflow --workspace lee-sandbox eval confidence-sweep huUF720inUcymARwqAGK
$ roboflow --workspace lee-sandbox eval performance-by-class huUF720inUcymARwqAGK --split test
$ roboflow --workspace lee-sandbox eval confusion-matrix huUF720inUcymARwqAGK --split test
$ roboflow --workspace lee-sandbox eval vector-analysis huUF720inUcymARwqAGK
$ roboflow --workspace lee-sandbox eval image-predictions huUF720inUcymARwqAGK --split test --limit 2
$ roboflow --workspace lee-sandbox eval recommendations huUF720inUcymARwqAGK
```