Add model evaluations SDK and CLI #475
Conversation
@claude review this PR focusing on 1) alignment of implementation to conceptual API, 2) usability of tools, 3) security
```python
# -- internal -----------------------------------------------------------


def _apply(self, info: Dict[str, Any]) -> None:
```
H2. ModelEval.config is documented but never assigned
- File: `roboflow/core/model_eval.py` (around the `_apply` method, ~line 30)
- Reported by: Claude (HIGH); also noted by OpenAI as Low ("docstring says `.config` is populated, but `ModelEval` never exposes `config`"). Consensus: two of three reviewers caught it; they disagree on severity.
- Detail: The class docstring promises `.config` is populated by `refresh()`, but `_apply()` never sets it. Any caller doing `ev.config` raises `AttributeError`. The PR description even shows `Config: overlap=30 iouThreshold=50` rendered by `eval get`, suggesting the field exists in the API payload — so the gap is in the SDK plumbing, not the API.
- Fix: Add `self.config = info.get("config")` in `_apply()` (and surface it in `to_dict()`), or remove it from the docstring.
Removed .config from the class + refresh() docstrings. config was previously stripped from the public API response (per earlier review item B), so the SDK has nothing to populate. CLI's matching dead Config: overlap=… iouThreshold=… line in eval get removed too.
```python
rows.append(
    {
        "image": img.get("imageName", img.get("imageId", "")),
        "split": img.get("split", ""),
```
Blocker: This comment from automated review seems right.
Here is the API signature.
M1. eval image-predictions table drops TP/FP/FN
- File: `roboflow/cli/handlers/eval.py` (image-predictions human renderer)
- Reported by: OpenAI (Medium). Not flagged by Claude or Gemini.
- Detail: The human-readable table reads `tp`/`fp`/`fn` keys, but the live API payload (per the PR description's transcript) uses `truePositives`/`falsePositives`/`falseNegatives` nested under `stats`. Result: the TP/FP/FN columns render blank for every row. JSON output via `--json` is unaffected.
- Fix: Read from `row["stats"]["truePositives"]` etc., consistent with the documented payload shape, and add a CLI test that asserts non-empty TP/FP/FN cells.
Renderer now reads stats.truePositives / falsePositives / falseNegatives (the actual server shape) instead of stats.tp/fp/fn. Cluster column now shows just cluster.id instead of stringifying the whole {id, embedding2D} dict. New regression test asserts non-empty TP/FP/FN cells and that embedding2D doesn't leak into the table.
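For readers following along, the corrected row-building presumably ends up looking roughly like this (a sketch only — the function name and everything beyond the `stats.truePositives`/`falsePositives`/`falseNegatives` and `cluster.id` keys quoted above are assumptions, not the actual handler code):

```python
from typing import Any, Dict, List


def build_image_prediction_rows(images: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sketch of the fixed renderer: camelCase stats keys, cluster.id only."""
    rows = []
    for img in images:
        stats = img.get("stats") or {}
        cluster = img.get("cluster") or {}
        rows.append(
            {
                "image": img.get("imageName", img.get("imageId", "")),
                "split": img.get("split", ""),
                "tp": stats.get("truePositives", ""),
                "fp": stats.get("falsePositives", ""),
                "fn": stats.get("falseNegatives", ""),
                # Only the cluster id; embedding2D stays out of the human table.
                "cluster": cluster.get("id", ""),
            }
        )
    return rows
```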
```python
# -- helpers ------------------------------------------------------------


def to_dict(self) -> Dict[str, Any]:
```
Two comments that I found relevant:
M4 · model_eval.py:to_dict() — no test coverage, non-trivial logic
The fallback branch in to_dict() (lines 139–152) that handles the "constructed with
no info" case uses a dict lookup to map JSON keys to Python attribute names and is
never exercised by any test. Given that this method will be the primary serialisation
path for programmatic users, it should have at minimum:
- A test where `ModelEval` is constructed with `info=None`, then `to_dict()` is called
- A test where `info` is supplied and `to_dict()` is called (verifies the `_raw` path)
The logic itself is also overcomplicated — a simple list of (json_key, attr_name)
tuples would be clearer (see Nit N1).
N1 · to_dict() fallback branch is needlessly complex
```python
# model_eval.py lines 139–152
for key in ("status", "project", "versionId", "modelId", "createdAt", "summary"):
    attr = (
        key
        if key in {"status", "project", "summary"}
        else {
            "versionId": "version_id",
            "modelId": "model_id",
            "createdAt": "created_at",
        }[key]
    )
```
A flat list of (json_key, attr_name) pairs is cleaner:
```python
_FIELD_MAP = [
    ("status", "status"),
    ("project", "project"),
    ("versionId", "version_id"),
    ("modelId", "model_id"),
    ("createdAt", "created_at"),
    ("summary", "summary"),
]
for json_key, attr_name in _FIELD_MAP:
    value = getattr(self, attr_name, None)
    if value is not None:
        data[json_key] = value
```
Refactored to a `_PUBLIC_FIELDS = ((json_key, attr_name), ...)` tuple list. Added 4 tests: round-trip with payload + `evalId` overlay; legacy `id` overlay; constructor-only path serialises attrs only with None omitted; constructor-only path translates Python `version_id` → JSON `versionId` correctly.
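For illustration, the constructor-only serialisation case could be asserted along these lines (a sketch under assumptions: the `ModelEval` keyword-argument constructor and import path are inferred from the discussion above, not copied from the test file):

```python
from roboflow.core.model_eval import ModelEval  # import path assumed


def test_to_dict_constructor_only_translates_attr_names():
    # Constructed without a server payload; attribute names are Python-style.
    ev = ModelEval(version_id="3", model_id="demo-project/3", info=None)
    data = ev.to_dict()
    # Python attr names are emitted as their JSON keys...
    assert data["versionId"] == "3"
    assert data["modelId"] == "demo-project/3"
    # ...and unset fields are omitted rather than serialised as None.
    assert "createdAt" not in data
```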
digaobarbosa left a comment
I think the important one is the image-predictions properties that are different from the API.
to_dict seems like a quick win too.
Force-pushed 98ca16c to aed353e
Wraps the public /{workspace}/model-evals REST surface (roboflow/roboflow#11636)
so users can read evaluation results — mAP, confidence sweep, per-class
performance, confusion matrix, vector clusters, per-image stats, and
recommendations — from Python and from the CLI without hitting the API directly.
SDK:
- Workspace.evals(...) and Workspace.eval(eval_id) accessors return ModelEval
instances; ModelEval has one method per panel returning the raw JSON dict.
- Typed exceptions (ModelEvalNotFoundError, ModelEvalNotDoneError,
InvalidSplitError, InvalidConfidenceError) so callers can distinguish "doesn't
exist" from "still running" from "bad argument" without parsing strings.
CLI: roboflow eval {list, get, map-results, confidence-sweep,
performance-by-class, confusion-matrix, vector-analysis, image-predictions,
recommendations} — every command honors --json. Exit codes are stable per
error class (3=not found, 4=not done, 5=invalid arg).
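One plausible shape for that stable error-class-to-exit-code mapping (a sketch, assuming the handlers catch the typed exceptions above; the real handler code may structure this differently, and the import path is assumed):

```python
import sys

from roboflow.core.model_eval import (  # assumed location of the typed errors
    InvalidConfidenceError,
    InvalidSplitError,
    ModelEvalNotDoneError,
    ModelEvalNotFoundError,
)

EXIT_CODES = {
    ModelEvalNotFoundError: 3,   # not found
    ModelEvalNotDoneError: 4,    # not done
    InvalidSplitError: 5,        # invalid arg
    InvalidConfidenceError: 5,   # invalid arg
}


def run_handler(handler, *args, **kwargs) -> int:
    """Run a CLI handler and translate typed errors into stable exit codes."""
    try:
        handler(*args, **kwargs)
        return 0
    except tuple(EXIT_CODES) as exc:
        print(exc, file=sys.stderr)
        return EXIT_CODES[type(exc)]
```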
Tests cover the adapter URL/param plumbing and error mapping (both flat and
nested error envelopes), the ModelEval class, the Workspace accessors, and
each CLI handler's adapter call + error path.
Companion docs in roboflow/roboflow-dev-reference#18.
The REST API returns a single flat shape {"error": "code", "message": "..."}
— the agent's original adapter accepted both flat and nested shapes for
forward-compat, but the nested shape never shipped. Drop the dead branch and
the corresponding test; replace with a status-code-fallback test that exercises
the existing 404/409 fallback paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
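For reference, parsing that flat envelope needs nothing more than a pair of lookups (a sketch; the adapter's real helper and its exception-mapping table aren't shown in this excerpt):

```python
from typing import Any, Dict, Optional, Tuple


def parse_error_envelope(payload: Dict[str, Any]) -> Tuple[Optional[str], str]:
    """Read the flat shape {"error": "code", "message": "..."} the API returns.

    The nested {"error": {"code": ..., "message": ...}} variant never shipped,
    so no branching on the envelope shape is needed.
    """
    return payload.get("error"), payload.get("message", "")
```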
Address PR review on roboflow#11636 affecting the SDK/CLI:
- ModelEval._apply reads evalId (legacy id fallback for forward-compat)
- to_dict emits evalId
- Workspace.evals resolves either field when constructing ModelEval
- CLI list/get handlers prefer evalId, fall back to id
- Drop the undocumented `config` attribute (not part of public DNA shape)
- Tests updated for evalId; 57 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pairs with roboflow#11636 dropping `projectId` from the public response. The SDK previously read `info["projectId"]` (the Firestore doc id) into `ModelEval.project_id`. That field was a doc-id leak — the API now only returns `project` (the URL slug) on the principle that public APIs should not expose storage-layer ids.

Rename: `ModelEval.project_id` → `ModelEval.project`. Accept legacy `projectId` from cached older-server responses for forward-compat. CLI list/get handlers also pull from `project` first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues raised in review:

M1 (BLOCKER) — `eval image-predictions` table dropped TP/FP/FN. The public API nests those counts under `stats` with camelCase keys (`truePositives`/`falsePositives`/`falseNegatives`); the renderer was reading `stats.tp`/`stats.fp`/`stats.fn` and silently rendering blanks. Same line was also rendering the whole `cluster` object (including the `embedding2D` array) in the cluster column; now renders `cluster.id` only. Live transcript regression case added.

H2 — `ModelEval`'s class docstring promised `.config` was populated by `refresh()`, but `_apply()` never set it. Drop the reference. The `config` field was previously stripped from the public API response (per earlier review item B — `overlap`/`iouThreshold` weren't documented in DNA), so the SDK never has anything to populate. CLI's matching dead "Config: overlap=… iouThreshold=…" line in `eval get` also removed.

M4/N1 — `to_dict()` had an untested fallback branch + an awkward inline-conditional dict-lookup mapping json keys to attr names. Refactor to a flat `_PUBLIC_FIELDS = ((json_key, attr_name), ...)` tuple list. Add four tests:
- round-trips a server payload with `evalId` overlay
- overlays `evalId` when payload used legacy `id`
- constructor-only path serialises attrs only, omitting None fields
- constructor-only path translates Python attr names back to JSON keys

62 tests pass (was 57).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed aed353e to 361d993
Test count: 57 → 62 (+5 new cases).
Summary
Adds Python SDK + CLI bindings for the new public Model Evaluations REST API. Mirrors every UI panel in the app's evaluation page as a method on `ModelEval` and a subcommand under `roboflow eval`. This is PR 4 of a 4-PR stack:
Important
Blocked on roboflow#11636 deploying. The CLI/SDK calls REST endpoints that ship in that PR — until 11636 merges and deploys to a given environment, calls return `Unsupported request. GET /:workspace/model-evals does not exist...`. For pre-deploy testing, set `API_URL=https://localapi.roboflow.one` (the local dev server hosts 11636).
SDK

Typed errors so callers don't parse strings: `ModelEvalNotFoundError`, `ModelEvalNotDoneError`, `InvalidSplitError`, `InvalidConfidenceError` (all subclass `RoboflowError`).
CLI

Honors the standard global flags: `--json`/`-j`, `--workspace`/`-w`, `--api-key`/`-k`, `--quiet`/`-q`. Exit codes match error categories: `3` = not found, `4` = not done, `5` = invalid input, `2` = missing workspace/auth. `list` and `performance-by-class` render ASCII tables for human output; the dense panels (map-results, confusion-matrix, vector-analysis, image-predictions, recommendations, confidence-sweep) pretty-print JSON since their nested shapes don't tabulate cleanly. Full structured access via `--json`.
Tests

`Workspace.evals`/`eval` filter forwarding. `--help` for every subcommand, list filters, JSON vs text output, exit-code mapping for 404/409/400 across handlers, panel-arg forwarding for all 7 panels.
Live verification

Verified against `localapi.roboflow.one` with the test API key against workspace `lee-sandbox`. All 9 endpoints return correct data through both SDK and CLI. Error paths verified: `eval get NOT_A_REAL_ID` → exit 3 with `model_eval_not_found`; `eval map-results <running-id>` → exit 4 with `model_eval_not_done`; `eval performance-by-class <id> --split all` → exit 5 with `invalid_split`; `eval confusion-matrix <id> --confidence 999` → exit 5 with `invalid_confidence`. SDK round-trip through `Workspace.evals()` / `Workspace.eval().panel()` confirmed end-to-end including the typed `ModelEvalNotDoneError` raise.

🤖 Generated with Claude Code
Live verification
Transcript: all 9 subcommands + 4 error paths against local dev server
Exit codes: `0` success · `2` missing workspace/auth · `3` not found · `4` not done · `5` invalid input.

Live verification — staging (`api.roboflow.one`)

All 9 `roboflow eval` subcommands exercised against the staging API on commit `6eea8aa`. Workspace `lee-sandbox`, eval `huUF720inUcymARwqAGK`. Verbose outputs (confidence-sweep, vector-analysis) are truncated for readability — full payloads available via `--json` from the same command.

```shell
$ roboflow --workspace lee-sandbox eval list --limit 5
$ roboflow --workspace lee-sandbox eval get huUF720inUcymARwqAGK
$ roboflow --workspace lee-sandbox eval map-results huUF720inUcymARwqAGK
$ roboflow --workspace lee-sandbox eval confidence-sweep huUF720inUcymARwqAGK
$ roboflow --workspace lee-sandbox eval performance-by-class huUF720inUcymARwqAGK --split test
$ roboflow --workspace lee-sandbox eval confusion-matrix huUF720inUcymARwqAGK --split test
$ roboflow --workspace lee-sandbox eval vector-analysis huUF720inUcymARwqAGK
$ roboflow --workspace lee-sandbox eval image-predictions huUF720inUcymARwqAGK --split test --limit 2
$ roboflow --workspace lee-sandbox eval recommendations huUF720inUcymARwqAGK
```