Skip to content

auth: distinguish "validation failed" from "couldn't validate" in auth profiles#5216

Draft
janniklasrose wants to merge 1 commit intomainfrom
janniklasrose/auth-profile-validate
Draft

auth: distinguish "validation failed" from "couldn't validate" in auth profiles#5216
janniklasrose wants to merge 1 commit intomainfrom
janniklasrose/auth-profile-validate

Conversation

@janniklasrose
Copy link
Copy Markdown
Contributor

Changes

  • databricks auth profiles now reports a three-state result per profile via a new status field (valid, invalid, unknown, unvalidated) plus an optional error describing what went wrong.
  • 401 → invalid with a re-login hint, 403 → invalid with a permissions hint, 5xx / network failure / timeout → unknown (the profile may be fine, the platform isn't reachable). --skip-validate reports unvalidated.
  • The legacy valid field is now *bool and emitted only when the answer is conclusive: true on success, false for proven-bad profiles, omitted for transient/unknown cases that previously misreported as valid: false.
  • Each profile validation is bounded by a 10s per-call timeout, and the same value is wired into cfg.HTTPTimeoutSeconds / cfg.RetryTimeoutSeconds so a hung host-metadata fetch in EnsureResolved (which runs on context.Background internally) can't stall the listing past that ceiling.
  • Text-output column Valid now renders valid (green) / invalid (red) / unknown (yellow) / - instead of YES/NO.

Why

A single valid: false was overloaded across very different conditions — expired token, no network, 5xx from the workspace, malformed config, missing host. Users seeing red NO everywhere had no way to tell whether to re-login, check the VPN, or fix the file. Three-state reporting plus an error string gives them an actionable signal. The legacy field is preserved (as *bool) for the optimistic-path consumers we know about; scripts that branched on valid: false will start seeing the field absent for the cases where we genuinely couldn't tell.

The per-profile timeout keeps auth profiles snappy: previously a single dead host would retry for ~5 minutes against the SDK's default RetryTimeoutSeconds, blocking the whole listing.

Tests

  • New unit tests in cmd/auth/profiles_test.go:
    • TestClassifyValidationError — table-driven coverage of the error → (status, message) mapping (nil, deadline-exceeded, 401, 403, 500, 503, network, fallthrough).
    • TestProfileLoadStatusMatrix — integration via httptest: 401, 403, 500, network-down (closed server), InvalidConfig (via experimental_is_unified_host=true without an account ID), and --skip-validate.
  • Existing TestProfileLoadSPOGConfigType / TestProfileLoadNoDiscoveryStaysWorkspace updated to assert against Status instead of the now-pointer Valid.
  • New acceptance tests under acceptance/cmd/auth/profiles/:
    • expired-token/ — 401 from the SCIM /Me endpoint produces status: "invalid", valid: false, and the remediation hint.
    • server-error/ — 500 produces status: "unknown" with valid omitted.
  • Regenerated affected acceptance outputs (the existing profiles/spog-account test plus tests in auth/login, auth/logout, auth/switch, and auth/host-metadata-cache that include auth profiles output).
  • ./task fmt, ./task checks, ./task lint, and the relevant unit + acceptance test runs all pass locally. The only failing tests on this machine (bundle/deploy/spark-jar-task, apps/runlocal/TestNodeApp*) are unrelated environment issues — missing Java runtime and NODE_OPTIONS injected by my dev shell.

This PR was written by Claude Code.

…th profiles`

Replace the binary `valid: true|false` with a three-state `status` (valid /
invalid / unknown) plus an optional `error` description, so users can tell
an expired token apart from a network blip. The legacy `valid` field is now
a `*bool`: emitted only when the result is conclusive, omitted for transient
errors that previously misreported as `valid: false`. Adds a 10s per-profile
validation timeout (also bounding `EnsureResolved`'s host-metadata fetch via
`HTTPTimeoutSeconds`/`RetryTimeoutSeconds`) so a single dead host no longer
stalls the whole listing.

Co-authored-by: Isaac
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Approval status: pending

/acceptance/auth/ - needs approval

Files: acceptance/auth/host-metadata-cache/output.txt
Suggested: @simonfaltum
Also eligible: @mihaimitrea-db, @tanmay-db, @tejaskochar-db, @renaudhartert-db, @hectorcast-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/cmd/auth/ - needs approval

Files: cmd/auth/profiles.go, cmd/auth/profiles_test.go
Suggested: @simonfaltum
Also eligible: @mihaimitrea-db, @tanmay-db, @tejaskochar-db, @renaudhartert-db, @hectorcast-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

General files (require maintainer)

15 files changed
Based on git history:

  • @simonfaltum -- recent work in ./, cmd/auth/, acceptance/cmd/auth/profiles/

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

@janniklasrose janniklasrose requested a review from simonfaltum May 8, 2026 11:15
@janniklasrose janniklasrose marked this pull request as draft May 8, 2026 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant