feat(gateway): add DB-backed resource_version CAS for stored objects

## Problem Statement

OpenShell's generic object store currently persists top-level objects by rewriting the full protobuf payload on `put_message`, with the database only enforcing uniqueness on `id` and `(object_type, name)`. That works for a single writer, but it is not safe once multiple gateway instances can mutate the same stored object.

PR #1242 adds sandbox provider attach/detach and currently relies on whole-object sandbox mutation. A single-gateway mutex can reduce same-process races, but it does not protect HA writers, compute watch/reconcile updates, or other gateway code paths that concurrently rewrite the same sandbox row.

The result is classic lost-update behavior: the last full-payload write wins, even if it was based on stale state. We need a DB-backed compare-and-swap/resource-version mechanism at the object-store boundary so correctness does not depend on in-process locks.

## Technical Context

### What the spike found

- The gateway persistence layer stores opaque protobuf payloads with a narrow indexed schema.
- Generic object writes go through unconditional overwrite helpers:
  - `Store::put(...)` in `crates/openshell-server/src/persistence/mod.rs`
  - `Store::put_message(...)` in `crates/openshell-server/src/persistence/mod.rs`
- Both SQLite and Postgres currently upsert by `(object_type, name)` and replace `payload` wholesale.
- The current behavior preserves the original row `id` while accepting a new caller-supplied `id` on name conflict, which is another signal that updates are name-based overwrites rather than identity/version-checked mutations.
- `ObjectMeta` comments already mention "resource version", but the proto does not actually contain a field for it today.

### Sandbox mutation paths exposed to lost updates

| Location | Current behavior |
|---|---|
| `crates/openshell-server/src/grpc/sandbox.rs` | Attach/detach provider does `get sandbox by name -> mutate providers -> put_message`. |
| `crates/openshell-server/src/compute/mod.rs` | Compute watch/session/delete paths load a sandbox record, mutate status/phase/lifecycle data, and rewrite the object. |
| `crates/openshell-server/src/grpc/policy.rs` | Policy backfill and policy status reporting mutate sandbox fields via full-object writes. |
| `crates/openshell-server/src/grpc/provider.rs` | Provider update is a read/merge/write, while provider delete scans sandboxes then deletes the provider. |
| Settings persistence | Currently relies on a process-local mutex, which is not an HA-safe boundary. |

### Existing patterns to reuse

- Policy revisions already use a DB-enforced optimistic-concurrency pattern based on a unique `(object_type, scope, version)` key and bounded retries.
- The codebase already contains tests showing that read/modify/write without a lock can lose writes.
- Process-local locks can stay temporarily to reduce retries, but correctness needs to move to the database.

## Proposed Design

Add an authoritative `resource_version` to every generic object row and require mutating stored objects through compare-and-swap writes keyed by stable object identity (`id`) plus expected `resource_version`.

This should be a store-level correctness primitive, not a sandbox-specific lock.

### Core design

- Add `resource_version` to the `objects` table in both SQLite and Postgres.
- Treat the DB row's `resource_version` as authoritative.
- On insert, set `resource_version = 1`.
- On update, require `WHERE id = ? AND resource_version = ?`, and atomically bump to the next version.
- On delete, support the same `WHERE id = ? AND resource_version = ?` pattern.
- Read paths must hydrate decoded objects with the DB row's `resource_version`, not trust whatever is currently embedded in the stored payload.
- Update paths must stop using name-based upsert for mutable top-level objects. Resolve by name if needed, then persist by stable `id`.

### Scope boundary

This issue should implement single-object CAS and migrate the high-risk sandbox-related mutation paths first. It should not try to redesign the entire persistence model or normalize every resource into dedicated tables.

Policy revision storage should stay append-only as-is. It already has the right optimistic-concurrency pattern for versioned policy rows.

## API Shape

### Proto / object metadata

Add an actual resource-version field to shared object metadata.

```proto
message ObjectMeta {
  string id = 1;
  string name = 2;
  int64 created_at_ms = 3;
  map<string, string> labels = 4;
  uint64 resource_version = 5;
}
```

Notes:
- This is wire-compatible for existing clients.
- Existing comments already promise resource-version semantics, so this makes the schema match the intent.
- If we want Kubernetes-style opacity later, we can still treat the number as an opaque token at the API boundary.

### Persistence API

Add explicit CAS primitives instead of a generic mutation DSL.

```rust
pub enum WriteCondition {
    MustCreate,
    MatchResourceVersion(u64),
    Unconditional,
}

pub struct WriteResult {
    pub resource_version: u64,
    pub created_at_ms: i64,
    pub updated_at_ms: i64,
}

pub enum PersistenceError {
    // existing variants...
    Conflict {
        current_resource_version: Option<u64>,
    },
}
```

Suggested store methods:

```rust
pub async fn put_if(
    &self,
    object_type: &str,
    id: &str,
    name: &str,
    payload: &[u8],
    labels: Option<&str>,
    condition: WriteCondition,
) -> PersistenceResult<WriteResult>;

pub async fn delete_if(
    &self,
    object_type: &str,
    id: &str,
    expected_resource_version: u64,
) -> PersistenceResult<bool>;
```

Typed wrappers should provide the same semantics for protobuf objects, and typed reads should hydrate `metadata.resource_version` from the DB row.

Pragmatic constraint:
- Keep `put` / `put_message` temporarily for bootstrap and legacy paths.
- New mutable paths should use CAS helpers.
- Once migrated, deprecate unconditional overwrite helpers for top-level mutable objects.

## SQLite / Postgres Implementation Details

### Schema

Add a new non-null column in both backends:

- SQLite: `resource_version INTEGER NOT NULL DEFAULT 1`
- Postgres: `resource_version BIGINT NOT NULL DEFAULT 1`

No new secondary index should be necessary. Updates will use the primary key `id`, plus a version check in the predicate.

### Insert path

For `MustCreate`:

- Insert the row with `resource_version = 1`.
- Preserve current unique constraints:
  - `id` primary key
  - `(object_type, name)` unique when `name IS NOT NULL`

### Update path

For `MatchResourceVersion(v)`:

- Update by `object_type + id + expected resource_version`.
- Set `payload`, `labels`, `updated_at_ms`.
- Set `resource_version = resource_version + 1`.
- Use `RETURNING resource_version, created_at_ms, updated_at_ms` where supported to avoid a second round trip on success.

If the update affects zero rows:
- Follow up with a cheap existence check by `object_type + id`.
- Return `NotFound` if the row is gone.
- Return `Conflict { current_resource_version }` if the row still exists but the version changed.

### Delete path

For CAS deletes:

- `DELETE ... WHERE object_type = ? AND id = ? AND resource_version = ?`.
- On zero rows, distinguish not-found from conflict the same way as updates.

### Read path

Extend `ObjectRecord` with `resource_version`.

When decoding a protobuf object:
- Decode the payload.
- Override or populate `metadata.resource_version` from the DB row before returning it.

This matters because older payloads will not contain the field even after the schema migration.

## Migration / Backcompat Needs

### Database migration

Add a migration for both SQLite and Postgres:

- Add the column.
- Backfill existing rows to `1`.

No payload rewrite is required.

### Proto / client compatibility

- Adding `metadata.resource_version` is backward-compatible in protobuf.
- Older clients will ignore the field.
- Newer clients will start seeing it in `Get*` / `List*` responses without needing request changes.

### Internal compatibility

- Existing persisted payloads remain readable.
- Existing write call sites can keep compiling while we migrate them incrementally.
- `get_message*` should populate `resource_version` from the row so callers do not need to know whether a payload predates the migration.

### Mixed-version rollout caveat

Old binaries will still perform unconditional full-row overwrites without version checks. Because of that:
- Do not enable HA writers during a mixed-version rollout.
- Keep the deployment in single-writer mode until every gateway instance is on the CAS-capable build.
- Call this out explicitly in rollout notes.

## Retry Semantics

The DB layer should not silently retry stale writes. A stale write must surface as a typed conflict so the caller can re-read and re-apply its semantic merge.

Recommended behavior by call site:

- Sandbox attach/detach:
  - Bounded retry, re-read latest sandbox by name, re-run add/remove/dedupe, retry CAS.
  - Safe because desired state is set-like and idempotent.
- Compute watch updates / supervisor session state / startup resume:
  - Bounded retry by sandbox `id`, re-apply derived status/phase changes to the latest row.
  - If retries exhaust, log and rely on the next watch/reconcile event to converge.
- `ReportPolicyStatus` sandbox updates:
  - Bounded retry by sandbox `id` when setting `current_policy_version`.
- Settings mutations:
  - Replace correctness reliance on `settings_mutex` with CAS on the settings object row.
  - The mutex may stay temporarily as a retry-reduction optimization, but not as the only guard.
- Provider update:
  - Bounded retry, re-read latest provider, re-run merge-map logic, retry CAS.
- Delete/reconcile sweeps:
  - Use CAS delete where feasible.
  - On conflict, skip and let the next sweep handle the object instead of forcing a delete.

Recommended retry budget:
- 3 to 5 attempts.
- `tokio::task::yield_now()` or small jitter between attempts.
- Return `ABORTED` after exhaustion for gRPC handlers that cannot safely continue.

## Rollout Plan

1. Add `resource_version` to `ObjectMeta`, `ObjectRecord`, and both DB schemas.
2. Add store-level CAS update/delete primitives and typed conflict errors.
3. Hydrate `metadata.resource_version` on all typed reads.
4. Migrate the highest-risk sandbox writers first:
   - provider attach/detach
   - compute watch/reconcile/session status writes
   - policy backfill of `spec.policy`
   - policy status update of `current_policy_version`
5. Migrate provider update and settings save paths onto the same CAS primitives.
6. Keep any short-term single-gateway lock from PR #1242 only as a best-effort local serialization aid, not as the correctness boundary.
7. Add conflict telemetry and concurrent tests on both SQLite and Postgres.
8. After all sandbox-related writers are CAS-backed, allow HA gateway writers.
9. Deprecate unconditional overwrite helpers for mutable top-level objects and migrate remaining call sites in follow-up work.

## Definition of Done

- `objects.resource_version` exists in both SQLite and Postgres, with migration/backfill coverage.
- `openshell.datamodel.v1.ObjectMeta` includes `resource_version`.
- `ObjectRecord` includes `resource_version`.
- Store exposes CAS update/delete primitives and returns typed conflict errors.
- Typed read helpers hydrate `metadata.resource_version` from the DB row.
- Sandbox attach/detach no longer relies on unconditional `put_message`.
- Compute sandbox status/phase writers no longer rely on unconditional full-row overwrites.
- Policy-related sandbox mutations (`spec.policy` backfill and `current_policy_version`) use CAS.
- Provider update and settings save paths use CAS.
- Concurrency tests cover attach/detach, attach/detach vs compute status update, policy status update vs attach/detach, provider update conflicts, settings conflicts, and SQLite/Postgres backends.
- E2E coverage demonstrates safe sandbox mutation behavior with concurrent writers.
- Metrics/logging exist for conflict count, retry count, and retry exhaustion.

## Notable Risks / Open Questions

- Single-row CAS does not fully solve cross-object invariants. `delete_provider_record()` can still race with sandbox attach unless we add a higher-level transactional composition or redesign the attachment relationship.
- Mixed-version rollouts are unsafe for HA unless explicitly gated to a single writer.
- `put` currently updates by `(object_type, name)` rather than by stable `id`. Migrated call sites must stop depending on name-based overwrite semantics.
- Settings persistence currently generates a fresh UUID on every save attempt before the name-based upsert preserves the old row. That helper should be cleaned up as part of the CAS migration so settings rows retain stable identity cleanly.
- Conflict-heavy paths could become noisy without metrics and bounded retries. We need observability before enabling HA writers.
- We should decide whether future external mutation RPCs need explicit `expected_resource_version` request fields. This issue can defer that if gateway-internal retries are sufficient for current APIs.
- Policy revision rows remain append-only and should not be collapsed into the generic object CAS model.

## Scope Assessment

- Complexity: High
- Confidence: Medium-High
- Estimated files to change: 12-18
- Issue type: `feat`

---
Created from a focused principal-engineer spike. No code changes were made as part of the spike.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(gateway): add DB-backed resource_version CAS for stored objects #1255

Problem Statement

Technical Context

What the spike found

Sandbox mutation paths exposed to lost updates

Existing patterns to reuse

Proposed Design

Core design

Scope boundary

API Shape

Proto / object metadata

Persistence API

SQLite / Postgres Implementation Details

Schema

Insert path

Update path

Delete path

Read path

Migration / Backcompat Needs

Database migration

Proto / client compatibility

Internal compatibility

Mixed-version rollout caveat

Retry Semantics

Rollout Plan

Definition of Done

Notable Risks / Open Questions

Scope Assessment

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Location	Current behavior
`crates/openshell-server/src/grpc/sandbox.rs`	Attach/detach provider does `get sandbox by name -> mutate providers -> put_message`.
`crates/openshell-server/src/compute/mod.rs`	Compute watch/session/delete paths load a sandbox record, mutate status/phase/lifecycle data, and rewrite the object.
`crates/openshell-server/src/grpc/policy.rs`	Policy backfill and policy status reporting mutate sandbox fields via full-object writes.
`crates/openshell-server/src/grpc/provider.rs`	Provider update is a read/merge/write, while provider delete scans sandboxes then deletes the provider.
Settings persistence	Currently relies on a process-local mutex, which is not an HA-safe boundary.

feat(gateway): add DB-backed resource_version CAS for stored objects #1255

Description

Problem Statement

Technical Context

What the spike found

Sandbox mutation paths exposed to lost updates

Existing patterns to reuse

Proposed Design

Core design

Scope boundary

API Shape

Proto / object metadata

Persistence API

SQLite / Postgres Implementation Details

Schema

Insert path

Update path

Delete path

Read path

Migration / Backcompat Needs

Database migration

Proto / client compatibility

Internal compatibility

Mixed-version rollout caveat

Retry Semantics

Rollout Plan

Definition of Done

Notable Risks / Open Questions

Scope Assessment

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions