You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
OpenShell's generic object store currently persists top-level objects by rewriting the full protobuf payload on put_message, with the database only enforcing uniqueness on id and (object_type, name). That works for a single writer, but it is not safe once multiple gateway instances can mutate the same stored object.
PR #1242 adds sandbox provider attach/detach and currently relies on whole-object sandbox mutation. A single-gateway mutex can reduce same-process races, but it does not protect HA writers, compute watch/reconcile updates, or other gateway code paths that concurrently rewrite the same sandbox row.
The result is classic lost-update behavior: the last full-payload write wins, even if it was based on stale state. We need a DB-backed compare-and-swap/resource-version mechanism at the object-store boundary so correctness does not depend on in-process locks.
Technical Context
What the spike found
The gateway persistence layer stores opaque protobuf payloads with a narrow indexed schema.
Generic object writes go through unconditional overwrite helpers:
Store::put(...) in crates/openshell-server/src/persistence/mod.rs
Store::put_message(...) in crates/openshell-server/src/persistence/mod.rs
Both SQLite and Postgres currently upsert by (object_type, name) and replace payload wholesale.
The current behavior preserves the original row id while accepting a new caller-supplied id on name conflict, which is another signal that updates are name-based overwrites rather than identity/version-checked mutations.
ObjectMeta comments already mention "resource version", but the proto does not actually contain a field for it today.
Sandbox mutation paths exposed to lost updates
Location
Current behavior
crates/openshell-server/src/grpc/sandbox.rs
Attach/detach provider does get sandbox by name -> mutate providers -> put_message.
crates/openshell-server/src/compute/mod.rs
Compute watch/session/delete paths load a sandbox record, mutate status/phase/lifecycle data, and rewrite the object.
crates/openshell-server/src/grpc/policy.rs
Policy backfill and policy status reporting mutate sandbox fields via full-object writes.
crates/openshell-server/src/grpc/provider.rs
Provider update is a read/merge/write, while provider delete scans sandboxes then deletes the provider.
Settings persistence
Currently relies on a process-local mutex, which is not an HA-safe boundary.
Existing patterns to reuse
Policy revisions already use a DB-enforced optimistic-concurrency pattern based on a unique (object_type, scope, version) key and bounded retries.
The codebase already contains tests showing that read/modify/write without a lock can lose writes.
Process-local locks can stay temporarily to reduce retries, but correctness needs to move to the database.
Proposed Design
Add an authoritative resource_version to every generic object row and require mutating stored objects through compare-and-swap writes keyed by stable object identity (id) plus expected resource_version.
This should be a store-level correctness primitive, not a sandbox-specific lock.
Core design
Add resource_version to the objects table in both SQLite and Postgres.
Treat the DB row's resource_version as authoritative.
On insert, set resource_version = 1.
On update, require WHERE id = ? AND resource_version = ?, and atomically bump to the next version.
On delete, support the same WHERE id = ? AND resource_version = ? pattern.
Read paths must hydrate decoded objects with the DB row's resource_version, not trust whatever is currently embedded in the stored payload.
Update paths must stop using name-based upsert for mutable top-level objects. Resolve by name if needed, then persist by stable id.
Scope boundary
This issue should implement single-object CAS and migrate the high-risk sandbox-related mutation paths first. It should not try to redesign the entire persistence model or normalize every resource into dedicated tables.
Policy revision storage should stay append-only as-is. It already has the right optimistic-concurrency pattern for versioned policy rows.
API Shape
Proto / object metadata
Add an actual resource-version field to shared object metadata.
Typed wrappers should provide the same semantics for protobuf objects, and typed reads should hydrate metadata.resource_version from the DB row.
Pragmatic constraint:
Keep put / put_message temporarily for bootstrap and legacy paths.
New mutable paths should use CAS helpers.
Once migrated, deprecate unconditional overwrite helpers for top-level mutable objects.
SQLite / Postgres Implementation Details
Schema
Add a new non-null column in both backends:
SQLite: resource_version INTEGER NOT NULL DEFAULT 1
Postgres: resource_version BIGINT NOT NULL DEFAULT 1
No new secondary index should be necessary. Updates will use the primary key id, plus a version check in the predicate.
Insert path
For MustCreate:
Insert the row with resource_version = 1.
Preserve current unique constraints:
id primary key
(object_type, name) unique when name IS NOT NULL
Update path
For MatchResourceVersion(v):
Update by object_type + id + expected resource_version.
Set payload, labels, updated_at_ms.
Set resource_version = resource_version + 1.
Use RETURNING resource_version, created_at_ms, updated_at_ms where supported to avoid a second round trip on success.
If the update affects zero rows:
Follow up with a cheap existence check by object_type + id.
Return NotFound if the row is gone.
Return Conflict { current_resource_version } if the row still exists but the version changed.
Delete path
For CAS deletes:
DELETE ... WHERE object_type = ? AND id = ? AND resource_version = ?.
On zero rows, distinguish not-found from conflict the same way as updates.
Read path
Extend ObjectRecord with resource_version.
When decoding a protobuf object:
Decode the payload.
Override or populate metadata.resource_version from the DB row before returning it.
This matters because older payloads will not contain the field even after the schema migration.
Migration / Backcompat Needs
Database migration
Add a migration for both SQLite and Postgres:
Add the column.
Backfill existing rows to 1.
No payload rewrite is required.
Proto / client compatibility
Adding metadata.resource_version is backward-compatible in protobuf.
Older clients will ignore the field.
Newer clients will start seeing it in Get* / List* responses without needing request changes.
Internal compatibility
Existing persisted payloads remain readable.
Existing write call sites can keep compiling while we migrate them incrementally.
get_message* should populate resource_version from the row so callers do not need to know whether a payload predates the migration.
Mixed-version rollout caveat
Old binaries will still perform unconditional full-row overwrites without version checks. Because of that:
Do not enable HA writers during a mixed-version rollout.
Keep the deployment in single-writer mode until every gateway instance is on the CAS-capable build.
Call this out explicitly in rollout notes.
Retry Semantics
The DB layer should not silently retry stale writes. A stale write must surface as a typed conflict so the caller can re-read and re-apply its semantic merge.
Add conflict telemetry and concurrent tests on both SQLite and Postgres.
After all sandbox-related writers are CAS-backed, allow HA gateway writers.
Deprecate unconditional overwrite helpers for mutable top-level objects and migrate remaining call sites in follow-up work.
Definition of Done
objects.resource_version exists in both SQLite and Postgres, with migration/backfill coverage.
openshell.datamodel.v1.ObjectMeta includes resource_version.
ObjectRecord includes resource_version.
Store exposes CAS update/delete primitives and returns typed conflict errors.
Typed read helpers hydrate metadata.resource_version from the DB row.
Sandbox attach/detach no longer relies on unconditional put_message.
Compute sandbox status/phase writers no longer rely on unconditional full-row overwrites.
Policy-related sandbox mutations (spec.policy backfill and current_policy_version) use CAS.
Provider update and settings save paths use CAS.
Concurrency tests cover attach/detach, attach/detach vs compute status update, policy status update vs attach/detach, provider update conflicts, settings conflicts, and SQLite/Postgres backends.
E2E coverage demonstrates safe sandbox mutation behavior with concurrent writers.
Metrics/logging exist for conflict count, retry count, and retry exhaustion.
Notable Risks / Open Questions
Single-row CAS does not fully solve cross-object invariants. delete_provider_record() can still race with sandbox attach unless we add a higher-level transactional composition or redesign the attachment relationship.
Mixed-version rollouts are unsafe for HA unless explicitly gated to a single writer.
put currently updates by (object_type, name) rather than by stable id. Migrated call sites must stop depending on name-based overwrite semantics.
Settings persistence currently generates a fresh UUID on every save attempt before the name-based upsert preserves the old row. That helper should be cleaned up as part of the CAS migration so settings rows retain stable identity cleanly.
Conflict-heavy paths could become noisy without metrics and bounded retries. We need observability before enabling HA writers.
We should decide whether future external mutation RPCs need explicit expected_resource_version request fields. This issue can defer that if gateway-internal retries are sufficient for current APIs.
Policy revision rows remain append-only and should not be collapsed into the generic object CAS model.
Scope Assessment
Complexity: High
Confidence: Medium-High
Estimated files to change: 12-18
Issue type: feat
Created from a focused principal-engineer spike. No code changes were made as part of the spike.
Problem Statement
OpenShell's generic object store currently persists top-level objects by rewriting the full protobuf payload on
put_message, with the database only enforcing uniqueness onidand(object_type, name). That works for a single writer, but it is not safe once multiple gateway instances can mutate the same stored object.PR #1242 adds sandbox provider attach/detach and currently relies on whole-object sandbox mutation. A single-gateway mutex can reduce same-process races, but it does not protect HA writers, compute watch/reconcile updates, or other gateway code paths that concurrently rewrite the same sandbox row.
The result is classic lost-update behavior: the last full-payload write wins, even if it was based on stale state. We need a DB-backed compare-and-swap/resource-version mechanism at the object-store boundary so correctness does not depend on in-process locks.
Technical Context
What the spike found
Store::put(...)incrates/openshell-server/src/persistence/mod.rsStore::put_message(...)incrates/openshell-server/src/persistence/mod.rs(object_type, name)and replacepayloadwholesale.idwhile accepting a new caller-suppliedidon name conflict, which is another signal that updates are name-based overwrites rather than identity/version-checked mutations.ObjectMetacomments already mention "resource version", but the proto does not actually contain a field for it today.Sandbox mutation paths exposed to lost updates
crates/openshell-server/src/grpc/sandbox.rsget sandbox by name -> mutate providers -> put_message.crates/openshell-server/src/compute/mod.rscrates/openshell-server/src/grpc/policy.rscrates/openshell-server/src/grpc/provider.rsExisting patterns to reuse
(object_type, scope, version)key and bounded retries.Proposed Design
Add an authoritative
resource_versionto every generic object row and require mutating stored objects through compare-and-swap writes keyed by stable object identity (id) plus expectedresource_version.This should be a store-level correctness primitive, not a sandbox-specific lock.
Core design
resource_versionto theobjectstable in both SQLite and Postgres.resource_versionas authoritative.resource_version = 1.WHERE id = ? AND resource_version = ?, and atomically bump to the next version.WHERE id = ? AND resource_version = ?pattern.resource_version, not trust whatever is currently embedded in the stored payload.id.Scope boundary
This issue should implement single-object CAS and migrate the high-risk sandbox-related mutation paths first. It should not try to redesign the entire persistence model or normalize every resource into dedicated tables.
Policy revision storage should stay append-only as-is. It already has the right optimistic-concurrency pattern for versioned policy rows.
API Shape
Proto / object metadata
Add an actual resource-version field to shared object metadata.
Notes:
Persistence API
Add explicit CAS primitives instead of a generic mutation DSL.
Suggested store methods:
Typed wrappers should provide the same semantics for protobuf objects, and typed reads should hydrate
metadata.resource_versionfrom the DB row.Pragmatic constraint:
put/put_messagetemporarily for bootstrap and legacy paths.SQLite / Postgres Implementation Details
Schema
Add a new non-null column in both backends:
resource_version INTEGER NOT NULL DEFAULT 1resource_version BIGINT NOT NULL DEFAULT 1No new secondary index should be necessary. Updates will use the primary key
id, plus a version check in the predicate.Insert path
For
MustCreate:resource_version = 1.idprimary key(object_type, name)unique whenname IS NOT NULLUpdate path
For
MatchResourceVersion(v):object_type + id + expected resource_version.payload,labels,updated_at_ms.resource_version = resource_version + 1.RETURNING resource_version, created_at_ms, updated_at_mswhere supported to avoid a second round trip on success.If the update affects zero rows:
object_type + id.NotFoundif the row is gone.Conflict { current_resource_version }if the row still exists but the version changed.Delete path
For CAS deletes:
DELETE ... WHERE object_type = ? AND id = ? AND resource_version = ?.Read path
Extend
ObjectRecordwithresource_version.When decoding a protobuf object:
metadata.resource_versionfrom the DB row before returning it.This matters because older payloads will not contain the field even after the schema migration.
Migration / Backcompat Needs
Database migration
Add a migration for both SQLite and Postgres:
1.No payload rewrite is required.
Proto / client compatibility
metadata.resource_versionis backward-compatible in protobuf.Get*/List*responses without needing request changes.Internal compatibility
get_message*should populateresource_versionfrom the row so callers do not need to know whether a payload predates the migration.Mixed-version rollout caveat
Old binaries will still perform unconditional full-row overwrites without version checks. Because of that:
Retry Semantics
The DB layer should not silently retry stale writes. A stale write must surface as a typed conflict so the caller can re-read and re-apply its semantic merge.
Recommended behavior by call site:
id, re-apply derived status/phase changes to the latest row.ReportPolicyStatussandbox updates:idwhen settingcurrent_policy_version.settings_mutexwith CAS on the settings object row.Recommended retry budget:
tokio::task::yield_now()or small jitter between attempts.ABORTEDafter exhaustion for gRPC handlers that cannot safely continue.Rollout Plan
resource_versiontoObjectMeta,ObjectRecord, and both DB schemas.metadata.resource_versionon all typed reads.spec.policycurrent_policy_versionDefinition of Done
objects.resource_versionexists in both SQLite and Postgres, with migration/backfill coverage.openshell.datamodel.v1.ObjectMetaincludesresource_version.ObjectRecordincludesresource_version.metadata.resource_versionfrom the DB row.put_message.spec.policybackfill andcurrent_policy_version) use CAS.Notable Risks / Open Questions
delete_provider_record()can still race with sandbox attach unless we add a higher-level transactional composition or redesign the attachment relationship.putcurrently updates by(object_type, name)rather than by stableid. Migrated call sites must stop depending on name-based overwrite semantics.expected_resource_versionrequest fields. This issue can defer that if gateway-internal retries are sufficient for current APIs.Scope Assessment
featCreated from a focused principal-engineer spike. No code changes were made as part of the spike.