What breaks and what doesn't | devopsellence blog

The previous articles in this series described how devopsellence delegates secrets, deploy state, images, signing, and identity to GCP services. This article answers the practical question: what actually happens when each layer fails?

The answer is specific and testable for each failure mode. The design principle is that failures cascade outward from the runtime, never inward. The control plane is the furthest component from your users. GCP services are in the middle. Your servers and the agent are closest. A failure at the outer edge should never reach the inner edge.

Control plane down

The control plane is a Rails application. It can go down for many reasons: bad deploy, database migration, transient infrastructure issue, or planned maintenance.

What keeps working:

Running applications continue serving traffic. Agents don't need the control plane to operate — they read desired state from GCS, secrets from Secret Manager, and images from Artifact Registry.
Envoy continues routing HTTPS traffic. TLS certificates are local to the server.
Agents continue their 2-second reconciliation loop, verifying cached state.

What pauses:

New deploys cannot be published (the CLI can't reach the control plane).
Environment changes, node assignments, and secret management stall.
Token refresh will eventually fail. Access tokens have a 1-hour TTL, and refresh tokens need the control plane to rotate. But agents cache GCP access tokens independently (~1 hour), so GCP access continues for at least that window.
New subject tokens for Workload Identity exchange cannot be issued. Existing GCP tokens continue working until they expire.

Recovery: When the control plane comes back, agents resume normal operation. No manual intervention needed. The reconciliation loop picks up where it left off.

Cloud Storage impaired

GCS holds the signed desired-state envelopes that agents poll every 2 seconds.

What keeps working:

Agents serve from cached state. The in-memory cache holds the most recent verified envelope. The disk cache (desired-state-cache.json) survives agent restarts.
Running containers continue. Health checks continue. Traffic keeps flowing.

What pauses:

New deployments that have been published but not yet fetched by agents will be delayed until GCS recovers.
Agents log cache hits and continue. They don't crash or stop serving.

Recovery: Agents pick up the latest envelope from GCS on the next successful poll. Sequence numbers ensure they never apply stale state — they'll skip straight to the newest envelope.

Secret Manager impaired

Secret Manager holds runtime secret values that agents resolve during deployment.

What keeps working:

Running containers already have their secrets injected as environment variables. A Secret Manager outage doesn't affect containers that are already running.
The agent's reconciliation loop continues. Existing deployments are unaffected.

What pauses:

New deployments that require secret resolution will stall. The agent will retry on the next reconciliation cycle (2 seconds).
If a container restarts (crash, health check failure), the agent needs to resolve secrets again to recreate it. If Secret Manager is down at that moment, the container recreation stalls.

Recovery: Agents resolve secrets on the next successful Secret Manager call and proceed with reconciliation.

Artifact Registry impaired

Artifact Registry stores Docker images.

What keeps working:

If the image is already pulled on the node, the agent uses the local copy. Docker caches images locally, so existing deployments don't need to re-pull.
Running containers continue. Images are needed at pull time, not at run time.

What pauses:

New deployments with a new image tag will stall until the pull succeeds.
If a node is freshly provisioned and has no cached images, the first deployment will wait for Artifact Registry.

Agent restarts

The agent process on your server can restart (OOM kill, manual restart, server reboot).

What happens:

The disk cache (desired-state-cache.json) is loaded on startup. The agent immediately knows what it should be running.
If GCS and Secret Manager are reachable, the agent reconciles to the current state within one cycle (2 seconds).
If GCS is unreachable, the agent falls back to the disk cache and runs containers from cached state.
Envoy is restarted and configured via xDS. Traffic resumes after health checks pass.

The gap between agent restart and traffic recovery depends on container start time and health check configuration. For a typical web app, this is seconds, not minutes.

The cascade principle

Each failure mode is contained to its layer:

Control plane fails

Management pauses. Running apps unaffected. No customer traffic impact.

GCP service fails

New deploys stall. Running apps continue from cache. Customer traffic continues.

Agent restarts

Brief traffic gap during container startup. Recovers from disk cache in seconds.

Failures at the outer edge (control plane) have the least impact on users. Failures at the inner edge (agent/server) have the most, but recover the fastest. This is the opposite of a monolithic architecture where a single database outage takes down everything.

The three-layer split doesn't prevent failures. It contains them. Your users don't care if the control plane is down for 10 minutes, as long as their apps keep serving traffic. That's the property this architecture is designed to provide.

Next: Part 7 — What's next: standalone mode and multi-provider support