The previous articles in this series described how devopsellence delegates secrets, deploy state, images, signing, and identity to GCP services. This article answers the practical question: what actually happens when each layer fails?

The answer is specific and testable for each failure mode. The design principle is that failures cascade outward from the runtime, never inward. The control plane is the furthest component from your users. GCP services are in the middle. Your servers and the agent are closest. A failure at the outer edge should never reach the inner edge.

Control plane down

The control plane is a Rails application. It can go down for many reasons: bad deploy, database migration, transient infrastructure issue, or planned maintenance.

What keeps working:

- Running apps keep serving traffic. Nothing on your servers depends on the control plane at runtime.
- Agents keep polling GCS and reconciling against the last signed desired-state envelope.

What pauses:

- Management operations: new deploys, config changes, anything that publishes new desired state. The control plane is what writes new envelopes, so nothing new ships until it returns.

Recovery: When the control plane comes back, agents resume normal operation. No manual intervention needed. The reconciliation loop picks up where it left off.
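The shape of that loop can be sketched in a few lines. This is a toy, not the real agent: `fetch_envelope` and `apply_state` are hypothetical stand-ins, and the fallback-to-cache behavior is the point being illustrated.

```python
import time

def reconcile_loop(fetch_envelope, apply_state, cached=None,
                   poll_interval=2.0, max_polls=None, sleep=time.sleep):
    """Toy reconciliation loop. The agent's upstream is GCS, not the
    control plane; if a fetch fails, it re-applies the last cached
    envelope, so running apps are untouched."""
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            cached = fetch_envelope()   # normally: poll every 2 seconds
        except ConnectionError:
            pass                        # upstream unreachable: keep last-known state
        if cached is not None:
            apply_state(cached)         # idempotent, safe to re-apply
        sleep(poll_interval)
        polls += 1
    return cached
```

Because applying the cached envelope is idempotent, the loop needs no special "outage mode": a down upstream just means the same state gets re-applied until a fetch succeeds again.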

Cloud Storage impaired

GCS holds the signed desired-state envelopes that agents poll every 2 seconds.

What keeps working:

- Running containers and their traffic.
- Agents that cannot reach GCS fall back to the envelope cached on local disk and keep reconciling against it.

What pauses:

- Propagation of new desired state. A deploy published during the outage reaches agents only on their next successful poll.

Recovery: Agents pick up the latest envelope from GCS on the next successful poll. Sequence numbers ensure they never apply stale state — they'll skip straight to the newest envelope.
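A minimal sketch of that stale-state guard, assuming each envelope carries a monotonically increasing sequence field (the name `seq` here is an assumption, not the actual schema):

```python
def newest_envelope(envelopes, last_applied_seq):
    """Sketch of the sequence-number check. After an outage the agent
    ignores anything at or below the last sequence it applied and jumps
    straight to the newest envelope, never replaying intermediate or
    stale states."""
    fresh = [e for e in envelopes if e["seq"] > last_applied_seq]
    return max(fresh, key=lambda e: e["seq"]) if fresh else None
```

For example, an agent that last applied sequence 4 and finds envelopes 4, 5, and 6 after an outage applies 6 directly, skipping 5.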

Secret Manager impaired

Secret Manager holds runtime secret values that agents resolve during deployment.

What keeps working:

- Running apps. Their secret values were resolved at deploy time, so a Secret Manager outage doesn't touch them.
- Reconciliation that doesn't require resolving new secret values.

What pauses:

- New deploys that need to resolve secrets. The agent waits and retries rather than deploying with missing values.

Recovery: Agents resolve secrets on the next successful Secret Manager call and proceed with reconciliation.
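One common way to implement "proceed on the next successful call" is a retry with exponential backoff. This is an illustrative wrapper, not devopsellence's actual client code; `fetch` stands in for a Secret Manager read, and the attempt counts and delays are assumed defaults.

```python
import time

def resolve_secret(fetch, name, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry a secret read with exponential backoff. Each failed attempt
    doubles the wait, so a brief Secret Manager blip delays a deploy by
    seconds rather than failing it outright."""
    for attempt in range(attempts):
        try:
            return fetch(name)
        except ConnectionError:
            if attempt == attempts - 1:
                raise                   # give up only after the final attempt
            sleep(base_delay * (2 ** attempt))
```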

Artifact Registry impaired

Artifact Registry stores Docker images.

What keeps working:

- Running containers. Their image layers are already on the node's disk.
- Restarts of existing containers, for the same reason.

What pauses:

- Deploys that need to pull an image not already present on the server.

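The reason a registry outage only stalls new deploys can be shown in a few lines. This is a hypothetical sketch (`local_digests` and `pull` are stand-ins): anything already deployed is served from the local image store, so only a ref the node has never seen touches Artifact Registry.

```python
def resolve_image(ref, local_digests, pull):
    """Prefer the image already on disk; only reach the registry for a
    ref the node has never pulled. During a registry outage, restarts of
    existing containers therefore still work."""
    if ref in local_digests:
        return local_digests[ref]       # served from the local image store
    return pull(ref)                    # reaches the registry; stalls if it is down
```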
Agent restarts

The agent process on your server can restart (OOM kill, manual restart, server reboot).

What happens:

- The agent comes back up, reads its cached desired state from local disk, and re-converges the node without waiting for the control plane or GCS.
- If containers also went down (for example, on a server reboot), the agent restarts them; traffic resumes once their health checks pass.

The gap between agent restart and traffic recovery depends on container start time and health check configuration. For a typical web app, this is seconds, not minutes.
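That gap is bounded by the health-check gate, which can be sketched as follows. `check` is a hypothetical probe (for example, an HTTP GET against the app), and the timeout and interval are assumed defaults, not devopsellence's actual configuration.

```python
import time

def wait_until_healthy(check, timeout=30.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Traffic is admitted only once the probe passes, so the
    restart-to-traffic gap is container start time plus probe cadence."""
    deadline = clock() + timeout
    while clock() < deadline:
        if check():
            return True                 # healthy: start routing traffic
        sleep(interval)
    return False                        # never became healthy within timeout
```

Tightening the probe interval shrinks the gap at the cost of more probe traffic, which is why a fast-starting web app recovers in seconds.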

The cascade principle

Each failure mode is contained to its layer:

- Control plane fails: management pauses. Running apps unaffected. No customer traffic impact.
- GCP service fails: new deploys stall. Running apps continue from cache. Customer traffic continues.
- Agent restarts: brief traffic gap during container startup. Recovers from disk cache in seconds.

Failures at the outer edge (control plane) have the least impact on users. Failures at the inner edge (agent/server) have the most, but recover the fastest. This is the opposite of a monolithic architecture where a single database outage takes down everything.

The three-layer split doesn't prevent failures. It contains them. Your users don't care if the control plane is down for 10 minutes, as long as their apps keep serving traffic. That's the property this architecture is designed to provide.

Next: Part 7 — What's next: standalone mode and multi-provider support