Incident #INC-2847 — Root cause analysis, fix evaluation, and hardening plan
This document is a component showcase. Some sections exist solely to demonstrate specific Prism components and may feel disconnected from the narrative. In a real document, these elements would be woven into the flow naturally — or omitted entirely.
Two decisions came out of this incident: how to fix the immediate cascade failure, and how to prevent similar failures long-term. A third option (full service mesh migration) was evaluated and deferred.
Deploy Hystrix-style circuit breakers at every service boundary. When the inventory service fails, the order service stops retrying after 3 attempts and returns a graceful degradation response instead of cascading the failure downstream.
Decouple the order pipeline from synchronous inventory checks by introducing a message queue. Orders would be accepted immediately and processed asynchronously, which eliminates the tight coupling that caused the cascade.
Istio would provide circuit breaking, retries, and observability out of the box, but the migration cost is estimated at 6 engineer-weeks. The team agreed to revisit after Q3 when the platform team finishes the Kubernetes migration.
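The retry cap described for the circuit breaker option can be sketched as follows. `callWithRetryCap` and `fallback` are hypothetical names used only for illustration. The sketch assumes "stops retrying after 3 attempts" means at most three calls in total, and it omits any backoff between attempts.

```typescript
// Hypothetical helper: try `call` up to `maxAttempts` times, then degrade.
async function callWithRetryCap<T>(
  call: () => Promise<T>,
  fallback: () => T,
  maxAttempts: number = 3,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await call();
    } catch {
      // Swallow the error and try again until the cap is reached.
    }
  }
  // Every attempt failed: return a degraded response instead of
  // propagating the failure to the caller.
  return fallback();
}
```

A real implementation would likely log each failed attempt and distinguish retryable errors from permanent ones.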
The order pipeline has two paths: the hot path (every checkout request) and the cold path (async fulfillment, triggered after payment confirmation).
The order service calls the inventory service synchronously to reserve stock. The core logic is in
export async function processCheckout(cart: Cart, userId: string) {
  const validated = await validateCart(cart);
  // validated = { items: [{ sku: "WH-1000XM5", qty: 2, price: 349.99 }], total: 699.98 }
  if (!validated.ok) throw new OrderError('invalid_cart');
  const reservation = await inventoryClient.reserve(validated.items);
  // reservation = { reservationId: "res_7f3a", expiresAt: "2026-01-15T10:47:00Z" }
  if (!reservation.success) throw new OrderError('out_of_stock');
  const payment = await paymentClient.createIntent({
    amount: validated.total,
    currency: 'USD',
    metadata: { reservationId: reservation.reservationId }
  });
  return { orderId: payment.orderId, clientSecret: payment.clientSecret };
}
The synchronous inventoryClient.reserve() call at line 64 is where the cascade failure originated. No timeout was configured, so a hung inventory service blocked the entire order pipeline.
The fulfillment worker dequeues orders from
export async function processFulfillment(job: FulfillmentJob) {
  const order = await db.order.findById(job.orderId);
  if (!order || order.status !== 'paid') return;
  const shipment = await warehouseClient.createShipment({
    items: order.items,
    address: order.shippingAddress,
    priority: order.expedited ? 'express' : 'standard',
  });
  await db.order.update(order.id, {
    status: 'shipped',
    trackingNumber: shipment.trackingId,
  });
}
The fulfillment path was unaffected by this incident because it runs asynchronously via a message queue — exactly the pattern proposed for the hot path.
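Applied to checkout, that pattern might look like the in-memory sketch below. `OrderQueue`, `acceptOrder`, and `drain` are hypothetical names, and a plain array stands in for the real message queue. The sketch does not model persistence, retries, or concurrent workers.

```typescript
type OrderStatus = 'accepted' | 'reserved' | 'failed';

interface QueuedOrder {
  orderId: string;
  items: string[];
  status: OrderStatus;
}

class OrderQueue {
  private pending: QueuedOrder[] = [];
  private nextId = 1;

  // Hot path: record the order and return at once, without calling inventory.
  acceptOrder(items: string[]): QueuedOrder {
    const order: QueuedOrder = { orderId: `ord_${this.nextId++}`, items, status: 'accepted' };
    this.pending.push(order);
    return order;
  }

  // Cold path: a worker reserves stock for each queued order, one at a time.
  // A failing reservation marks that order as failed and does not block the rest.
  async drain(reserve: (items: string[]) => Promise<void>): Promise<void> {
    let order = this.pending.shift();
    while (order !== undefined) {
      try {
        await reserve(order.items);
        order.status = 'reserved';
      } catch {
        order.status = 'failed';
      }
      order = this.pending.shift();
    }
  }
}
```

The point of the design is that `acceptOrder` never waits on inventory, so a hung inventory service delays fulfillment but cannot stall checkout.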
Circuit breakers solve the immediate cascade risk but introduce new operational concerns — mainly around tuning thresholds and handling partial failures gracefully.
Response time comparison — current state vs. projected with circuit breakers:
Investigation revealed three issues, ranked by severity.
The inventoryClient.reserve() call has no configured timeout. When the inventory service became unresponsive, every in-flight checkout request blocked indefinitely, exhausting the connection pool and cascading to all downstream services.
The inventory service deployed an index creation migration that acquired a SHARE lock on the stock_reservations table for 4 minutes. All concurrent INSERT and UPDATE queries blocked waiting for the lock to release.
The checkout handler in
const reservation = await inventoryClient.reserve(validated.items);
// inventoryClient has no deadline/timeout configured
// gRPC default: wait forever
// At 14:23:01 UTC, this call hangs — the inventory DB is locked
The request handler stalls while awaiting the response. New requests keep arriving and stall too.
After 45 seconds, all 100 connections in the pool are occupied by hanging requests:
export const inventoryClient = new InventoryClient(
  process.env.INVENTORY_SERVICE_URL,
  grpc.credentials.createInsecure(),
  {
    'grpc.max_concurrent_streams': 100,
    // No deadline configured
    // No keepalive configured
  }
);
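The pool figures above allow a rough capacity check. The sketch below assumes each hung request held exactly one connection and that none were released, so filling 100 connections in 45 seconds implies an average arrival rate of about 100 / 45 requests per second. It then applies Little's law (in-flight requests = arrival rate × hold time) to estimate how many connections a bounded hold time would occupy. The 5-second hold time is an assumed per-call budget, and the results are back-of-envelope estimates rather than measurements.

```typescript
const POOL_SIZE = 100;
const SECONDS_TO_EXHAUSTION = 45;

// Average arrival rate implied by the pool filling up (requests per second).
const arrivalRate = POOL_SIZE / SECONDS_TO_EXHAUSTION;

// Little's law: average in-flight requests = arrival rate * time each request is held.
function inFlight(ratePerSecond: number, holdSeconds: number): number {
  return ratePerSecond * holdSeconds;
}

const ASSUMED_HOLD_SECONDS = 5; // assumption: each call is cut off after 5s
const occupied = inFlight(arrivalRate, ASSUMED_HOLD_SECONDS);

console.log(arrivalRate.toFixed(2)); // "2.22"
console.log(Math.ceil(occupied));    // 12
```

Under these assumptions, bounded calls would hold roughly 12 of the 100 connections during an inventory outage instead of all of them.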
Root cause: no deadline option passed to call-site invocations. The client waits forever for a response that will never come. Fix: pass { deadline: Date.now() + 5000 } as call options on every gRPC invocation.
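As a minimal sketch of that fix, the helper below builds per-call options with a fresh deadline. `withDeadline` and the `DeadlineOptions` shape are hypothetical names for illustration. The sketch assumes the client accepts a `deadline` field in its call options, as the fix describes, and that 5000 ms is the chosen budget.

```typescript
// Hypothetical options shape mirroring the `{ deadline: ... }` object in the fix.
interface DeadlineOptions {
  deadline: number; // absolute time in ms since the Unix epoch
}

// Builds call options whose deadline is `budgetMs` from `now`.
function withDeadline(budgetMs: number, now: number = Date.now()): DeadlineOptions {
  if (!Number.isFinite(budgetMs) || budgetMs <= 0) {
    throw new RangeError(`budgetMs must be a positive number, got ${budgetMs}`);
  }
  return { deadline: now + budgetMs };
}

// Compute the deadline at each call site, never once at module load:
// a deadline created at startup would already be in the past for later calls.
const callOptions = withDeadline(5000);
```

Centralizing the computation in one helper makes it harder for a new call site to ship without a deadline.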
Even with timeouts, a sustained inventory outage will generate thousands of timeout errors. A circuit breaker in
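// Bind the method so `reserve` keeps its `this` when the breaker invokes it
const breaker = new CircuitBreaker(inventoryClient.reserve.bind(inventoryClient), {
  timeout: 5000, // 5s per call
  errorThresholdPercentage: 50, // open when 50% of calls fail
  resetTimeout: 30000, // try again after 30s
});
const reservation = await breaker.fire(validated.items);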
The migration that triggered this incident used CREATE INDEX without CONCURRENTLY, acquiring a SHARE lock on a hot table that blocked all concurrent writes. Migration tooling should enforce CONCURRENTLY for indexes on tables above 100K rows.
-- This migration locked stock_reservations for ~4 minutes
CREATE INDEX idx_reservations_expiry
ON stock_reservations (expires_at);
-- Should have been:
-- CREATE INDEX CONCURRENTLY idx_reservations_expiry
-- ON stock_reservations (expires_at);
-- (CONCURRENTLY avoids the SHARE lock but cannot run inside a transaction)
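One way to enforce this in migration tooling is a lint step that scans migration SQL before it runs. The sketch below is a simplified, regex-based check. `findBlockingIndexStatements` is a hypothetical name, the scan only strips `--` line comments and splits on semicolons, and it does not consult table sizes, so the 100K-row threshold mentioned above would need a separate lookup.

```typescript
// Returns every CREATE INDEX statement that lacks CONCURRENTLY.
function findBlockingIndexStatements(sql: string): string[] {
  // Drop `--` line comments so commented-out examples are not flagged.
  const withoutComments = sql
    .split('\n')
    .map((line) => line.replace(/--.*$/, ''))
    .join('\n');

  return withoutComments
    .split(';')
    .map((statement) => statement.trim().replace(/\s+/g, ' '))
    .filter((statement) => /^CREATE\s+(UNIQUE\s+)?INDEX\b/i.test(statement))
    .filter((statement) => !/^CREATE\s+(UNIQUE\s+)?INDEX\s+CONCURRENTLY\b/i.test(statement));
}

const flagged = findBlockingIndexStatements(
  'CREATE INDEX idx_reservations_expiry ON stock_reservations (expires_at);',
);
console.log(flagged.length); // 1
```

A production linter would more likely parse the SQL properly, since this approach can be confused by semicolons or `--` sequences inside string literals.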
Four API endpoints were reviewed for the 23-minute outage window; three were impacted and one was unaffected.
| Endpoint | Impact | Status |
|---|---|---|
| POST /api/orders | 100% failure: all checkout requests timed out | |
| GET /api/inventory/:sku | Stock queries hung on locked table | |
| POST /api/payments/intent | No new intents created (upstream blocked) | |
| GET /api/orders/:id | Read-only, unaffected | |
Create a new order from a validated shopping cart.
Request body:
{
  "cartId": "cart_8f3a2b",
  "shippingAddress": { "line1": "123 Main St", "city": "Portland", "state": "OR", "zip": "97201" },
  "paymentMethod": "pm_visa_4242"
}
Response:
{
  "orderId": "ord_9c4d1e",
  "status": "pending_payment",
  "clientSecret": "pi_3Mx_secret_abc123",
  "expiresAt": "2026-01-15T10:47:00Z"
}
Five phases from trigger to full resolution. We are currently in phase 4 — the circuit breaker implementation is in progress.
Rollback steps for each remediation phase:
Irreversible step: Once the async queue is deployed and processing live orders, reverting to synchronous checkout requires draining all in-flight messages. Coordinate with oncall before attempting.
The fix touches 4 modules across 11 files.
These services have direct dependencies on the order pipeline.
Manages stock reservations and warehouse sync. The cascade failure originated here when a migration locked the stock_reservations table.
Processes Stripe payment intents. Was indirectly affected — no new intents could be created because the order service was blocked upstream.
Breakdown of error types during the 23-minute outage window, based on API Gateway logs:
Three approaches were evaluated. The circuit breaker approach was adopted as the immediate fix; async decoupling is planned for Q2.
Wrap every cross-service call in a circuit breaker (opossum library). When failures exceed the threshold, the breaker trips and returns a fallback immediately — no more cascading waits.
Decouple the checkout flow entirely. Accept orders synchronously, then process inventory reservation + payment asynchronously via a message queue (BullMQ on Redis).
Deploy Istio service mesh with built-in circuit breaking, retry policies, and distributed tracing. Handles the problem at the infrastructure layer.
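To make the circuit breaker approach concrete, here is a minimal sketch of the closed / open / half-open state machine that such a breaker implements. It is an illustration only and not the opossum implementation: `BreakerState`, `SimpleBreaker`, and the consecutive-failure threshold are assumptions chosen for brevity, and the clock is injected so the behavior can be checked without timers.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class SimpleBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Returns true if a call may go out. Once the reset timeout has elapsed,
  // an open breaker moves to half-open and lets a single probe through.
  allowRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open';
      return true;
    }
    return this.state === 'closed';
  }

  // A successful call (including a half-open probe) closes the breaker.
  recordSuccess(): void {
    this.state = 'closed';
    this.failures = 0;
  }

  // A failed probe reopens at once; otherwise open when the threshold is hit.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

While the breaker is open, callers skip the remote call entirely and return a fallback, which is what stops the cascading waits.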
Failure modes beyond the primary cascade scenario. Handled cases reference their guard code; unhandled cases are flagged.
Inventory reserves 2 of 3 items, then times out on the third —
Two users check out the last item simultaneously, handled with SELECT FOR UPDATE SKIP LOCKED:
const stock = await db.query(
`SELECT * FROM stock_reservations
WHERE sku = $1 AND status = 'available'
FOR UPDATE SKIP LOCKED
LIMIT $2`, [sku, quantity]
);
Service recovers briefly then fails again — handled. The half-open state in
gRPC deadline computed on client clock, evaluated on server clock — handled. Deadlines use relative duration (5000ms from now), not absolute timestamps.
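Clock skew is tolerable here because gRPC carries the deadline over the wire as a remaining duration in the `grpc-timeout` header, so the server measures it against its own clock. The sketch below illustrates that conversion. `toGrpcTimeout` is a hypothetical helper written for this document, not part of any gRPC library, and it only emits the millisecond unit.

```typescript
// Converts an absolute client-side deadline into a relative grpc-timeout
// header value such as "5000m" (the "m" unit means milliseconds).
// Returns null when the deadline has already passed, in which case a
// client would fail the call locally instead of sending it.
function toGrpcTimeout(deadlineMs: number, nowMs: number = Date.now()): string | null {
  const remaining = Math.ceil(deadlineMs - nowMs);
  if (remaining <= 0) {
    return null;
  }
  // The wire format allows at most 8 digits for the value.
  const capped = Math.min(remaining, 99999999);
  return `${capped}m`;
}

console.log(toGrpcTimeout(6000, 1000)); // "5000m"
```

Only the client's clock is involved in computing the remaining time, which is why a skewed server clock does not shorten or lengthen the budget.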
Test status by functional area. Critical gaps flagged in red.
it('creates order from valid cart')
it('rejects checkout when item unavailable')
it('validates cart structure')
it('opens circuit after 5 failures')
it('sends probe in half-open state')
describe('CircuitBreaker', () => {
  it('opens circuit after 5 failures', async () => {
    const mockFn = vi.fn().mockRejectedValue(new Error('timeout'));
    const breaker = new CircuitBreaker(mockFn, { errorThreshold: 5 });
    for (let i = 0; i < 5; i++) {
      await expect(breaker.fire()).rejects.toThrow('timeout');
    }
    // 6th call should fail immediately without calling mockFn
    await expect(breaker.fire()).rejects.toThrow('circuit open');
    expect(mockFn).toHaveBeenCalledTimes(5); // not 6
  });
  it('sends probe in half-open state', async () => {
    // ... setup breaker in open state, advance timers past resetTimeout
    vi.advanceTimersByTime(30000);
    await breaker.fire(); // should call mockFn once
    expect(mockFn).toHaveBeenCalledTimes(1);
  });
});
it('rejects CREATE INDEX without CONCURRENTLY')
Tune the circuit breaker and timeout thresholds below. Export when ready.
When the
The same word means different things depending on context — not a value judgment, just a semantic distinction.
DEADLINE_EXCEEDED. The server may still be processing.
Explaining the circuit breaker pattern through a familiar concept.
How the order service should handle an inventory call failure — the branching logic after adding the circuit breaker.
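A minimal sketch of that branching, under stated assumptions: the outcome kinds, HTTP status codes, message strings, and the 30-second retry hint are hypothetical choices for illustration and are not taken from the order service.

```typescript
// Hypothetical outcomes of an inventory call made through the breaker.
type InventoryOutcome =
  | { kind: 'reserved'; reservationId: string }
  | { kind: 'out_of_stock' }
  | { kind: 'timeout' }
  | { kind: 'breaker_open' };

interface CheckoutResponse {
  status: number;
  body: { message: string; reservationId?: string };
  retryAfterSeconds?: number;
}

function toCheckoutResponse(outcome: InventoryOutcome): CheckoutResponse {
  switch (outcome.kind) {
    case 'reserved':
      return { status: 200, body: { message: 'reserved', reservationId: outcome.reservationId } };
    case 'out_of_stock':
      // A business failure, not an outage: report it and do not retry.
      return { status: 409, body: { message: 'out_of_stock' } };
    case 'timeout':
      // The server may still be processing, so the result is unknown.
      return { status: 504, body: { message: 'inventory_timeout' } };
    case 'breaker_open':
      // Fail fast with a degraded response instead of waiting on the dependency.
      return { status: 503, body: { message: 'temporarily_unavailable' }, retryAfterSeconds: 30 };
  }
}
```

Keeping the mapping in one pure function makes each branch easy to unit test without a running inventory service.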
1,247 orders received 502 errors during the 23-minute window. Customers saw a "temporarily unavailable" message. No partial orders were created — the failure occurred before payment intent creation, so no charges were made. Customers can simply retry checkout.
Yes — currently only the inventory service has a circuit breaker. The fix in
export function withBreaker<T>(
  name: string,
  fn: (...args: any[]) => Promise<T>,
  opts?: Partial<BreakerOptions>
): (...args: any[]) => Promise<T> {
  const breaker = new CircuitBreaker(fn, {
    timeout: opts?.timeout ?? 5000,
    errorThresholdPercentage: opts?.errorThreshold ?? 50,
    resetTimeout: opts?.resetTimeout ?? 30000,
  });
  breaker.on('open', () => metrics.increment(`circuit.${name}.open`));
  return (...args) => breaker.fire(...args);
}
Deployment coordination: The circuit breaker config must be deployed to all order service pods simultaneously. A rolling deploy with mixed versions will cause inconsistent behavior — some pods trip the breaker while others keep retrying.
Multi-dimensional assessment of the proposed fix. Each track is an independent review dimension, not a sequential step.
The withBreaker factory has metric hooks but the Datadog exporter is not wired up. Add breaker.on('open', ...) callbacks in the factory.
CONCURRENTLY, the root trigger can recur.
The root cause conclusion is supported by three independent data sources.
CREATE INDEX without CONCURRENTLY that held a SHARE lock on the stock_reservations table for 4 minutes.
pg_stat_activity showed 100 queries in wait_event_type = Lock state on the stock_reservations table during the incident window. Lock holder was the migration process (PID 48291).
grpc.client.duration for inventory.Reserve spiking to 30,000ms (gRPC applies no deadline by default, so calls ran until the HTTP/2 idle timeout cut them off). No DEADLINE_EXCEEDED errors appeared because no deadline was set.
POST /api/orders returned 502 between 14:23:01 and 14:46:18 UTC. Zero 502s in the preceding 24 hours.