Order Pipeline Cascade Failure — Postmortem

Incident #INC-2847 — Root cause analysis, fix evaluation, and hardening plan

This document is a component showcase. Some sections exist solely to demonstrate specific Prism components and may feel disconnected from the narrative. In a real document, these elements would be woven into the flow naturally — or omitted entirely.


Key Decisions

Two decisions came out of this incident: how to fix the immediate cascade failure, and how to prevent similar failures long-term. A third option (full service mesh migration) was evaluated and deferred.

Deploy Hystrix-style circuit breakers at every service boundary. When the inventory service fails, the order service stops retrying after 3 attempts and returns a graceful degradation response instead of cascading the failure downstream.

Decouple the order pipeline from synchronous inventory checks by introducing a message queue. Orders would be accepted immediately and processed asynchronously, which eliminates the tight coupling that caused the cascade.

Istio would provide circuit breaking, retries, and observability out of the box, but the migration cost is estimated at 6 engineer-weeks. The team agreed to revisit after Q3 when the platform team finishes the Kubernetes migration. Status: deferred.


Architecture

The order pipeline has two paths: the hot path (every checkout request) and the cold path (async fulfillment, triggered after payment confirmation).

Hot path — checkout request

Client -> API Gateway -> Order Service -> Inventory Service

Full request lifecycle

Browser -> CDN -> Rate Limiter -> Auth -> API Gateway -> Order Service -> Inventory Service -> PostgreSQL -> Browser

The order service calls the inventory service synchronously to reserve stock. The core logic validates the cart, reserves inventory, then creates a payment intent:

export async function processCheckout(cart: Cart, userId: string) {
  const validated = await validateCart(cart);
  // validated = { items: [{ sku: "WH-1000XM5", qty: 2, price: 349.99 }], total: 699.98 }
  if (!validated.ok) throw new OrderError('invalid_cart');

  const reservation = await inventoryClient.reserve(validated.items);
  // reservation = { reservationId: "res_7f3a", expiresAt: "2026-01-15T10:47:00Z" }
  if (!reservation.success) throw new OrderError('out_of_stock');

  const payment = await paymentClient.createIntent({
    amount: validated.total,
    currency: 'USD',
    metadata: { reservationId: reservation.reservationId }
  });
  return { orderId: payment.orderId, clientSecret: payment.clientSecret };
}

Payment confirmed -> Order Service updates status -> Fulfillment Worker -> Warehouse API confirms shipment

The fulfillment worker dequeues orders and coordinates with the warehouse API. Each job has a 30-second timeout and automatic retry with exponential backoff:

export async function processFulfillment(job: FulfillmentJob) {
  const order = await db.order.findById(job.orderId);
  if (!order || order.status !== 'paid') return;

  const shipment = await warehouseClient.createShipment({
    items: order.items,
    address: order.shippingAddress,
    priority: order.expedited ? 'express' : 'standard',
  });

  await db.order.update(order.id, {
    status: 'shipped',
    trackingNumber: shipment.trackingId,
  });
}
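
The 30-second timeout and exponential backoff described for fulfillment jobs can be sketched as follows. This is a minimal illustration, not the worker's actual configuration: withTimeout, backoffDelay, and retryWithBackoff are hypothetical helper names, and the base delay and cap are assumed values. A queue library would normally provide both behaviors through job options.

```typescript
// Hypothetical helpers for illustration; names and defaults are assumptions.

/** Rejects if `work` does not settle within `ms`. Does not cancel the underlying work. */
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

/** Wait before the next try after failed attempt `attempt` (1-based): base, 2x, 4x, ... capped. */
function backoffDelay(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}

/** Runs `fn` up to `attempts` times, each bounded by `timeoutMs`, sleeping between tries. */
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  opts: { attempts: number; timeoutMs: number; baseMs: number; maxMs: number },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await withTimeout(fn(), opts.timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < opts.attempts) {
        const delay = backoffDelay(attempt, opts.baseMs, opts.maxMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

With attempts: 5, baseMs: 1000, and maxMs: 30000, the waits between attempts would be 1s, 2s, 4s, and 8s. Because a retried job may have partially succeeded, the job body should be idempotent; processFulfillment's early return on a non-'paid' status is one such guard.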

Tradeoff Analysis

Circuit breakers solve the immediate cascade risk but introduce new operational concerns — mainly around tuning thresholds and handling partial failures gracefully.

Response time comparison — current state vs. projected with circuit breakers:


Known Issues

Investigation revealed three issues, ranked by severity.

The inventoryClient.reserve() call has no configured timeout. When the inventory service became unresponsive, every in-flight checkout request blocked indefinitely, exhausting the connection pool and cascading to all downstream services.

Trigger condition

The inventory service deployed an index creation migration that acquired a SHARE lock on the stock_reservations table for 4 minutes. All concurrent INSERT and UPDATE queries blocked waiting for the lock to release.

Order service calls reserve()

The checkout handler sends a gRPC call with no deadline:

const reservation = await inventoryClient.reserve(validated.items);
// inventoryClient has no deadline/timeout configured
// gRPC default: wait forever
// At 14:23:01 UTC, this call hangs — the inventory DB is locked

The request stalls waiting for a response. New requests keep arriving and stall too.

Connection pool exhaustion

After 45 seconds, all 100 connections in the pool are occupied by hanging requests:

export const inventoryClient = new InventoryClient(
  process.env.INVENTORY_SERVICE_URL,
  grpc.credentials.createInsecure(),
  {
    'grpc.max_concurrent_streams': 100,
    // No deadline configured
    // No keepalive configured
  }
);
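
A per-call deadline is the direct fix for the missing timeout. The sketch below assumes a promise-style client whose reserve method takes call options as a second argument; that interface is hypothetical, since the real inventoryClient signature is not shown here. In @grpc/grpc-js, call options accept a deadline given as a Date or a number of milliseconds since the epoch. The 5000ms default mirrors the breaker timeout used elsewhere in this document.

```typescript
// Assumed shape of per-call options (modeled on gRPC call options).
interface CallOptions {
  deadline: Date;
}

// Hypothetical promise-style client interface, for illustration only.
interface ReserveClient<Req, Res> {
  reserve(request: Req, options: CallOptions): Promise<Res>;
}

/** Converts a relative timeout into an absolute deadline on the caller's clock. */
function deadlineFromNow(ms: number, now: number = Date.now()): Date {
  return new Date(now + ms);
}

/** Issues the reserve call with a deadline so it can no longer wait forever. */
async function reserveWithDeadline<Req, Res>(
  client: ReserveClient<Req, Res>,
  request: Req,
  timeoutMs: number = 5000,
): Promise<Res> {
  return client.reserve(request, { deadline: deadlineFromNow(timeoutMs) });
}
```

Computing the deadline at call time from a relative duration keeps every request bounded by the same budget, and a call that exceeds it should fail with DEADLINE_EXCEEDED instead of holding a connection.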

Result

Even with timeouts, a sustained inventory outage will generate thousands of timeout errors. A circuit breaker should trip after 5 consecutive failures and return a cached response or graceful error.

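import CircuitBreaker from 'opossum';

// Wrap the call in an arrow function so reserve() keeps its `this` binding
const breaker = new CircuitBreaker(
  (items: typeof validated.items) => inventoryClient.reserve(items),
  {
    timeout: 5000,                 // fail any call that takes longer than 5s
    errorThresholdPercentage: 50,  // open once 50% of recent calls have failed
    resetTimeout: 30000,           // allow a probe request after 30s
  },
);
const reservation = await breaker.fire(validated.items);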

The migration that triggered this incident used CREATE INDEX without CONCURRENTLY, acquiring a SHARE lock on a hot table that blocked all concurrent writes. Migration tooling should enforce CONCURRENTLY for indexes on tables above 100K rows.

-- This migration locked stock_reservations for ~4 minutes
CREATE INDEX idx_reservations_expiry
  ON stock_reservations (expires_at);

-- Should have been:
-- CREATE INDEX CONCURRENTLY idx_reservations_expiry
--   ON stock_reservations (expires_at);
-- (CONCURRENTLY avoids the SHARE lock but cannot run inside a transaction)
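
A lint step in migration tooling could catch this class of migration before it runs. The sketch below is a minimal, assumption-laden version: findBlockingIndexStatements is a hypothetical function name, the parser is a naive split on semicolons that does not understand string literals or block comments, and it does not apply the 100K-row threshold, which would need table statistics from the database.

```typescript
interface LintFinding {
  statement: string;
  reason: string;
}

/** Flags CREATE INDEX statements that omit CONCURRENTLY. Naive parser, sketch only. */
function findBlockingIndexStatements(sql: string): LintFinding[] {
  // Drop `--` line comments so commented-out DDL is not flagged.
  const withoutComments = sql.replace(/--.*$/gm, '');
  const findings: LintFinding[] = [];

  for (const raw of withoutComments.split(';')) {
    const statement = raw.replace(/\s+/g, ' ').trim();
    if (statement === '') continue;

    const isCreateIndex = /^CREATE\s+(UNIQUE\s+)?INDEX\b/i.test(statement);
    const isConcurrent = /^CREATE\s+(UNIQUE\s+)?INDEX\s+CONCURRENTLY\b/i.test(statement);

    if (isCreateIndex && !isConcurrent) {
      findings.push({
        statement,
        reason: 'CREATE INDEX without CONCURRENTLY blocks writes for the duration of the build',
      });
    }
  }
  return findings;
}
```

Run against the migration above, this returns one finding for idx_reservations_expiry and ignores the commented-out CONCURRENTLY variant. A CI job could fail the build whenever the returned list is non-empty.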

Affected Endpoints

Three of the four API endpoints below were impacted during the 23-minute outage window; GET /api/orders/:id was unaffected.

Endpoint | Impact | Status
POST /api/orders | 100% failure; all checkout requests timed out | Down
GET /api/inventory/:sku | Stock queries hung on locked table | Down
POST /api/payments/intent | No new intents created (upstream blocked) | Degraded
GET /api/orders/:id | Read-only, unaffected | Healthy

POST /api/orders

Create a new order from a validated shopping cart.

Request body

{
  "cartId": "cart_8f3a2b",
  "shippingAddress": { "line1": "123 Main St", "city": "Portland", "state": "OR", "zip": "97201" },
  "paymentMethod": "pm_visa_4242"
}

Response 201

{
  "orderId": "ord_9c4d1e",
  "status": "pending_payment",
  "clientSecret": "pi_3Mx_secret_abc123",
  "expiresAt": "2026-01-15T10:47:00Z"
}

Error codes


Incident Timeline

Five phases from trigger to full resolution. We are currently in phase 4 — the circuit breaker implementation is in progress.

Rollback steps for each remediation phase:

Irreversible step: Once the async queue is deployed and processing live orders, reverting to synchronous checkout requires draining all in-flight messages. Coordinate with oncall before attempting.


Impact Scope

The fix touches 4 modules across 11 files.


Related Components

These services have direct dependencies on the order pipeline.

Inventory service: manages stock reservations and warehouse sync. The cascade failure originated here when a migration locked the stock_reservations table.

Payment service: processes Stripe payment intents. It was indirectly affected; no new intents could be created because the order service was blocked upstream.


Error Distribution

Breakdown of error types during the 23-minute outage window, based on API Gateway logs:


Fix Approaches

Three approaches were evaluated. The circuit breaker approach was adopted as the immediate fix; async decoupling is planned for Q2.

Circuit breakers (adopted as the immediate fix)

Wrap every cross-service call in a circuit breaker (opossum library). When failures exceed the threshold, the breaker trips and returns a fallback immediately, with no more cascading waits.

Async queue decoupling (planned for Q2)

Decouple the checkout flow entirely. Accept orders synchronously, then process inventory reservation and payment asynchronously via a message queue (BullMQ on Redis).

Service mesh (deferred)

Deploy Istio service mesh with built-in circuit breaking, retry policies, and distributed tracing. Handles the problem at the infrastructure layer.
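
The async decoupling approach can be illustrated with an in-memory stand-in for the queue. This is a sketch of the accept-then-process shape only, not BullMQ's API: InMemoryOrderQueue and its methods are hypothetical, and a real deployment would need durable storage, retry limits, and a dead-letter path.

```typescript
interface OrderRequest {
  cartId: string;
}

interface AcceptedOrder {
  orderId: string;
  status: 'accepted';
}

interface QueuedJob {
  orderId: string;
  request: OrderRequest;
}

class InMemoryOrderQueue {
  private jobs: QueuedJob[] = [];
  private nextId = 1;

  /** Synchronous part of checkout: record the order and return without calling inventory. */
  accept(request: OrderRequest): AcceptedOrder {
    const orderId = `ord_${this.nextId++}`;
    this.jobs.push({ orderId, request });
    return { orderId, status: 'accepted' };
  }

  /** Asynchronous part: run the handler for each pending job; failed jobs are re-queued. */
  async drain(handler: (job: QueuedJob) => Promise<void>): Promise<number> {
    const pending = this.jobs;
    this.jobs = [];
    let processed = 0;
    for (const job of pending) {
      try {
        await handler(job);
        processed += 1;
      } catch {
        this.jobs.push(job); // keep for the next pass
      }
    }
    return processed;
  }

  depth(): number {
    return this.jobs.length;
  }
}
```

The point of the shape is that an inventory outage grows the queue depth instead of blocking checkout requests; orders are delayed but not rejected.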


Edge Cases

Failure modes beyond the primary cascade scenario. Handled cases reference their guard code; unhandled cases are flagged.

Partial failure

Inventory reserves 2 of 3 items, then times out on the third — Unhandled (reservation for the first 2 items is not released)
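
One possible way to close this gap is a compensating release: track each reservation as it succeeds and release all of them if a later item fails. The sketch below assumes a per-item client with reserveItem and release methods; both are hypothetical, since the real inventory API may reserve a whole cart in one call and may not expose a release operation.

```typescript
interface LineItem {
  sku: string;
  qty: number;
}

// Hypothetical per-item inventory client, for illustration only.
interface ItemReservationClient {
  reserveItem(item: LineItem): Promise<string>; // resolves to a reservation id
  release(reservationId: string): Promise<void>;
}

/** Reserves every item or none: on any failure, releases what was already held. */
async function reserveAllOrNothing(
  client: ItemReservationClient,
  items: LineItem[],
): Promise<string[]> {
  const held: string[] = [];
  try {
    for (const item of items) {
      held.push(await client.reserveItem(item));
    }
    return held;
  } catch (err) {
    // Best-effort compensation. A release that itself fails is skipped here
    // and left to lapse when the reservation expires.
    for (const id of held) {
      try {
        await client.release(id);
      } catch {
        // intentionally ignored in this sketch
      }
    }
    throw err;
  }
}
```

The original error is rethrown after cleanup, so the caller still sees the timeout and can map it to its usual error response.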

Concurrent checkout

Two users check out the last item simultaneously: handled. The reservation query uses SELECT FOR UPDATE SKIP LOCKED:

const stock = await db.query(
  `SELECT * FROM stock_reservations
   WHERE sku = $1 AND status = 'available'
   FOR UPDATE SKIP LOCKED
   LIMIT $2`, [sku, quantity]
);

Circuit breaker flapping

Service recovers briefly then fails again: handled. The half-open state allows only 1 test request through before fully closing the breaker.

Clock skew on deadline

gRPC deadline computed on client clock, evaluated on server clock — handled. Deadlines use relative duration (5000ms from now), not absolute timestamps.


Test Coverage

Test status by functional area. Gaps are marked Missing.

Checkout flow

Valid cart produces order + payment intent: it('creates order from valid cart')
Out-of-stock item returns 409: it('rejects checkout when item unavailable')
Invalid cart returns 400: it('validates cart structure')
Inventory timeout triggers circuit breaker fallback: Missing
Partial reservation rollback on timeout: Missing

Circuit breaker

Breaker trips after threshold failures: it('opens circuit after 5 failures')
Half-open state allows single test request: it('sends probe in half-open state')
Breaker reset under sustained partial failure: Missing

describe('CircuitBreaker', () => {
  it('opens circuit after 5 failures', async () => {
    const mockFn = vi.fn().mockRejectedValue(new Error('timeout'));
    const breaker = new CircuitBreaker(mockFn, { errorThreshold: 5 });

    for (let i = 0; i < 5; i++) {
      await expect(breaker.fire()).rejects.toThrow('timeout');
    }

    // 6th call should fail immediately without calling mockFn
    await expect(breaker.fire()).rejects.toThrow('circuit open');
    expect(mockFn).toHaveBeenCalledTimes(5); // not 6
  });

  it('sends probe in half-open state', async () => {
    // ... setup breaker in open state, advance timers past resetTimeout
    vi.advanceTimersByTime(30000);
    await breaker.fire(); // should call mockFn once
    expect(mockFn).toHaveBeenCalledTimes(1);
  });
});

Migration safety

Non-blocking DDL validation: it('rejects CREATE INDEX without CONCURRENTLY')
Advisory lock acquisition before DDL: Missing


Inline Term Definitions

When the circuit breaker detected sustained failures from the inventory service, it opened and began failing calls fast. After the reset timeout it entered the half-open state and sent a probe request to test recovery. Once the inventory service responded successfully, the breaker closed and normal traffic resumed.
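
To make the closed, open, and half-open transitions concrete, here is a minimal hand-rolled breaker. It is a teaching sketch, not the opossum implementation used in the fix: SimpleBreaker is a hypothetical class, it counts consecutive failures instead of an error percentage, and the clock is injectable purely so the behavior can be exercised without waiting.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class SimpleBreaker<A extends unknown[], R> {
  private state: BreakerState = 'closed';
  private consecutiveFailures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly fn: (...args: A) => Promise<R>;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    fn: (...args: A) => Promise<R>,
    failureThreshold: number = 5,
    resetTimeoutMs: number = 30000,
    now: () => number = () => Date.now(),
  ) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async fire(...args: A): Promise<R> {
    // After the reset timeout, an open breaker lets one probe through.
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open';
    }
    if (this.state === 'open' || (this.state === 'half-open' && this.probeInFlight)) {
      throw new Error('circuit open'); // fail fast without calling fn
    }

    const isProbe = this.state === 'half-open';
    if (isProbe) this.probeInFlight = true;
    try {
      const result = await this.fn(...args);
      this.consecutiveFailures = 0;
      if (isProbe) this.state = 'closed'; // probe succeeded: resume normal traffic
      return result;
    } catch (err) {
      this.consecutiveFailures += 1;
      // A failed probe reopens immediately; otherwise open at the threshold.
      if (isProbe || this.consecutiveFailures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw err;
    } finally {
      if (isProbe) this.probeInFlight = false;
    }
  }
}
```

Allowing a single probe in the half-open state is what limits flapping: a service that recovers briefly and fails again costs one request, not a full burst of traffic.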


Concept Contrast

The same word means different things depending on context — not a value judgment, just a semantic distinction.


Analogy

Explaining the circuit breaker pattern through a familiar concept.


Decision Tree

How the order service should handle an inventory call failure — the branching logic after adding the circuit breaker.

Breaker tripped due to sustained failures
Normal operation, send gRPC call with 5s deadline

Myth Correction


1,247 orders received 502 errors during the 23-minute window. Customers saw a "temporarily unavailable" message. No partial orders were created — the failure occurred before payment intent creation, so no charges were made. Customers can simply retry checkout.

Yes. Currently only the inventory service has a circuit breaker, and the fix is service-specific. The follow-up task is to wrap all cross-service calls in the circuit breaker factory:

export function withBreaker<T>(
  name: string,
  fn: (...args: any[]) => Promise<T>,
  opts?: Partial<BreakerOptions>
): (...args: any[]) => Promise<T> {
  const breaker = new CircuitBreaker(fn, {
    timeout: opts?.timeout ?? 5000,
    errorThresholdPercentage: opts?.errorThreshold ?? 50,
    resetTimeout: opts?.resetTimeout ?? 30000,
  });
  breaker.on('open', () => metrics.increment(`circuit.${name}.open`));
  return (...args) => breaker.fire(...args);
}

Deployment coordination: The circuit breaker config must be deployed to all order service pods simultaneously. A rolling deploy with mixed versions will cause inconsistent behavior — some pods trip the breaker while others keep retrying.


Evaluation Tracks

Multi-dimensional assessment of the proposed fix. Each track is an independent review dimension, not a sequential step.

Circuit breaker prevents cascade failures. gRPC deadlines ensure no request hangs indefinitely. Graceful degradation returns 503 within 2ms when breaker is open.

Breaker state transitions (open/closed/half-open) are not emitted to Datadog yet.

Schema migration tooling still allows blocking DDL. Until the linter enforces CONCURRENTLY, the root trigger can recur.

Circuit breaker only covers the inventory call. Payment service and warehouse API calls are still unprotected.

Evidence

The root cause conclusion is supported by three independent data sources.

pg_stat_activity showed 100 queries in wait_event_type = Lock state on the stock_reservations table during the incident window. Lock holder was the migration process (PID 48291).

Datadog shows grpc.client.duration for inventory.Reserve spiking to 30,000ms (gRPC sets no default deadline, so calls were bounded only by the HTTP/2 idle timeout). No DEADLINE_EXCEEDED errors because no deadline was set.

CloudWatch logs confirm 1,247 requests to POST /api/orders returned 502 between 14:23:01 and 14:46:18 UTC. Zero 502s in the preceding 24 hours.