The Silent API Killer: How Caching Broke My Integration and How I Fixed It

Source: DEV Community
I recently spent two days chasing a ghost in a machine-to-machine API integration. The integration worked flawlessly in staging, but in production our webhook endpoint sporadically failed to acknowledge receipts from a critical third-party service, causing duplicate processing and missed events. The logs from their service showed a 200 OK response from our endpoint, but our logs showed that the corresponding background job never ran. The issue was intermittent and impossible to reproduce locally, a classic sign of a state or environmental problem.

After exhaustive logging, I discovered that our production API gateway (a layer we didn't control) was aggressively caching POST requests. Because our idempotency key header was static for a period (a business logic decision to group events), the gateway served a cached 200 response for subsequent identical POSTs before they even hit our application server. Our app never saw the request, so the job was never enqueued. The "healthy" 200 response from the gat
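To make the failure mode described above concrete, here is a minimal simulation in Python. It is only a sketch under stated assumptions: the `App` and `CachingGateway` classes, the `/webhook` path, the `Idempotency-Key` header name, and the cache key of method + path + idempotency key are all hypothetical stand-ins, not the real gateway's behavior or configuration.

```python
from dataclasses import dataclass, field


@dataclass
class App:
    """Hypothetical stand-in for the application server."""
    enqueued: list = field(default_factory=list)

    def handle_webhook(self, headers: dict, body: str) -> int:
        # The background job is only enqueued if the request reaches this point.
        self.enqueued.append(body)
        return 200


@dataclass
class CachingGateway:
    """Toy gateway that caches POST responses (the problematic behavior)."""
    app: App
    cache: dict = field(default_factory=dict)

    def post(self, path: str, headers: dict, body: str) -> int:
        # Assumed cache key: method + path + idempotency key.
        key = ("POST", path, headers.get("Idempotency-Key"))
        if key in self.cache:
            # Served from cache: the sender sees 200, the app sees nothing.
            return self.cache[key]
        status = self.app.handle_webhook(headers, body)
        self.cache[key] = status
        return status


app = App()
gateway = CachingGateway(app)

# The idempotency key stays the same for a period, as in the post.
headers = {"Idempotency-Key": "batch-0001"}

first = gateway.post("/webhook", headers, '{"event": 1}')
second = gateway.post("/webhook", headers, '{"event": 2}')

print(first, second)      # 200 200
print(len(app.enqueued))  # 1
```

Both calls look healthy from the sender's side, yet only the first one reaches the application, which matches the symptom of a logged 200 OK with no corresponding background job.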