Reliability is event-driven, not REST
Why every gating integration in the Balady ecosystem now consumes events instead of synchronous callbacks — the lesson behind the Debts Hub revision.
The Debts Hub started life as a REST callback flow: when a citizen settled an outstanding debt, the payment-confirmation flow called back into the hub so the licence-issuing service could proceed. It worked in every demo. Under production load, it failed in the worst possible way — occasionally.
Why occasional failure is the expensive kind
A callback that fails loudly gets fixed in a sprint. A callback that fails one time in a thousand creates a recurring incident class: a citizen has paid, the licence still says blocked, and an operations engineer reconciles the mismatch by hand. The cost isn’t the failure rate — it’s the manual recovery loop attached to every single occurrence.
The revision
We replaced the REST callbacks with RabbitMQ-based event consumption on the payment-confirmation flow:
- The producer’s only job is to durably publish. Once the confirmation event is in the broker, delivery is the broker’s problem — retries, back-pressure, and ordering are infrastructure concerns, not application code.
- The consumer owns its own pace. Under load spikes the queue absorbs the burst; because a message stays queued until the consumer acknowledges it, nothing times out and nothing is dropped.
- A scheduled reconciler self-heals the tail. Real-time consumption covers normal operation; a periodic job sweeps for anything that slipped through, so the system converges even after an outage.
That combination eliminated the failure class — not reduced, eliminated — because there is no longer a synchronous moment where two services must both be healthy for state to transfer.
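The hybrid of real-time consumption and periodic reconciliation can be sketched in memory. This is an illustration under stated assumptions, not the Debts Hub implementation: the `PaymentConfirmed` event shape, the `DebtsHub` class, and the `reconcile` function are hypothetical names, and plain Python sets stand in for the broker and the databases. It shows two properties the design relies on: the consumer applies events idempotently, so a redelivery is harmless, and the reconciler settles anything the real-time path missed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PaymentConfirmed:
    """Hypothetical event shape; the field names are illustrative."""
    event_id: str
    debt_id: str


@dataclass
class DebtsHub:
    """In-memory stand-in for the consuming service's state."""
    settled: set = field(default_factory=set)      # debt_ids marked as paid
    seen_events: set = field(default_factory=set)  # event_ids already applied

    def handle(self, event: PaymentConfirmed) -> bool:
        """Apply an event idempotently; return True if it was newly applied."""
        if event.event_id in self.seen_events:
            return False  # duplicate delivery: safe to acknowledge and ignore
        self.seen_events.add(event.event_id)
        self.settled.add(event.debt_id)
        return True


def reconcile(paid_debt_ids, hub: DebtsHub) -> list:
    """Periodic sweep: settle debts recorded as paid that the hub has not seen."""
    missing = sorted(set(paid_debt_ids) - hub.settled)
    for debt_id in missing:
        # Synthetic event id (an assumption) so the sweep reuses the same handler.
        hub.handle(PaymentConfirmed(event_id=f"reconcile:{debt_id}", debt_id=debt_id))
    return missing


hub = DebtsHub()
hub.handle(PaymentConfirmed("e1", "debt-1"))
hub.handle(PaymentConfirmed("e1", "debt-1"))  # redelivery is a no-op

# "debt-2" was paid, but its event never reached the hub.
healed = reconcile(["debt-1", "debt-2"], hub)
```

Running the sweep a second time finds nothing to heal, which is the convergence the reconciler is there to provide.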
The generalisation
The lesson scaled beyond one system: every gating integration in the Balady ecosystem now consumes events, not synchronous callbacks. If service B cannot proceed until service A’s state changes, that state change travels as a durable event. REST stays for queries and commands where the caller can meaningfully handle a failure right now; it is the wrong tool when the caller’s failure handling is “hope and retry.”
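The difference between the two styles can be shown with a small simulation. This is a sketch under assumptions: `LicenceService`, `sync_callback`, and `drain` are hypothetical names, and a `collections.deque` stands in for a durable queue. With a callback, the state change is lost to the caller whenever the consumer is down at that moment; with a queue, the event waits until the consumer recovers.

```python
from collections import deque


class ConsumerDown(Exception):
    """Raised when the hypothetical consumer is unavailable."""


class LicenceService:
    """Hypothetical gated service that may be unhealthy at any given moment."""

    def __init__(self):
        self.healthy = True
        self.unblocked = []

    def on_payment_confirmed(self, debt_id):
        if not self.healthy:
            raise ConsumerDown(debt_id)
        self.unblocked.append(debt_id)


def sync_callback(service, debt_id):
    """Callback style: succeeds only if the consumer is healthy right now."""
    try:
        service.on_payment_confirmed(debt_id)
        return True
    except ConsumerDown:
        return False  # the caller is left with "hope and retry"


def drain(queue, service):
    """Event style: deliver what we can and keep the rest queued."""
    while queue:
        try:
            service.on_payment_confirmed(queue[0])
        except ConsumerDown:
            break  # leave the event at the head for the next attempt
        queue.popleft()  # remove only after successful handling, like an ack


service = LicenceService()
service.healthy = False

callback_ok = sync_callback(service, "debt-1")  # fails: consumer is down

queue = deque(["debt-1", "debt-2"])
drain(queue, service)          # nothing delivered, nothing lost
pending_during_outage = len(queue)

service.healthy = True
drain(queue, service)          # the backlog is delivered in order
```

The queue removes an event only after the handler returns, which mirrors acknowledging a message after processing rather than on receipt.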
The full architecture is in the Debts Hub case study, including the container view and the ADRs behind the hybrid real-time + reconciliation design.