Reliability is event-driven, not REST

Why every gating integration in the Balady ecosystem now consumes events instead of synchronous callbacks — the lesson behind the Debts Hub revision.

The Debts Hub started life as a REST callback flow: when a citizen settled an outstanding debt, the payment-confirmation flow called back into the hub so the licence-issuing service could proceed. It worked in every demo. Under production load, it failed in the worst possible way — occasionally.

Why occasional failure is the expensive kind

A callback that fails loudly gets fixed in a sprint. A callback that fails one time in a thousand creates a recurring incident class: a citizen has paid, the licence still says blocked, and an operations engineer reconciles the mismatch by hand. The cost isn’t the failure rate — it’s the manual recovery loop attached to every single occurrence.

The revision

We replaced the REST callbacks with RabbitMQ-based event consumption on the payment-confirmation flow:

That combination eliminated the failure class — not reduced, eliminated — because there is no longer a synchronous moment where two services must both be healthy for state to transfer.
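To illustrate why removing the synchronous moment matters, here is a minimal Python sketch of the consumer side. It is an in-memory model of at-least-once delivery with acknowledgement, not RabbitMQ client code and not the Debts Hub implementation. It assumes two things this post does not spell out: the consumer acknowledges an event only after applying it, and it deduplicates by event id. The names DebtSettled, LicenceGate and drain are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DebtSettled:
    """Hypothetical event emitted once a payment is confirmed."""
    event_id: str
    citizen_id: str


@dataclass
class LicenceGate:
    """Hypothetical consumer-side state of the licence-issuing service."""
    unblocked: set = field(default_factory=set)
    seen_event_ids: set = field(default_factory=set)
    healthy: bool = True

    def handle(self, event: DebtSettled) -> bool:
        """Return True to acknowledge, False to leave the event queued."""
        if not self.healthy:
            return False  # consumer is down: no ack, the event stays queued
        if event.event_id in self.seen_event_ids:
            return True  # redelivery of an applied event: acknowledge only
        self.unblocked.add(event.citizen_id)
        self.seen_event_ids.add(event.event_id)
        return True


def drain(queue: deque, gate: LicenceGate) -> None:
    """Deliver queued events in order; stop at the first unacknowledged one."""
    while queue:
        if not gate.handle(queue[0]):
            return  # keep the event at the head for a later attempt
        queue.popleft()  # remove only after the handler acknowledged


# Usage: the consumer is down when the payment is confirmed.
queue = deque([DebtSettled("evt-1", "citizen-42")])
gate = LicenceGate(healthy=False)
drain(queue, gate)  # nothing is applied, and nothing is lost

# The consumer recovers, and the broker also redelivers a duplicate.
gate.healthy = True
queue.append(DebtSettled("evt-1", "citizen-42"))
drain(queue, gate)  # the event is applied once; the duplicate is only acked
```

In this model the producer never needs the consumer to be healthy at the moment of payment: the event waits in the queue until it is acknowledged, and the event-id check makes a redelivery harmless.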

The generalisation

The lesson scaled beyond one system: every gating integration in the Balady ecosystem now consumes events, not synchronous callbacks. If service B cannot proceed until service A’s state changes, that state change travels as a durable event. REST stays for queries and commands where the caller can meaningfully handle a failure right now; it is the wrong tool when the caller’s failure handling is “hope and retry.”

The full architecture is in the Debts Hub case study, including the container view and the ADRs behind the hybrid real-time + reconciliation design.