DMS Migration (Hybrid-Cloud)
2025
Multi-terabyte, multi-million-record migration from on-premise file storage to cloud object storage (in-region) using Strangler Fig with per-document backend-flag tracking — zero-downtime cutover.
The problem
A multi-terabyte, multi-million-record document store lived on aging on-premise file-storage hardware. Capacity was tight; the operational cost was high; disaster recovery was complex.
The constraint
Documents are read continuously by the Balady citizen super-app. A big-bang cutover would have meant downtime — unacceptable.
The architecture
A hybrid-cloud bridge: on-premise file storage → cloud object storage, in-region, transported over a secure site-to-site tunnel.
The migration uses a Strangler Fig pattern: each document carries a backend flag in its metadata store (legacy or object-storage) that routes reads to the right backend. New writes follow each product’s current backend setting, so a not-yet-migrated product keeps writing to the legacy store until it is cut over; existing documents are background-migrated in batches, and the flag flips per document as it moves.
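The flag-based routing described above can be sketched roughly as follows. This is a minimal illustration, not the production code: `DocumentRouter`, `InMemoryStore`, `DocumentRecord`, and the in-memory metadata dictionary are hypothetical stand-ins, and the verify-before-flip step and the retained legacy copy used by `rollback` are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    LEGACY = "legacy"
    OBJECT_STORAGE = "object-storage"


@dataclass
class DocumentRecord:
    doc_id: str
    product: str
    backend: Backend  # the per-document backend flag


class InMemoryStore:
    """Hypothetical stand-in for a real storage backend."""

    def __init__(self):
        self._blobs = {}

    def get(self, doc_id):
        return self._blobs[doc_id]

    def put(self, doc_id, data):
        self._blobs[doc_id] = data


class DocumentRouter:
    def __init__(self, legacy, cloud, product_backend):
        self.stores = {Backend.LEGACY: legacy, Backend.OBJECT_STORAGE: cloud}
        # product name -> backend that new writes for that product go to
        self.product_backend = product_backend
        # doc_id -> DocumentRecord (a real system would use the metadata database)
        self.metadata = {}

    def read(self, doc_id):
        # Reads follow the flag stored on the document's own record.
        record = self.metadata[doc_id]
        return self.stores[record.backend].get(doc_id)

    def write(self, doc_id, product, data):
        # New writes follow the product's current setting (legacy by default).
        backend = self.product_backend.get(product, Backend.LEGACY)
        self.stores[backend].put(doc_id, data)
        self.metadata[doc_id] = DocumentRecord(doc_id, product, backend)

    def migrate_batch(self, doc_ids):
        """Copy each legacy document to object storage, then flip its flag."""
        cloud = self.stores[Backend.OBJECT_STORAGE]
        moved = 0
        for doc_id in doc_ids:
            record = self.metadata[doc_id]
            if record.backend is Backend.OBJECT_STORAGE:
                continue  # already migrated, so the batch is safe to re-run
            data = self.stores[Backend.LEGACY].get(doc_id)
            cloud.put(doc_id, data)
            if cloud.get(doc_id) != data:
                continue  # assumed safeguard: never flip an unverified copy
            record.backend = Backend.OBJECT_STORAGE
            moved += 1
        return moved

    def rollback(self, doc_id):
        # Assumes the legacy copy is retained after migration.
        self.metadata[doc_id].backend = Backend.LEGACY
```

Because the flag changes only after the copy is confirmed, a reader sees either the legacy copy or the migrated copy of a document, never a partially transferred one.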
The Attachments Service connects directly to the cloud store over the tunnel to avoid an extra latency hop. An API gateway supplies short-lived object-storage credentials. Because the cross-site link is the throughput bottleneck, transfers are throttled and scheduled into off-peak windows, with connection pooling and multipart uploads for larger objects to keep the migration within the link’s budget.
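As a rough illustration of how throttling, off-peak scheduling, and multipart planning might fit together, consider the sketch below. Every number in it is an assumption chosen for the example (the 22:00 to 06:00 window, the 64 MiB multipart threshold, the 16 MiB part size), and `TokenBucket`, `in_off_peak`, and `plan_parts` are hypothetical helpers rather than the project's actual code. Connection pooling is left out for brevity.

```python
from datetime import time

# Assumed values for illustration only; the real window and sizes are not stated.
OFF_PEAK_START = time(22, 0)
OFF_PEAK_END = time(6, 0)
MULTIPART_THRESHOLD = 64 * 1024 * 1024  # objects larger than this are split
PART_SIZE = 16 * 1024 * 1024


def in_off_peak(now: time) -> bool:
    """True if `now` falls inside the off-peak window (which may wrap midnight)."""
    if OFF_PEAK_START <= OFF_PEAK_END:
        return OFF_PEAK_START <= now < OFF_PEAK_END
    return now >= OFF_PEAK_START or now < OFF_PEAK_END


def plan_parts(size_bytes: int):
    """Return (offset, length) pairs: one part for small objects, many for large."""
    if size_bytes <= MULTIPART_THRESHOLD:
        return [(0, size_bytes)]
    return [
        (offset, min(PART_SIZE, size_bytes - offset))
        for offset in range(0, size_bytes, PART_SIZE)
    ]


class TokenBucket:
    """Byte-rate limiter that keeps transfers within a bandwidth budget."""

    def __init__(self, rate_bytes_per_s: float, capacity_bytes: float):
        self.rate = rate_bytes_per_s
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes
        self.last = 0.0

    def wait_time(self, nbytes: int, now: float) -> float:
        """Reserve `nbytes` at time `now`; return how many seconds to wait first."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        self.tokens -= nbytes  # may go negative: the deficit is paid by waiting
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate
```

A migration worker built on these helpers would check `in_off_peak` before starting a batch, call `plan_parts` for each object, and sleep for the value returned by `wait_time` before sending each part.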
Pattern view — reads routed per record by a backend flag; background batches migrate and flip flags one at a time.
Trade-offs (ADRs)
- Migrate storage only, keep compute on-prem — moving just the document store (not the applications or metadata database) kept scope small and preserved the operational model; the accepted cost is that link bandwidth between sites becomes the critical performance factor
- Per-document backend flag over a global cutover date — incremental migration, instant rollback per document
- Private tunnel over public object-storage endpoint — latency + sovereignty
- Gateway-brokered credentials, direct data path — the gateway issues short-lived storage credentials, but the service connects to the object store directly so the high-volume data path avoids an extra hop while credential handling stays centralized
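The last trade-off, gateway-brokered credentials with a direct data path, might look something like the following on the service side. This is a sketch under stated assumptions: `StorageCredentials`, `CredentialCache`, the `fetch` callable standing in for the gateway call, and the 60-second refresh margin are all hypothetical and not taken from the actual system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class StorageCredentials:
    access_key: str
    secret_key: str
    expires_at: float  # epoch seconds


class CredentialCache:
    """Reuse short-lived credentials and refresh them shortly before expiry.

    `fetch` stands in for the API gateway call (hypothetical); `clock` returns
    the current time in epoch seconds, e.g. `time.time`.
    """

    def __init__(
        self,
        fetch: Callable[[], StorageCredentials],
        clock: Callable[[], float],
        refresh_margin_s: float = 60.0,  # assumed safety margin
    ):
        self._fetch = fetch
        self._clock = clock
        self._margin = refresh_margin_s
        self._current: Optional[StorageCredentials] = None

    def get(self) -> StorageCredentials:
        now = self._clock()
        # Refresh when nothing is cached or the remaining lifetime is too short.
        if self._current is None or self._current.expires_at - now <= self._margin:
            self._current = self._fetch()
        return self._current
```

With a cache like this, the gateway is contacted only when credentials are about to expire, while every read and write goes straight to the object store using the cached credentials.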
Outcome
The migration completed with zero downtime for the citizen-facing API. A pilot product validated the pattern before it was rolled out to the full estate. Beyond capacity relief, the cloud target unlocked pay-per-use economics, cross-region replication for disaster recovery, and automated lifecycle policies for long-term retention compliance.