Scalable Microservices Architecture: Transforming Enterprise Applications
We Decomposed a Monolithic Layer Processing 12,000+ Daily Transactions into Seven Independent Services — With Zero Downtime

A B2B distribution company had reached a breaking point. Roughly 12,000 orders per day flowed through an integration layer connecting its ERP (SAP Business One), warehouse management system, three logistics partners, and a payment gateway. That layer — a single Node.js monolith that had grown to around 180,000 lines of code over four years — was deployed as one unit. Every deployment was a full-system event requiring a 45-minute maintenance window, typically scheduled for Sunday nights. In February, a bug in the logistics rate-calculation module took down payment processing for three hours because both flows shared the same runtime.
Brainstack Technologies led a seven-month migration that decomposed this monolith into seven independently deployable microservices — without requiring any system downtime and without disrupting the 12,000+ orders flowing through the pipeline daily.
Project Overview
The Challenge
The integration layer had started life five years earlier as a straightforward Node.js application that connected SAP Business One to a single logistics provider via REST APIs. Over four years, as the company onboarded two additional logistics partners (one using SFTP file exchange, one using SOAP APIs), added a payment gateway integration, and built inventory synchronization with their warehouse management system, the codebase grew to approximately 180,000 lines — all in a single deployable unit.
By the time we were brought in, the problems were compounding: every deployment put the entire pipeline at risk, unrelated flows could take each other down through the shared runtime, and the weekly Sunday-night release window was throttling the pace of change.
Our Approach
Strangler Fig Migration Strategy
We ruled out a big-bang rewrite immediately. The integration layer was processing 12,000+ orders per day; it couldn't go offline, and the business couldn't tolerate running a new untested system in parallel for months. Instead, we used a strangler fig approach: extracting one service at a time from the monolith while the remaining monolith continued to handle everything else.
The first six weeks were spent on architecture and decomposition planning. We analyzed the monolith's code, database schema, and runtime call patterns (we instrumented the monolith with OpenTelemetry for two weeks to capture actual request flows, not just what the code suggested). This analysis revealed seven natural domain boundaries:
- Order Ingestion — receiving and validating incoming orders from the ERP
- Payment Processing — gateway communication, authorization, settlement
- Logistics: Partner A — REST-based carrier integration
- Logistics: Partner B — SFTP file exchange (daily batch)
- Logistics: Partner C — SOAP API integration
- Inventory Sync — bidirectional sync with the warehouse management system
- Notification & Alerting — order confirmations, shipment tracking, failure alerts
We intentionally split logistics into three separate services rather than one unified "logistics service." The three partners used fundamentally different protocols (REST, SFTP, SOAP), had different SLA requirements, and changed at different rates. Combining them into a single service would have recreated the coupling problem at a smaller scale.
The extraction order was deliberate: we started with Notification & Alerting (lowest risk, no transactional data, easiest to validate) and ended with Payment Processing (highest risk, regulatory requirements, most complex error handling). This gave the team progressively harder challenges rather than starting with the most dangerous one.
API Gateway and Service Communication
We deployed Kong as the API gateway in front of both the monolith and the emerging services. During migration, Kong handled the routing logic: requests for extracted domains (e.g., /notifications/*) were routed to the new service, while everything else continued to hit the monolith. As each service was extracted, we updated Kong's routing configuration — no code changes to the monolith required for the switchover.
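The routing split described above can be sketched in Kong's declarative config. This is an illustrative fragment, not the project's actual configuration: the service names, upstream hosts, and ports are assumptions.

```yaml
# Sketch of the migration-era routing split (names and hosts illustrative).
# Requests under an extracted domain go to the new service; the catch-all
# route keeps everything else on the monolith.
_format_version: "3.0"
services:
  - name: notification-service
    url: http://notifications.internal:3000
    routes:
      - name: notifications
        paths:
          - /notifications
  - name: legacy-monolith
    url: http://monolith.internal:8080
    routes:
      - name: catch-all
        paths:
          - /
```

Because Kong prefers the more specific path match, extracting another domain is a matter of adding one more service/route pair and removing it from the monolith's traffic, with no monolith code change.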
For inter-service communication, we used two patterns based on the consistency requirements of each flow:
Synchronous REST for the order-payment flow. When an order comes in, payment authorization must happen immediately and return a success/failure before the order is confirmed. This is a hard consistency requirement — eventual consistency is not acceptable for payment authorization. These calls go through internal REST APIs with circuit breakers (we used Opossum in Node.js) to prevent cascading failures.
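The production code used Opossum; as a simplified illustration of the circuit-breaker pattern itself (this is not Opossum's API, and the thresholds are arbitrary), a minimal breaker might look like:

```javascript
// Minimal circuit-breaker sketch: after `maxFailures` consecutive failures
// the breaker opens and rejects calls immediately; once `resetMs` has
// elapsed it allows a single trial call through (half-open state).
class CircuitBreaker {
  constructor(fn, { maxFailures = 5, resetMs = 30000 } = {}) {
    this.fn = fn;
    this.maxFailures = maxFailures;
    this.resetMs = resetMs;
    this.failures = 0;
    this.openedAt = null;
  }

  async fire(...args) {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error('circuit open: failing fast');
      }
      this.openedAt = null; // half-open: let one trial call through
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0; // any success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) {
        this.openedAt = Date.now(); // trip the breaker
      }
      throw err;
    }
  }
}
```

The real Opossum breaker adds rolling error-rate windows, fallbacks, and events; the sketch only shows the closed/open/half-open state machine that stops a struggling payment gateway from dragging down order ingestion.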
Asynchronous messaging via RabbitMQ for everything else. Inventory updates, logistics dispatch notifications, and alerting all use event-driven messaging. When an order is confirmed, an "order.confirmed" event is published to RabbitMQ, and the relevant services consume it independently. If the notification service is temporarily down, the message waits in the queue — the order isn't affected.
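A publisher for the "order.confirmed" event might look like the following sketch. The exchange name, routing key conventions, and payload fields are illustrative; the `channel` object mirrors amqplib's `publish(exchange, routingKey, content, options)` shape but is not tied to it.

```javascript
// Publish an order.confirmed event to a topic exchange so that inventory,
// logistics, and notification consumers can each react independently.
// Exchange name and payload shape are illustrative, not the real schema.
function publishOrderConfirmed(channel, order) {
  const event = {
    type: 'order.confirmed',
    occurredAt: new Date().toISOString(),
    orderId: order.id,
    customerId: order.customerId,
    lines: order.lines, // what downstream consumers need to act on
  };
  channel.publish(
    'orders',                        // topic exchange
    'order.confirmed',               // routing key consumers bind on
    Buffer.from(JSON.stringify(event)),
    { persistent: true }             // survive broker restarts
  );
  return event;
}
```

Each consumer binds its own durable queue to the exchange, which is what gives the isolation described above: a downed notification service simply accumulates messages in its queue while orders keep flowing.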
The hardest communication problem was the logistics batch service (Partner B). This partner expected a single consolidated SFTP file every four hours, but orders trickled in continuously. We built a small aggregation service that consumed individual order events from RabbitMQ, batched them into 4-hour windows, generated the SFTP file in the partner's expected format, and uploaded it on schedule. This service was arguably the most custom piece of the entire architecture.
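The core of that aggregation service is the windowing logic. A minimal sketch, with file generation and SFTP upload stubbed out and all names invented for illustration:

```javascript
// Sketch of the batching logic: orders are bucketed into fixed 4-hour
// windows by confirmation time; on each scheduled run the aggregator
// drains every window that has fully closed into one batch (which the
// real service would render as the partner's SFTP file).
const WINDOW_MS = 4 * 60 * 60 * 1000;

function windowStart(timestampMs) {
  // Align to the start of the 4-hour window containing this timestamp
  return Math.floor(timestampMs / WINDOW_MS) * WINDOW_MS;
}

class BatchAggregator {
  constructor() {
    this.buckets = new Map(); // windowStart -> orders[]
  }

  add(order, timestampMs = Date.now()) {
    const key = windowStart(timestampMs);
    if (!this.buckets.has(key)) this.buckets.set(key, []);
    this.buckets.get(key).push(order);
  }

  // Called on the upload schedule: return and clear every closed window.
  drainClosedWindows(nowMs = Date.now()) {
    const closed = [];
    for (const [key, orders] of this.buckets) {
      if (key + WINDOW_MS <= nowMs) {
        closed.push({ windowStart: key, orders });
        this.buckets.delete(key);
      }
    }
    return closed;
  }
}
```

Keeping the window keyed by event time rather than arrival time means a briefly delayed consumer still files each order into the window the partner expects it in.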
Containerization and Orchestration
Each of the seven services was containerized with Docker and deployed on a managed Kubernetes cluster (AWS EKS). We set up independent CI/CD pipelines using GitHub Actions — each service has its own repository, its own test suite, its own pipeline, and can be deployed to staging or production independently.
Deployment went from a 45-minute Sunday-night maintenance window to a rolling update that completes in under 4 minutes per service with zero downtime (Kubernetes rolling deployment strategy with readiness probes). The team now deploys individual services 8-12 times per week across the seven services combined, compared to the previous cadence of once per week for the entire monolith.
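The zero-downtime rollout comes from two Deployment settings working together: never dropping below full capacity, and only routing traffic to pods that pass their readiness probe. A sketch, with the image, port, and probe path as assumptions:

```yaml
# Illustrative Deployment fragment for one service (names/ports assumed).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-ingestion
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below full serving capacity
      maxSurge: 1         # roll one extra pod at a time
  selector:
    matchLabels:
      app: order-ingestion
  template:
    metadata:
      labels:
        app: order-ingestion
    spec:
      containers:
        - name: order-ingestion
          image: registry.example.com/order-ingestion:1.8.2
          ports:
            - containerPort: 3000
          readinessProbe:   # receive traffic only after reporting healthy
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
```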
For the payment service specifically, we configured more conservative deployment guardrails: canary deployments that route 5% of payment traffic to the new version for 10 minutes before proceeding, automatic rollback if the error rate exceeds 0.5%, and a mandatory staging environment test against the payment gateway's sandbox before production deployment. The February outage had made leadership understandably cautious about payment-related changes.
Technology Stack
Service Layer
- Node.js (Express) — Order Ingestion, Payment Processing, and three Logistics services. Node was the monolith's original language, so most extraction was straightforward.
- Python (FastAPI) — Inventory Sync service. The warehouse management system's SDK was Python-only, so this service was written in Python from scratch rather than wrapping the SDK in a Node.js child process.
Communication & Routing
- Kong API Gateway — chosen over AWS API Gateway because Kong allowed us to run the same gateway configuration in local development and production, simplifying the dev workflow. The routing rules that split traffic between monolith and services were managed as code in Kong's declarative config.
- RabbitMQ — chosen over Kafka because message throughput (~12K orders/day) didn't justify Kafka's operational complexity. RabbitMQ's simpler queue model was a better fit for the event patterns we needed.
Data
- PostgreSQL 15 — each service owns its own database schema. We enforced schema-per-service at the PostgreSQL level to prevent accidental cross-service queries.
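Enforcing schema-per-service at the database level comes down to per-service roles and grants. A sketch of one way to do it; the schema and role names are illustrative:

```sql
-- Each service connects with its own role and can only reach its own
-- schema. Schema and role names are illustrative.
CREATE SCHEMA payments;
CREATE ROLE payment_svc LOGIN;

GRANT USAGE ON SCHEMA payments TO payment_svc;
GRANT SELECT, INSERT, UPDATE, DELETE
  ON ALL TABLES IN SCHEMA payments TO payment_svc;
ALTER DEFAULT PRIVILEGES IN SCHEMA payments
  GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO payment_svc;

-- No grants on other services' schemas, so an accidental cross-service
-- query fails with a permission error instead of silently re-coupling data.
REVOKE ALL ON SCHEMA inventory FROM payment_svc;
```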
Infrastructure
- AWS EKS (Kubernetes), Docker, GitHub Actions for CI/CD (one pipeline per service)
Observability
- OpenTelemetry for distributed tracing, ELK Stack for centralized logging, Prometheus + Grafana for metrics and SLO dashboards
Results
Deployment Speed
Deploy time dropped from 45 minutes (full monolith) to under 4 minutes per service (rolling update, zero downtime). Maintenance windows eliminated entirely.
Deployment Frequency
From 1 deployment per week (Sunday night, coordinated) to 8-12 deployments per week across services. Individual services updated 1-3 times per week.
Incident Blast Radius
Before: a bug in any module could take down all 12,000+ daily transactions. After: failures isolated to the affected service. Payment service had two incidents — neither affected logistics or inventory.
Scaling Efficiency
First post-migration peak: Payment and Order Ingestion scaled to 3x while five services stayed at baseline — ~60% savings on peak-season infrastructure vs. scaling the entire monolith.
New Integration Speed
Onboarding a fourth logistics partner took 3 weeks post-migration. Pre-migration estimate: 8-10 weeks due to regression testing and deployment coordination.
Developer Productivity
Merge conflicts dropped substantially. Integration-related overhead down from ~30% to ~10% of developer time.
Observability & Monitoring
We established the observability stack before extracting the first service — this was one of the most valuable decisions in the project. By instrumenting the monolith with OpenTelemetry first, the team could see request flows across the monolith's internal modules. When we started extracting services, the tracing data simply reflected the new boundaries without any additional instrumentation work.
The observability stack includes:
Distributed tracing (OpenTelemetry + Jaeger): Every request that enters the system gets a trace ID that follows it across all seven services. When the ops team investigates a slow order, they can see exactly which service introduced the latency — including the time spent waiting for external partner APIs that Brainstack doesn't control.
Centralized logging (ELK Stack): All service logs ship to a shared Elasticsearch cluster, correlated by trace ID. Searching for a specific order ID returns logs from every service that touched that order, in chronological order.
SLO dashboards (Prometheus + Grafana): Each service has defined SLOs — for example, the Payment Processing service targets a p99 latency under 800ms and an error rate below 0.1%. When a service approaches its error budget, the dashboard alerts the on-call engineer before users are affected. The team reviews SLO compliance weekly.
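Alerting rules for the payment-service SLOs described above might look like the following sketch. The metric names depend entirely on how the services are instrumented, so treat them as assumptions:

```yaml
# Illustrative Prometheus rules for the payment SLOs (metric names assumed).
groups:
  - name: payment-slo
    rules:
      - alert: PaymentP99LatencyHigh   # SLO: p99 < 800ms
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="payment"}[5m])) by (le)
          ) > 0.8
        for: 10m
        labels:
          severity: page
      - alert: PaymentErrorRateHigh    # SLO: error rate < 0.1%
        expr: |
          sum(rate(http_requests_total{service="payment",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payment"}[5m])) > 0.001
        for: 5m
        labels:
          severity: page
```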
The observability investment paid for itself within the first month: the team's mean time to diagnose production issues dropped from roughly 2 hours (grepping through monolith logs) to about 15 minutes (tracing the request flow visually in Jaeger).




Key Engineering Lessons
Instrument before you extract. We deployed OpenTelemetry on the monolith two weeks before extracting the first service. This gave the team baseline visibility into request flows, latency, and error rates — which meant we could immediately compare a newly extracted service's performance against its monolith-era baseline. Without this, we would have been flying blind on whether each extraction improved or degraded performance.
Split logistics into three services, not one. Our initial architecture proposed a single "Logistics Service" handling all three partners. During planning, we realized that Partner B (SFTP batch) and Partner A (REST real-time) had fundamentally different runtime characteristics, failure modes, and change frequencies. Combining them would have created a mini-monolith. The decision to split saved us from re-coupling the architecture we had just decoupled.
The shared database was the real migration bottleneck, not the code. Extracting service code was relatively straightforward. The hard part was untangling shared database tables. The monolith used a single PostgreSQL database where the payment module and the logistics module both read from an "orders" table with 47 columns. We had to decide which service owned which columns, build data synchronization events for cross-service reads, and migrate foreign key relationships — all while 12,000 orders per day continued flowing through the system.
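The synchronization events mentioned above replace cross-service reads with a local read model. A minimal sketch of the idea, with all event and field names invented for illustration:

```javascript
// Sketch: instead of logistics reading payment columns from the shared
// "orders" table, the payment service publishes a change event for the
// columns it owns, and logistics keeps its own copy of the fields it
// needs. Event and field names are illustrative.
function onPaymentSettled(publish, orderId, settlement) {
  publish('order.payment_settled', {
    orderId,
    settledAt: settlement.settledAt,
    amount: settlement.amount, // only data the payment service owns
  });
}

// Logistics-side consumer maintains a local read model keyed by order.
function applyPaymentSettled(readModel, event) {
  const row = readModel.get(event.orderId) || { orderId: event.orderId };
  row.paymentSettled = true;
  row.settledAt = event.settledAt;
  readModel.set(event.orderId, row);
  return row;
}
```

The trade-off is eventual consistency on the logistics side, which is acceptable here precisely because (as noted earlier) only the order-payment flow required hard consistency.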
Team topology had to change alongside the architecture. The four developers who previously worked on the monolith initially continued reviewing each other's PRs across all services. This recreated the coordination overhead that microservices were supposed to eliminate. We restructured into two pairs, each responsible for a set of services end-to-end (deploy, monitor, fix). Deployment frequency doubled within two weeks of this change.
Spending Sunday Nights on Deployments?
If your deployments require maintenance windows, your scaling bills spike because you can't scale individual components, or a bug in one module can take down unrelated flows — the architecture is working against you. We start with a two-week instrumentation and analysis phase to map your actual request flows, identify natural service boundaries, and build a phased migration plan that doesn't require betting the business on a big-bang rewrite.
Explore Our Services