Scalable Microservices Architecture: Transforming Enterprise Applications
We Decomposed a Monolithic Layer Processing 12,000+ Daily Transactions into Seven Independent Services — With Zero Downtime

A B2B distribution company had reached a breaking point. Roughly 12,000 orders per day flowed through an integration layer connecting its ERP (SAP Business One), warehouse management system, three logistics partners, and a payment gateway. That layer — a single Node.js monolith that had grown to around 180,000 lines of code over four years — was deployed as one unit. Every deployment was a full-system event requiring a 45-minute maintenance window, typically scheduled for Sunday nights. In February, a bug in the logistics rate-calculation module took down payment processing for three hours because both flows shared the same runtime.
Brainstack Technologies led a seven-month migration that decomposed this monolith into seven independently deployable microservices — without requiring any system downtime and without disrupting the 12,000+ orders flowing through the pipeline daily.
Project Overview
The Challenge
The integration layer had started life five years earlier as a straightforward Node.js application that connected SAP Business One to a single logistics provider via REST APIs. Over four years, as the company onboarded two additional logistics partners (one using SFTP file exchange, one using SOAP APIs), added a payment gateway integration, and built inventory synchronization with their warehouse management system, the codebase grew to approximately 180,000 lines — all in a single deployable unit.
By the time we were brought in, the problems were compounding: every deployment put the entire pipeline at risk, unrelated flows could take each other down through the shared runtime, and the weekly Sunday-night release window was throttling the pace of change.
Our Approach
Strangler Fig Migration Strategy
We ruled out a big-bang rewrite immediately. The integration layer was processing 12,000+ orders per day; it couldn't go offline, and the business couldn't tolerate running a new untested system in parallel for months. Instead, we used a strangler fig approach: extracting one service at a time from the monolith while the remaining monolith continued to handle everything else.
The first six weeks were spent on architecture and decomposition planning. We analyzed the monolith's code, database schema, and runtime call patterns (we instrumented the monolith with OpenTelemetry for two weeks to capture actual request flows, not just what the code suggested). This analysis revealed seven natural domain boundaries:
- Order Ingestion — receiving and validating incoming orders from the ERP
- Payment Processing — gateway communication, authorization, settlement
- Logistics: Partner A — REST-based carrier integration
- Logistics: Partner B — SFTP file exchange (daily batch)
- Logistics: Partner C — SOAP API integration
- Inventory Sync — bidirectional sync with the warehouse management system
- Notification & Alerting — order confirmations, shipment tracking, failure alerts
We intentionally split logistics into three separate services rather than one unified "logistics service." The three partners used fundamentally different protocols (REST, SFTP, SOAP), had different SLA requirements, and changed at different rates. Combining them into a single service would have recreated the coupling problem at a smaller scale.
The extraction order was deliberate: we started with Notification & Alerting (lowest risk, no transactional data, easiest to validate) and ended with Payment Processing (highest risk, regulatory requirements, most complex error handling). This gave the team progressively harder challenges rather than starting with the most dangerous one.
API Gateway and Service Communication
We deployed Kong as the API gateway in front of both the monolith and the emerging services. During migration, Kong handled the routing logic: requests for extracted domains (e.g., /notifications/*) were routed to the new service, while everything else continued to hit the monolith. As each service was extracted, we updated Kong's routing configuration — no code changes to the monolith required for the switchover.
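The routing split described above can be sketched in Kong's declarative config. This is an illustrative fragment, not the project's actual configuration: the service names, upstream hosts, and ports are assumptions.

```yaml
# Sketch of the migration-era routing split (names and hosts illustrative).
# Requests under an extracted domain go to the new service; the catch-all
# route keeps everything else on the monolith.
_format_version: "3.0"
services:
  - name: notification-service
    url: http://notifications.internal:3000
    routes:
      - name: notifications
        paths:
          - /notifications
  - name: legacy-monolith
    url: http://monolith.internal:8080
    routes:
      - name: catch-all
        paths:
          - /
```

Because Kong prefers the more specific path match, extracting another domain is a matter of adding one more service/route pair and removing it from the monolith's traffic, with no monolith code change.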
For inter-service communication, we used two patterns based on the consistency requirements of each flow:
Synchronous REST for the order-payment flow. When an order comes in, payment authorization must happen immediately and return a success/failure before the order is confirmed. This is a hard consistency requirement — eventual consistency is not acceptable for payment authorization. These calls go through internal REST APIs with circuit breakers (we used Opossum in Node.js) to prevent cascading failures.
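The production code used Opossum; as a simplified illustration of the circuit-breaker pattern itself (this is not Opossum's API, and the thresholds are arbitrary), a minimal breaker might look like:

```javascript
// Minimal circuit-breaker sketch: after `maxFailures` consecutive failures
// the breaker opens and rejects calls immediately; once `resetMs` has
// elapsed it allows a single trial call through (half-open state).
class CircuitBreaker {
  constructor(fn, { maxFailures = 5, resetMs = 30000 } = {}) {
    this.fn = fn;
    this.maxFailures = maxFailures;
    this.resetMs = resetMs;
    this.failures = 0;
    this.openedAt = null;
  }

  async fire(...args) {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error('circuit open: failing fast');
      }
      this.openedAt = null; // half-open: let one trial call through
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0; // any success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) {
        this.openedAt = Date.now(); // trip the breaker
      }
      throw err;
    }
  }
}
```

The real Opossum breaker adds rolling error-rate windows, fallbacks, and events; the sketch only shows the closed/open/half-open state machine that stops a struggling payment gateway from dragging down order ingestion.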
Asynchronous messaging via RabbitMQ for everything else. Inventory updates, logistics dispatch notifications, and alerting all use event-driven messaging. When an order is confirmed, an "order.confirmed" event is published to RabbitMQ, and the relevant services consume it independently. If the notification service is temporarily down, the message waits in the queue — the order isn't affected.
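A publisher for the "order.confirmed" event might look like the following sketch. The exchange name, routing key conventions, and payload fields are illustrative; the `channel` object mirrors amqplib's `publish(exchange, routingKey, content, options)` shape but is not tied to it.

```javascript
// Publish an order.confirmed event to a topic exchange so that inventory,
// logistics, and notification consumers can each react independently.
// Exchange name and payload shape are illustrative, not the real schema.
function publishOrderConfirmed(channel, order) {
  const event = {
    type: 'order.confirmed',
    occurredAt: new Date().toISOString(),
    orderId: order.id,
    customerId: order.customerId,
    lines: order.lines, // what downstream consumers need to act on
  };
  channel.publish(
    'orders',                        // topic exchange
    'order.confirmed',               // routing key consumers bind on
    Buffer.from(JSON.stringify(event)),
    { persistent: true }             // survive broker restarts
  );
  return event;
}
```

Each consumer binds its own durable queue to the exchange, which is what gives the isolation described above: a downed notification service simply accumulates messages in its queue while orders keep flowing.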
The hardest communication problem was the logistics batch service (Partner B). This partner expected a single consolidated SFTP file every four hours, but orders trickled in continuously. We built a small aggregation service that consumed individual order events from RabbitMQ, batched them into 4-hour windows, generated the SFTP file in the partner's expected format, and uploaded it on schedule. This service was arguably the most custom piece of the entire architecture.
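The core of that aggregation service is the windowing logic. A minimal sketch, with file generation and SFTP upload stubbed out and all names invented for illustration:

```javascript
// Sketch of the batching logic: orders are bucketed into fixed 4-hour
// windows by confirmation time; on each scheduled run the aggregator
// drains every window that has fully closed into one batch (which the
// real service would render as the partner's SFTP file).
const WINDOW_MS = 4 * 60 * 60 * 1000;

function windowStart(timestampMs) {
  // Align to the start of the 4-hour window containing this timestamp
  return Math.floor(timestampMs / WINDOW_MS) * WINDOW_MS;
}

class BatchAggregator {
  constructor() {
    this.buckets = new Map(); // windowStart -> orders[]
  }

  add(order, timestampMs = Date.now()) {
    const key = windowStart(timestampMs);
    if (!this.buckets.has(key)) this.buckets.set(key, []);
    this.buckets.get(key).push(order);
  }

  // Called on the upload schedule: return and clear every closed window.
  drainClosedWindows(nowMs = Date.now()) {
    const closed = [];
    for (const [key, orders] of this.buckets) {
      if (key + WINDOW_MS <= nowMs) {
        closed.push({ windowStart: key, orders });
        this.buckets.delete(key);
      }
    }
    return closed;
  }
}
```

Keeping the window keyed by event time rather than arrival time means a briefly delayed consumer still files each order into the window the partner expects it in.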
Containerization and Orchestration
Each of the seven services was containerized with Docker and deployed on a managed Kubernetes cluster (AWS EKS). We set up independent CI/CD pipelines using GitHub Actions — each service has its own repository, its own test suite, its own pipeline, and can be deployed to staging or production independently.
Deployment went from a 45-minute Sunday-night maintenance window to a rolling update that completes in under 4 minutes per service with zero downtime (Kubernetes rolling deployment strategy with readiness probes). The team now deploys individual services 8-12 times per week across the seven services combined, compared to the previous cadence of once per week for the entire monolith.
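The zero-downtime rollout comes from two Deployment settings working together: never dropping below full capacity, and only routing traffic to pods that pass their readiness probe. A sketch, with the image, port, and probe path as assumptions:

```yaml
# Illustrative Deployment fragment for one service (names/ports assumed).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-ingestion
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below full serving capacity
      maxSurge: 1         # roll one extra pod at a time
  selector:
    matchLabels:
      app: order-ingestion
  template:
    metadata:
      labels:
        app: order-ingestion
    spec:
      containers:
        - name: order-ingestion
          image: registry.example.com/order-ingestion:1.8.2
          ports:
            - containerPort: 3000
          readinessProbe:   # receive traffic only after reporting healthy
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
```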
For the payment service specifically, we configured more conservative deployment guardrails: canary deployments that route 5% of payment traffic to the new version for 10 minutes before proceeding, automatic rollback if the error rate exceeds 0.5%, and a mandatory staging environment test against the payment gateway's sandbox before production deployment. The February outage had made leadership understandably cautious about payment-related changes.
Technology Stack
Service Layer
- Node.js (Express) — Order Ingestion, Payment Processing, and three Logistics services. Node was the monolith's original language, so most extraction was straightforward.
- Python (FastAPI) — Inventory Sync service. The warehouse management system's SDK was Python-only, so this service was written in Python from scratch rather than wrapping the SDK in a Node.js child process.
Communication & Routing
- Kong API Gateway — chosen over AWS API Gateway because Kong allowed us to run the same gateway configuration in local development and production, simplifying the dev workflow. The routing rules that split traffic between monolith and services were managed as code in Kong's declarative config.
- RabbitMQ — chosen over Kafka because message throughput (~12K orders/day) didn't justify Kafka's operational complexity. RabbitMQ's simpler queue model was a better fit for the event patterns we needed.
Data
- PostgreSQL 15 — each service owns its own database schema. We enforced schema-per-service at the PostgreSQL level to prevent accidental cross-service queries.
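Enforcing schema-per-service at the database level comes down to per-service roles and grants. A sketch of one way to do it; the schema and role names are illustrative:

```sql
-- Each service connects with its own role and can only reach its own
-- schema. Schema and role names are illustrative.
CREATE SCHEMA payments;
CREATE ROLE payment_svc LOGIN;

GRANT USAGE ON SCHEMA payments TO payment_svc;
GRANT SELECT, INSERT, UPDATE, DELETE
  ON ALL TABLES IN SCHEMA payments TO payment_svc;
ALTER DEFAULT PRIVILEGES IN SCHEMA payments
  GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO payment_svc;

-- No grants on other services' schemas, so an accidental cross-service
-- query fails with a permission error instead of silently re-coupling data.
REVOKE ALL ON SCHEMA inventory FROM payment_svc;
```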
Infrastructure
- AWS EKS (Kubernetes), Docker, GitHub Actions for CI/CD (one pipeline per service)
Observability
- OpenTelemetry for distributed tracing, ELK Stack for centralized logging, Prometheus + Grafana for metrics and SLO dashboards
Results
Deployment Speed
Deploy time dropped from 45 minutes (full monolith) to under 4 minutes per service (rolling update, zero downtime). Maintenance windows eliminated entirely.
Deployment Frequency
From 1 deployment per week (Sunday night, coordinated) to 8-12 deployments per week across services. Individual services updated 1-3 times per week.
Incident Blast Radius
Before: a bug in any module could take down all 12,000+ daily transactions. After: failures isolated to the affected service. Payment service had two incidents — neither affected logistics or inventory.
Scaling Efficiency
First post-migration peak: Payment and Order Ingestion scaled to 3x while five services stayed at baseline — ~60% savings on peak-season infrastructure vs. scaling the entire monolith.
New Integration Speed
Onboarding a fourth logistics partner took 3 weeks post-migration. Pre-migration estimate: 8-10 weeks due to regression testing and deployment coordination.
Developer Productivity
Merge conflicts dropped substantially. Integration-related overhead down from ~30% to ~10% of developer time.
Observability & Monitoring
We established the observability stack before extracting the first service — this was one of the most valuable decisions in the project. By instrumenting the monolith with OpenTelemetry first, the team could see request flows across the monolith's internal modules. When we started extracting services, the tracing data simply reflected the new boundaries without any additional instrumentation work.
The observability stack includes:
Distributed tracing (OpenTelemetry + Jaeger): Every request that enters the system gets a trace ID that follows it across all seven services. When the ops team investigates a slow order, they can see exactly which service introduced the latency — including the time spent waiting for external partner APIs that Brainstack doesn't control.
Centralized logging (ELK Stack): All service logs ship to a shared Elasticsearch cluster, correlated by trace ID. Searching for a specific order ID returns logs from every service that touched that order, in chronological order.
SLO dashboards (Prometheus + Grafana): Each service has defined SLOs — for example, the Payment Processing service targets a p99 latency under 800ms and an error rate below 0.1%. When a service approaches its error budget, the dashboard alerts the on-call engineer before users are affected. The team reviews SLO compliance weekly.
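Alerting rules for the payment-service SLOs described above might look like the following sketch. The metric names depend entirely on how the services are instrumented, so treat them as assumptions:

```yaml
# Illustrative Prometheus rules for the payment SLOs (metric names assumed).
groups:
  - name: payment-slo
    rules:
      - alert: PaymentP99LatencyHigh   # SLO: p99 < 800ms
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="payment"}[5m])) by (le)
          ) > 0.8
        for: 10m
        labels:
          severity: page
      - alert: PaymentErrorRateHigh    # SLO: error rate < 0.1%
        expr: |
          sum(rate(http_requests_total{service="payment",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payment"}[5m])) > 0.001
        for: 5m
        labels:
          severity: page
```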
The observability investment paid for itself within the first month: the team's mean time to diagnose production issues dropped from roughly 2 hours (grepping through monolith logs) to about 15 minutes (tracing the request flow visually in Jaeger).




Key Engineering Lessons
Instrument before you extract. We deployed OpenTelemetry on the monolith two weeks before extracting the first service. This gave the team baseline visibility into request flows, latency, and error rates — which meant we could immediately compare a newly extracted service's performance against its monolith-era baseline. Without this, we would have been flying blind on whether each extraction improved or degraded performance.
Split logistics into three services, not one. Our initial architecture proposed a single "Logistics Service" handling all three partners. During planning, we realized that Partner B (SFTP batch) and Partner A (REST real-time) had fundamentally different runtime characteristics, failure modes, and change frequencies. Combining them would have created a mini-monolith. The decision to split saved us from re-coupling the architecture we had just decoupled.
The shared database was the real migration bottleneck, not the code. Extracting service code was relatively straightforward. The hard part was untangling shared database tables. The monolith used a single PostgreSQL database where the payment module and the logistics module both read from an "orders" table with 47 columns. We had to decide which service owned which columns, build data synchronization events for cross-service reads, and migrate foreign key relationships — all while 12,000 orders per day continued flowing through the system.
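The synchronization events mentioned above replace cross-service reads with a local read model. A minimal sketch of the idea, with all event and field names invented for illustration:

```javascript
// Sketch: instead of logistics reading payment columns from the shared
// "orders" table, the payment service publishes a change event for the
// columns it owns, and logistics keeps its own copy of the fields it
// needs. Event and field names are illustrative.
function onPaymentSettled(publish, orderId, settlement) {
  publish('order.payment_settled', {
    orderId,
    settledAt: settlement.settledAt,
    amount: settlement.amount, // only data the payment service owns
  });
}

// Logistics-side consumer maintains a local read model keyed by order.
function applyPaymentSettled(readModel, event) {
  const row = readModel.get(event.orderId) || { orderId: event.orderId };
  row.paymentSettled = true;
  row.settledAt = event.settledAt;
  readModel.set(event.orderId, row);
  return row;
}
```

The trade-off is eventual consistency on the logistics side, which is acceptable here precisely because (as noted earlier) only the order-payment flow required hard consistency.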
Team topology had to change alongside the architecture. The four developers who previously worked on the monolith initially continued reviewing each other's PRs across all services. This recreated the coordination overhead that microservices were supposed to eliminate. We restructured into two pairs, each responsible for a set of services end-to-end (deploy, monitor, fix). Deployment frequency doubled within two weeks of this change.
Spending Sunday Nights on Deployments?
If your deployments require maintenance windows, your scaling bills spike because you can't scale individual components, or a bug in one module can take down unrelated flows — the architecture is working against you. We start with a two-week instrumentation and analysis phase to map your actual request flows, identify natural service boundaries, and build a phased migration plan that doesn't require betting the business on a big-bang rewrite.
Explore Our Services