Removing Barriers of Silos by Building a Unified Data Platform
A Data Pipeline That Consolidates Field Surveys, Satellite Imagery, and Partner Data from Six Countries into a Single Dashboard, Cutting the Reporting Cycle from Five Weeks to One Day

An international organization focused on agricultural sustainability research collects assessment data from smallholder farming communities across six countries in East Africa and Latin America. Their assessments measure farm-level practices — soil management, input usage, yield, income — alongside environmental indicators derived from satellite imagery. The data feeds into sustainability benchmarks, certification audits, and advisory programs used by global commodity buyers and development agencies.
Before this engagement, every reporting cycle was a five-week manual process. Field assessors submitted surveys from a mobile data collection app — often with connectivity delays of days or weeks from remote rural areas. Satellite imagery was downloaded and processed manually by a GIS analyst. Partner organizations submitted their data in formats ranging from structured CSVs to scanned PDF tables. The analytics team spent more time cleaning and reconciling data than actually analyzing it.
Brainstack Technologies designed and built an automated data pipeline that ingests data from the field survey platform, satellite imagery APIs, and partner submissions — transforms, validates, and links it — and delivers clean, analysis-ready datasets into a centralized warehouse. The reporting cycle went from five weeks of manual preparation to automated next-day delivery.
Project Overview
The Challenge
The organization runs annual sustainability assessments across smallholder farming communities — primarily coffee and cocoa producers — in six countries spanning East Africa and Latin America. The assessment data feeds directly into sustainability benchmarks that global commodity buyers and certification bodies use to evaluate supply chain practices.
Data flows into the system from four distinct source types, each with its own challenges:
- Field surveys submitted by assessors through a mobile data collection app (ODK Central), often delayed by days or weeks of poor connectivity in remote rural areas
- Sentinel-2 satellite imagery covering the assessed regions
- Partner submissions in formats ranging from structured CSVs and Excel files to scanned PDF tables
- Government and FAO statistical datasets with no API access
The existing process: The analytics team manually downloaded data from each source, cleaned it in Excel and R, reformatted to a common structure, resolved identifier mismatches (each partner had different farmer ID schemes), and assembled the combined dataset. This took approximately five weeks per reporting cycle — during which the team was unavailable for actual analysis. Manual cleaning also introduced subtle inconsistencies discovered only at the reporting stage, leading to correction cycles that further delayed outputs.


Our Solution
Data Ingestion Layer
We built seven connectors — one per data source — each handling the specific authentication, format, and error-recovery requirements of its source:
ODK Central API connector: Pulls completed survey submissions via the ODK Central API stack (REST/OpenRosa/OData as needed). Handles the sync-delay problem by using the submission's metadata timestamp (when the assessor recorded it) rather than the server receipt timestamp for ordering. Processes in batches, with idempotent ingestion — if a sync delivers duplicate submissions (common with intermittent connectivity), the pipeline deduplicates by submission UUID without creating duplicate records.
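The dedupe-on-replay behavior can be sketched in a few lines of plain Python. The `instanceID` and `recorded_at` field names and the in-memory `seen_uuids` store are illustrative stand-ins, not the exact production structures:

```python
def ingest_batch(submissions, seen_uuids, store):
    """Idempotent ingest: skip any submission whose UUID was already processed.

    `submissions` is a list of dicts with an 'instanceID' UUID and a
    'recorded_at' timestamp (the assessor-side metadata timestamp, used
    for ordering instead of the server receipt time).
    """
    fresh = [s for s in submissions if s["instanceID"] not in seen_uuids]
    # Order by when the assessor recorded the submission, not when the
    # server received it -- syncs can arrive days or weeks late.
    fresh.sort(key=lambda s: s["recorded_at"])
    for s in fresh:
        store.append(s)
        seen_uuids.add(s["instanceID"])
    return len(fresh)
```

Replaying the same batch a second time ingests nothing, which is what makes multi-sync delivery from intermittent connections safe.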
Sentinel-2 connector: Pulls imagery via the Copernicus Data Space Ecosystem (CDSE) platform APIs for defined geographic areas of interest. Downloads are scheduled weekly, with automatic retry on failed tile downloads. Raw imagery is stored in S3 before processing.
Partner CSV/Excel connectors (3 separate connectors): Each partner's connector handles their specific column mapping and format quirks. We built a schema-mapping configuration layer so that when Partner C changed their column names (which happened twice during the engagement), the fix was a config update — not a code change. For the partner submitting scanned PDFs, we built a semi-automated extraction pipeline using Tabula (Java-based, invoked from Python) for table extraction and a validation step that flags rows with low extraction confidence for human review.
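The schema-mapping layer reduces to a per-partner dictionary that the pipeline applies mechanically; the partner key and column names below are hypothetical, and in production the mapping would live in a versioned config file rather than inline:

```python
# Hypothetical per-partner column mapping; a partner renaming a column
# means editing this config, not the pipeline code.
PARTNER_SCHEMAS = {
    "partner_c": {
        "Farmer Name": "farmer_name",
        "Plot Size (ha)": "plot_size_ha",
        "Yield kg": "yield_kg",
    },
}

def standardize_row(partner_id, raw_row):
    """Map a partner's raw column names onto the internal schema.

    Columns not present in the mapping are dropped.
    """
    mapping = PARTNER_SCHEMAS[partner_id]
    return {mapping[col]: value for col, value in raw_row.items() if col in mapping}
```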
Government/FAO data connectors (2 connectors): Scheduled scrapers for national agricultural databases and FAO statistical downloads. These sources expose no APIs, so the connectors automate the file downloads and parse the results. Update checks run weekly; data is re-ingested only when the source file has changed (detected via hash comparison).
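The change check is a straightforward hash comparison; a minimal sketch using SHA-256 from the standard library:

```python
import hashlib

def file_changed(path, last_hash):
    """Return (changed, new_hash).

    Re-ingest only when the source file's SHA-256 digest differs from
    the hash recorded at the previous weekly check.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large statistical downloads don't
        # have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    return digest != last_hash, digest
```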
All connectors feed into a staging area in S3, organized by source, date, and batch ID. An Airflow DAG orchestrates the ingestion schedule, with per-connector retry logic, alerting on consecutive failures, and a dead-letter queue for submissions that fail validation repeatedly.
Transformation and Validation Pipeline
Raw data from the staging area passes through a four-stage transformation pipeline orchestrated by Airflow:
Stage 1 — Format standardization: Each source's raw data is transformed to a common internal schema. Column names are mapped, data types are cast (dates, numerics, categoricals), and encoding issues are resolved (a recurring problem with partner Excel files containing mixed-encoding text from local languages).
Stage 2 — Geographic normalization: Farm and plot locations arrive in different formats — some as GPS coordinates (WGS84), some as village/district names, some as relative descriptions ("3km north of [village]"). The pipeline geocodes named locations using a curated gazetteer of farming communities built during the engagement, and all coordinates are projected to a common CRS (EPSG:4326) for consistency with the satellite imagery layer.
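The coordinates-first, gazetteer-fallback resolution can be sketched as follows; the gazetteer entry is illustrative, not a real record from the project's curated gazetteer:

```python
# Illustrative gazetteer entries: named community -> (lat, lon) in WGS84
# (EPSG:4326), the common CRS shared with the imagery layer.
GAZETTEER = {
    ("burundi", "example_village"): (-3.4271, 29.9246),
}

def resolve_location(record):
    """Return (lat, lon) for a record.

    Prefer submitted GPS coordinates; fall back to a gazetteer lookup of
    the named community. A None result is routed to manual review.
    """
    if record.get("lat") is not None and record.get("lon") is not None:
        return (record["lat"], record["lon"])
    key = (record["country"].lower(), record["village"].lower())
    return GAZETTEER.get(key)
```

Relative descriptions ("3km north of [village]") are the hard residual case; in practice those also land in the manual review path.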
Stage 3 — Domain-specific validation: This is where agricultural domain knowledge is encoded into the pipeline. Validation rules flag records where, for example, reported coffee yield per hectare exceeds plausible ranges for the region, where a farmer's reported plot size contradicts the satellite-derived plot boundary by more than 30%, or where assessment dates fall outside the crop season calendar for that country. Flagged records are not discarded — they're routed to a review queue where the analytics team can inspect and either correct or approve them. This was a deliberate design decision: in agricultural data, "outliers" are sometimes the most important data points (a farmer reporting unusually high yield may be using an innovative practice worth studying).
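Two of these rules look roughly like the sketch below; the yield bounds are illustrative placeholders, since the real ranges were set per country and crop with the research team:

```python
# Illustrative plausible-yield ranges in kg/ha; production bounds were
# co-designed with agronomists per country and crop type.
YIELD_RANGES = {("kenya", "coffee"): (100, 3000)}

def validate_record(rec):
    """Return a list of flags. Flagged records are never discarded --
    they are routed to the human review queue."""
    flags = []
    lo, hi = YIELD_RANGES.get((rec["country"], rec["crop"]), (0, float("inf")))
    if not lo <= rec["yield_kg_ha"] <= hi:
        flags.append("yield_out_of_range")
    # Reported plot size vs satellite-derived boundary: flag when they
    # disagree by more than 30%.
    sat = rec.get("satellite_plot_ha")
    if sat and abs(rec["plot_size_ha"] - sat) / sat > 0.30:
        flags.append("plot_size_mismatch")
    return flags
```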
Stage 4 — Record linking: The pipeline links records across sources using a composite key of farmer ID, plot ID, and assessment year. Where farmer IDs don't match across partners (each partner had their own ID scheme), we built a fuzzy matching layer using farmer name, village, and approximate GPS location to suggest matches, with human confirmation required before linking. Over the first two reporting cycles, the confirmed matches were fed back into the system to improve match confidence for subsequent cycles.
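The scoring idea behind the fuzzy matcher can be sketched with stdlib `difflib` for name similarity; the weights, distance scale, and threshold below are illustrative, not the production values:

```python
from difflib import SequenceMatcher

def match_score(a, b):
    """Composite match confidence from farmer-name similarity, village
    equality, and rough GPS proximity. Weights are illustrative."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    village_sim = 1.0 if a["village"] == b["village"] else 0.0
    # Rough proximity in degrees; ~0.01 deg is about 1 km near the equator.
    d = ((a["lat"] - b["lat"]) ** 2 + (a["lon"] - b["lon"]) ** 2) ** 0.5
    gps_sim = max(0.0, 1.0 - d / 0.05)
    return 0.5 * name_sim + 0.2 * village_sim + 0.3 * gps_sim

def suggest_matches(record, candidates, threshold=0.8):
    """Return (score, candidate) pairs above the threshold, best first.
    Suggestions still require human confirmation before linking."""
    scored = [(match_score(record, c), c) for c in candidates]
    return sorted([(s, c) for s, c in scored if s >= threshold],
                  key=lambda t: t[0], reverse=True)
```

Confirmed matches feeding back into the scorer is what let match confidence improve across reporting cycles.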
Geospatial Data Processing
The geospatial processing component runs as a separate Airflow DAG on dedicated EC2 instances (c5.2xlarge) — sized for the compute-intensive imagery workload and kept separate from the tabular data pipeline to avoid resource contention.
The processing flow for each geographic area of interest:
1. Sentinel-2 tiles covering the area are downloaded from CDSE platform APIs. Cloud masking is applied using the Scene Classification Layer (SCL) map to exclude cloud-covered pixels — a critical step in tropical regions where cloud cover can obscure 50-90% of optical images during rainy seasons.
2. Vegetation indices are computed from the cloud-free imagery — primarily NDVI (Normalized Difference Vegetation Index) for crop health assessment and EVI (Enhanced Vegetation Index) for areas with dense canopy cover where NDVI saturates.
3. Land-use classification is run using a supervised classification model trained on labeled ground-truth data from the first two assessment cycles. The model distinguishes between active cropland, fallow land, forest cover, and non-agricultural land use. Classification accuracy was validated at approximately 82% against a held-out test set of manually labeled plots.
4. Change detection compares current-cycle imagery against the previous cycle to identify deforestation, land-use conversion, or significant changes in vegetation health — indicators that are increasingly relevant for sustainability certification and EUDR compliance reporting.
5. Per-plot statistics are extracted by overlaying the farmer plot boundaries (from GPS coordinates in the survey data) onto the processed imagery, computing mean and variance of vegetation indices within each plot boundary. These per-plot statistics are then joined with the tabular assessment data in the analytics warehouse.
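The band math in step 2 and the zonal statistics in step 5 come down to simple arithmetic. In production this runs over Rasterio arrays; the pure-Python sketch below shows the same computation on flat lists of pixel values, with `None` standing in for cloud-masked no-data pixels:

```python
def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red) per pixel, masking pixels where
    the denominator is zero (e.g. cloud-masked no-data pixels)."""
    out = []
    for n, r in zip(nir, red):
        out.append(None if (n + r) == 0 else (n - r) / (n + r))
    return out

def plot_stats(values):
    """Mean and variance of an index within one plot boundary,
    ignoring masked (None) pixels -- the per-plot statistics that get
    joined with the tabular assessment data."""
    vals = [v for v in values if v is not None]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var
```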
The full geospatial processing run for all six countries takes approximately 8 hours. Before automation, the equivalent manual processing by a GIS analyst took roughly 18 working days per reporting cycle.
Analytics-Ready Data Store
Clean, validated, linked datasets are delivered into a PostgreSQL 15 database with the PostGIS extension for geospatial queries. The schema is organized around three core entities: farmers (~8,000 active records), plots (~12,000 mapped), and assessments (~8,000 per cycle), with satellite-derived indicators stored as plot-level attributes joined by plot ID and assessment year.
The analytics team connects directly via SQL clients and R/Python notebooks. We built a data dictionary (maintained as a versioned Markdown document in the project repository) that documents every table, column, data type, source, and transformation applied. This investment in documentation proved essential: within two months of deployment, the analytics team was writing their own queries against the warehouse without engineering support for most reporting needs.
For audit and reproducibility — critical in sustainability certification — the pipeline maintains full data lineage: every record in the warehouse can be traced back to its source submission, with a log of every transformation and validation step applied. This lineage is queryable, so when a certification auditor asks "where did this number come from?", the answer is a database query — not a conversation with an analyst trying to remember what they did in a spreadsheet six months ago.


Technology Stack
Data Pipeline
- Python 3.11 — all pipeline components, chosen because the geospatial ecosystem (GDAL, Rasterio, GeoPandas) and the analytics team's existing skills (R + Python) made it the natural fit. No framework overhead — plain Python modules orchestrated by Airflow.
- Apache Airflow 3.x — DAG-based orchestration for both the tabular and geospatial pipelines. Chosen over Prefect and Dagster because the analytics team had prior Airflow experience, reducing the handover learning curve.
- PostgreSQL 15 with PostGIS 3.x — analytics warehouse with geospatial query support. PostGIS was non-negotiable for the per-plot satellite indicator joins.
- GDAL + Rasterio — satellite imagery processing (tile handling, reprojection, band math). Rasterio for the Python-friendly API; GDAL for the heavy-lifting transformations.
- Tabula — PDF table extraction for the one partner submitting scanned reports.
Infrastructure
- AWS: S3 (raw data staging + processed imagery storage), EC2 (pipeline compute + dedicated geospatial processing instances), RDS (managed PostgreSQL)
- Docker — containerized pipeline deployment for reproducibility across development and production environments
- Alerting — Airflow failure callbacks + Slack notifications for pipeline failures, data quality threshold breaches, and connector errors
Data Quality & Lineage
- Automated validation at every pipeline stage with domain-specific rules (crop yield ranges, plot size plausibility, assessment date seasonality)
- Full lineage tracking: every warehouse record traceable to source submission, transformation log, and validation outcome
- Review queue dashboard (simple Flask app) for the analytics team to inspect and resolve flagged records
Results
Reporting Cycle
From five weeks of manual preparation to automated next-day delivery. The analytics team receives validated datasets within 24 hours of a field data sync.
Geospatial Processing
Dropped from ~18 working days of manual work by a GIS analyst to ~8 hours for the automated full six-country run.
Data Quality
Validation pipeline catches 3–5% of records per cycle as potential quality issues. Before automation, these were discovered — if at all — during analysis, weeks after ingestion.
Analyst Time Reclaimed
~60% of working time previously consumed by data preparation redirected to actual analysis and reporting.
Cross-Regional Analysis
The unified warehouse supports comparing sustainability indicators across all six countries in a single query, which was impossible before.
EUDR Compliance
Deforestation change detection outputs position the organization for EU Deforestation Regulation compliance reporting for commodity buyers.
Adoption
Within two months, the analytics team was writing its own queries without engineering support. Onboarding for new analysts dropped from three weeks to four days.


Key Engineering Lessons
Build validation rules with domain experts. Our initial rules flagged anything outside 2 standard deviations — which would flag every farmer in a drought-affected region. We replaced them with domain-specific rules co-designed with the research team: crop yield ranges per country and crop type, plot size plausibility cross-referenced against satellite boundaries, assessment date validation against local crop season calendars. The result caught genuinely problematic records while preserving legitimate variance.
Idempotency is essential with intermittent connectivity. Field assessors in rural Burundi sometimes sync the same batch multiple times when connectivity drops. Every pipeline stage was designed to be safely re-runnable: connectors deduplicate by submission UUID, transformations are deterministic, warehouse load uses upsert logic keyed on composite identifiers.
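The upsert pattern can be illustrated with SQLite (stdlib `sqlite3`); the production warehouse is PostgreSQL, whose `INSERT ... ON CONFLICT` syntax is analogous, and the table here is a simplified stand-in for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE assessments (
        farmer_id TEXT, plot_id TEXT, year INTEGER,
        yield_kg_ha REAL,
        PRIMARY KEY (farmer_id, plot_id, year)
    )
""")

def load(rows):
    """Safely re-runnable load: a replayed batch overwrites rather than
    duplicates, keyed on the composite (farmer_id, plot_id, year)."""
    conn.executemany("""
        INSERT INTO assessments VALUES (?, ?, ?, ?)
        ON CONFLICT (farmer_id, plot_id, year)
        DO UPDATE SET yield_kg_ha = excluded.yield_kg_ha
    """, rows)
    conn.commit()
```

Because the load is keyed rather than append-only, a sync replayed three times from rural Burundi lands as exactly one row.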
Separate geospatial compute from tabular pipeline. When Sentinel-2 processing ran on shared infrastructure, it consumed enough CPU/memory to stall survey ingestion. We moved geospatial processing to dedicated EC2 instances with a separate Airflow DAG. Marginal cost increase; significant operational stability.
Invest in schema documentation as production code. The versioned data dictionary documenting every table, column, type, source, and transformation was the highest-ROI deliverable. Within two months, the analytics team wrote their own queries. New analyst onboarding dropped from three weeks to four days.
Is Your Analytics Team Spending More Time Cleaning Data Than Analyzing It?
If your reporting cycles are measured in weeks because data arrives from multiple sources in inconsistent formats, or your analysts spend most of their time on preparation rather than insight generation — you're in the same position this organization was. We start with a data source audit: understanding what data you have, where it comes from, how often it changes, and what your analysts actually need. From there, we scope an automated pipeline that eliminates the manual preparation work and gives your team same-day access to clean, validated, analysis-ready data.