A Survey on Geographically Distributed Big-Data Processing using MapReduce

MSRDG International Journal of Computer Scientific Technology & Electronics Engineering

 

© 2025 by MSRDG IJCSTEE Journal

Volume 1 Issue 5

Year of Publication: 2025



Authors: A. Saravana Kumar, Saghul Hameed, Ferose Khan


Article ID: MSRDG-IJCSTEE-V1I5P103

Abstract:

The exponential growth of data generated across geographically separated organizational units has made geo-distributed big-data processing an increasingly critical research area. MapReduce, together with its widely adopted open-source implementation Apache Hadoop, remains the dominant paradigm for large-scale parallel data processing. However, its original design assumes a tightly coupled, single data-centre cluster, making it ill-suited for workloads where source data resides across multiple geographically distributed data centres connected by wide-area networks (WANs). Naively executing MapReduce jobs in such environments incurs prohibitive WAN transfer costs, unpredictable latency, and poor data locality, resulting in substantially lower throughput and longer job completion times. This survey systematically examines the landscape of techniques, frameworks, and algorithms proposed to adapt MapReduce for geo-distributed deployments. We categorise the literature along five key dimensions: data locality optimisation, cost-aware task scheduling, WAN-bandwidth management, fault tolerance in multi-DC environments, and consistency models under partial network failures. We further consolidate comparative experimental results reported across eighteen primary studies, identify recurring performance trade-offs, and highlight open research challenges. Our analysis indicates that adaptive, topology-aware schedulers and intermediate-data compression can reduce WAN traffic by up to 60% relative to vanilla Hadoop, while proactive replication and speculative execution strategies are essential for meeting latency SLOs in production geo-distributed clusters.

Keywords: MapReduce · Geo-distributed computing · Big data · Apache Hadoop · WAN-aware scheduling · Data locality · Fault tolerance · Cloud computing