Content-Aware Partial Compression for Textual Big Data Analysis in Hadoop

MSRDG International Journal of Computer Scientific Technology & Electronics Engineering

 

© 2025 by MSRDG IJCSTEE Journal

Volume 1 Issue 5

Year of Publication: 2025



Authors: Ritesh Kumar, P. Balu, D. Mani, B. Lavanya


Article ID
MSRDG-IJCSTEE-V1I5P105
Abstract:

The exponential growth of unstructured textual data in distributed computing environments has intensified the need for intelligent, context-sensitive compression strategies. Conventional approaches apply a single compression algorithm uniformly across an entire dataset, failing to exploit the heterogeneous semantic density inherent in natural language corpora. This paper proposes Content-Aware Partial Compression (CAPC), a novel framework integrated with the Apache Hadoop ecosystem that performs fine-grained, block-level compression by first classifying each text block according to its semantic density using Term Frequency–Inverse Document Frequency (TF-IDF) scoring and Shannon entropy. High-density blocks are compressed with LZ4 or Snappy, medium-density blocks with Deflate, and low-density blocks bypass compression to preserve MapReduce processing speed. Experiments conducted on five diverse textual datasets ranging from 500 MB to 3 GB demonstrate that CAPC achieves an average compression ratio of 3.31, a 13.6% improvement over full-dataset Gzip compression and a 22.5% improvement over full-dataset Snappy, and reduces MapReduce job execution time by up to 35.2% relative to uncompressed baselines. The framework scales efficiently from 2 to 32 nodes with near-linear throughput gains, validating its suitability for production-scale textual big data analysis pipelines.
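To make the tiering policy concrete, below is a minimal, dependency-free Java sketch of the entropy side of the block classifier. The thresholds HIGH_T and LOW_T, the CapcSketch class, and the use of java.util.zip.Deflater as a stand-in for Hadoop's Lz4Codec and SnappyCodec are illustrative assumptions rather than values or APIs from the paper, and the TF-IDF component of the density score is omitted here.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CapcSketch {

    // Hypothetical entropy thresholds (bits/byte) separating the three tiers;
    // the paper does not publish its cutoff values in the abstract.
    static final double HIGH_T = 4.5;  // above this: high-density tier
    static final double LOW_T  = 2.0;  // below this: low-density tier, bypass

    /** Shannon entropy of a byte block, in bits per byte. */
    static double entropy(byte[] block) {
        if (block.length == 0) return 0.0;
        int[] freq = new int[256];
        for (byte b : block) freq[b & 0xFF]++;
        double h = 0.0;
        for (int f : freq) {
            if (f == 0) continue;
            double p = (double) f / block.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Route one block to a compression tier based on its entropy score. */
    static byte[] process(byte[] block) {
        double h = entropy(block);
        if (h < LOW_T) {
            return block;  // low density: bypass compression entirely
        } else if (h < HIGH_T) {
            return deflate(block, Deflater.BEST_COMPRESSION);  // medium: Deflate
        } else {
            // High tier: the paper uses LZ4/Snappy; Deflater at BEST_SPEED
            // stands in here only to keep the sketch dependency-free.
            return deflate(block, Deflater.BEST_SPEED);
        }
    }

    static byte[] deflate(byte[] in, int level) {
        Deflater d = new Deflater(level);
        d.setInput(in);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        byte[] buf = new byte[4096];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] block = "the quick brown fox jumps over the lazy dog"
                .repeat(100).getBytes(StandardCharsets.UTF_8);
        System.out.printf("entropy=%.2f bits/byte, %d -> %d bytes%n",
                entropy(block), block.length, process(block).length);
    }
}

Routing low-density blocks around the codec entirely, as in the bypass branch above, is what preserves MapReduce processing speed: mappers read those splits with no decompression cost at all.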

Keywords: Big data compression · Hadoop MapReduce · Content-aware compression · TF-IDF · Shannon entropy · HDFS · Partial compression · Text analytics