Content-Aware Partial Compression for Textual Big Data Analysis in Hadoop

MSRDG International Journal of Computer Scientific Technology & Electronics Engineering

 

© 2025 by MSRDG IJCSTEE Journal

Volume 1 Issue 5

Year of Publication: 2025



Authors: Ritesh Kumar, P. Balu, D. Mani, B. Lavanya


Article ID
MSRDG-IJCSTEE-V1I5P105
Abstract:

The exponential growth of unstructured textual data in distributed computing environments has intensified the need for intelligent, context-sensitive compression strategies. Conventional approaches apply a single compression algorithm uniformly across an entire dataset, failing to exploit the heterogeneous semantic density inherent in natural language corpora. This paper proposes Content-Aware Partial Compression (CAPC), a novel framework integrated with the Apache Hadoop ecosystem that performs fine-grained, block-level compression by first classifying each text block according to its semantic density using Term Frequency–Inverse Document Frequency (TF-IDF) scoring and Shannon entropy. High-density blocks are compressed with LZ4 or Snappy, medium-density blocks with Deflate, and low-density blocks bypass compression to preserve MapReduce processing speed. Experiments conducted on five diverse textual datasets ranging from 500 MB to 3 GB demonstrate that CAPC achieves an average compression ratio of 3.31, a 13.6% improvement over full-dataset Gzip compression and a 22.5% improvement over full-dataset Snappy, and reduces MapReduce job execution time by up to 35.2% relative to uncompressed baselines. The framework scales efficiently from 2 to 32 nodes with near-linear throughput gains, validating its suitability for production-scale textual big data analysis pipelines.
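To make the tiering policy concrete, below is a minimal, dependency-free Java sketch of the entropy side of the block classifier. The thresholds HIGH_T and LOW_T, the CapcSketch class, and the use of java.util.zip.Deflater as a stand-in for Hadoop's Lz4Codec and SnappyCodec are illustrative assumptions rather than values or APIs from the paper, and the TF-IDF component of the density score is omitted here.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CapcSketch {

    // Hypothetical entropy thresholds (bits/byte) separating the three tiers;
    // the paper does not publish its cutoff values in the abstract.
    static final double HIGH_T = 4.5;  // above this: high-density tier
    static final double LOW_T  = 2.0;  // below this: low-density tier, bypass

    /** Shannon entropy of a byte block, in bits per byte. */
    static double entropy(byte[] block) {
        if (block.length == 0) return 0.0;
        int[] freq = new int[256];
        for (byte b : block) freq[b & 0xFF]++;
        double h = 0.0;
        for (int f : freq) {
            if (f == 0) continue;
            double p = (double) f / block.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Route one block to a compression tier based on its entropy score. */
    static byte[] process(byte[] block) {
        double h = entropy(block);
        if (h < LOW_T) {
            return block;  // low density: bypass compression entirely
        } else if (h < HIGH_T) {
            return deflate(block, Deflater.BEST_COMPRESSION);  // medium: Deflate
        } else {
            // High tier: the paper uses LZ4/Snappy; Deflater at BEST_SPEED
            // stands in here only to keep the sketch dependency-free.
            return deflate(block, Deflater.BEST_SPEED);
        }
    }

    static byte[] deflate(byte[] in, int level) {
        Deflater d = new Deflater(level);
        d.setInput(in);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        byte[] buf = new byte[4096];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] block = "the quick brown fox jumps over the lazy dog"
                .repeat(100).getBytes(StandardCharsets.UTF_8);
        System.out.printf("entropy=%.2f bits/byte, %d -> %d bytes%n",
                entropy(block), block.length, process(block).length);
    }
}

Routing low-density blocks around the codec entirely, as in the bypass branch above, is what preserves MapReduce processing speed: mappers read those splits with no decompression cost at all.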

Keywords: Big data compression · Hadoop MapReduce · Content-aware compression · TF-IDF · Shannon entropy · HDFS · Partial compression · Text analytics