Document Clustering for Digital Forensic Analysis

MSRDG International Journal of Computer Scientific Technology & Electronics Engineering

 

© 2025 by MSRDG IJCSTEE Journal

Volume 1 Issue 3

 

Year of Publication: 2025



Authors: S. Alangaram, K. Ramu


Article ID: MSRDG-IJCSTEE-V1I3P102
Abstract:

The exponential growth of digital evidence in criminal investigations has created a pressing need for automated, scalable methods to organise and interpret large document corpora. This paper presents a hybrid document clustering framework designed for digital forensic workflows. The proposed system integrates BERT-based semantic embeddings with an adaptive k-Means/DBSCAN ensemble, enabling investigators to group electronically stored information (emails, system logs, legal records, and multimedia metadata) into semantically coherent clusters without prior knowledge of category boundaries. We evaluate the framework on four heterogeneous forensic datasets totalling over 36,000 documents. The proposed approach achieves a Silhouette Score of 0.70, a macro F1-Score of 0.85, and a Davies-Bouldin Index of 1.02, outperforming standalone k-Means, DBSCAN, and Hierarchical Agglomerative Clustering (HAC) on all metrics. Scalability experiments confirm near-linear growth in processing time up to 100,000 documents. The findings demonstrate that linguistically informed clustering substantially reduces evidence review time and supports chain-of-custody requirements in forensic investigations.

Keywords: Digital forensics, Document clustering, BERT embeddings, k-Means, DBSCAN, Evidence analysis, Natural language processing
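
The abstract reports two internal cluster-quality measures, the Silhouette Score (higher is better, at most 1) and the Davies-Bouldin Index (lower is better, at least 0). As an aid to reading those figures, the following is a minimal sketch of how both can be computed from a set of vectors and their cluster labels. It is not the authors' implementation: the function names, the Euclidean distance, and the toy 2-D points standing in for BERT embeddings are all assumptions made for illustration.

```python
import math


def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = group_by_label(points, labels)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself does not affect the sum)
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def davies_bouldin_index(points, labels):
    """Mean, over clusters, of the worst scatter-to-separation ratio."""
    clusters = group_by_label(points, labels)
    centroids = {
        key: tuple(sum(coord) / len(members) for coord in zip(*members))
        for key, members in clusters.items()
    }
    # scatter: mean distance from each member to its cluster centroid
    scatter = {
        key: sum(math.dist(p, centroids[key]) for p in members) / len(members)
        for key, members in clusters.items()
    }
    worst_ratios = []
    for i in clusters:
        worst_ratios.append(
            max(
                (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                for j in clusters
                if j != i
            )
        )
    return sum(worst_ratios) / len(worst_ratios)


# Hypothetical toy data: two tight, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]   # matches the visible structure
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]    # deliberately mixes both groups

good_silhouette = silhouette_score(points, good_labels)
bad_silhouette = silhouette_score(points, bad_labels)
good_dbi = davies_bouldin_index(points, good_labels)
bad_dbi = davies_bouldin_index(points, bad_labels)

print("good labelling:", good_silhouette, good_dbi)
print("bad labelling: ", bad_silhouette, bad_dbi)
```

On this toy data the labelling that matches the two groups yields a silhouette above 0.9 and a Davies-Bouldin Index of about 0.1, while the mixed labelling scores worse on both. Real document embeddings are high-dimensional and far less cleanly separated, so values such as those reported in the abstract are not directly comparable to this example.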