| Document Clustering for Digital Forensic Analysis | |
|
MSRDG International Journal of Computer Scientific Technology & Electronics Engineering
© 2025 by MSRDG IJCSTEE Journal Volume 1 Issue 3
Year of Publication: 2025 |
Article ID: MSRDG-IJCSTEE-V1I3P102 |
|
Abstract: The exponential growth of digital evidence in criminal investigations has created a pressing need for automated, scalable methods to organise and interpret large document corpora. This paper presents a hybrid document clustering framework designed specifically for digital forensic workflows. The proposed system integrates BERT-based semantic embeddings with an adaptive k-Means/DBSCAN ensemble, enabling investigators to group electronically stored information (emails, system logs, legal records, and multimedia metadata) into semantically coherent clusters without prior knowledge of category boundaries. We evaluate the framework on four heterogeneous forensic datasets totalling over 36,000 documents. The proposed approach achieves a Silhouette Score of 0.70, a macro F1-Score of 0.85, and a Davies-Bouldin Index of 1.02, outperforming standalone k-Means, DBSCAN, and Hierarchical Agglomerative Clustering (HAC) on all three metrics. Scalability experiments confirm near-linear growth in processing time up to 100,000 documents. The findings demonstrate that linguistically informed clustering substantially reduces evidence review time and supports chain-of-custody requirements in forensic investigations. |
|
| Keywords: Digital forensics, Document clustering, BERT embeddings, k-Means, DBSCAN, Evidence analysis, Natural language processing | |
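
The abstract reports cluster quality as a Silhouette Score (higher is better) and a Davies-Bouldin Index (lower is better). As a rough illustration of what these two numbers measure, the following minimal sketch computes both from their standard definitions using only the Python standard library. It is not the authors' evaluation code: the function names `silhouette_score` and `davies_bouldin_index`, the helper `group_indices`, and the toy 2-D points (standing in for high-dimensional document embeddings) are all assumptions made for this example.

```python
import math
from collections import defaultdict


def group_indices(labels):
    """Map each cluster label to the indices of the points assigned to it."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    return clusters


def silhouette_score(points, labels):
    """Mean silhouette coefficient in [-1, 1]; higher means better-separated clusters."""
    clusters = group_indices(labels)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # common convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin_index(points, labels):
    """Mean worst-case ratio of within-cluster scatter to centroid separation; lower is better.

    Assumes no two clusters share the same centroid (that would divide by zero).
    """
    clusters = group_indices(labels)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    centroids, scatter = {}, {}
    for label, members in clusters.items():
        coords = [points[i] for i in members]
        centroid = tuple(sum(axis) / len(coords) for axis in zip(*coords))
        centroids[label] = centroid
        # scatter: mean distance from each member to its cluster centroid
        scatter[label] = sum(math.dist(p, centroid) for p in coords) / len(coords)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


# Toy data (hypothetical): two compact, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]   # labels that match the two groups
mixed_labels = [0, 1, 0, 1, 0, 1, 0, 1]  # labels that cut across both groups

for name, labels in (("good", good_labels), ("mixed", mixed_labels)):
    print(name, silhouette_score(points, labels), davies_bouldin_index(points, labels))
```

On this toy data the labelling that matches the two groups yields a silhouette close to 1 and a Davies-Bouldin value close to 0, while the labelling that cuts across the groups scores worse on both. In practice the same definitions apply to embedding vectors of any dimension, and a maintained library implementation would normally be preferred over hand-rolled code.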
