Clustering Data Using Compression: A Universal Method for Cross-Domain Data Analysis
Abstract: This research presents a new method for clustering data based on compression. The method does not rely on specific features or background knowledge and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area and works across application area boundaries. We propose precise notions of a similarity metric and a normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) using a new quartet method and a fast heuristic. The method is implemented and available as public software, and is robust under the choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block-sorting compressors. In genomics, we present new evidence for major questions in Mammalian evolution, based on whole mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
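The abstract describes the NCD only in words. For reference, the distance is usually written as below, where C(x) is the compressed length of x under the chosen compressor and xy is the concatenation of x and y. The symbols C, x, and y are notation assumed here, since this summary does not define them.

```latex
\[
\mathrm{NCD}(x, y) \;=\; \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
\]
```

With a real compressor the value lies roughly between 0 (very similar objects) and slightly above 1 (unrelated objects).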
Research Question: Can a universal method for clustering data be developed that works across different application areas without relying on specific features or background knowledge?
Methodology: The method has two stages. First, a universal similarity distance, the normalized compression distance (NCD), is computed from the lengths of the compressed data files, taken singly and in pairwise concatenation. Second, a hierarchical clustering step extracts a dendrogram (binary tree) from the resulting distance matrix, using a new quartet method and a fast heuristic. The NCD is universal in that it is not restricted to a specific application area, and the method is robust under the choice of compressor: statistical, dictionary, and block-sorting compressors were all used.
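The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: `zlib` stands in for the compressor, all function names are hypothetical, and plain average-linkage agglomerative clustering is swapped in for the paper's quartet method, which is considerably more involved.

```python
import hashlib
import zlib
from typing import Callable, List, Sequence

Compressor = Callable[[bytes], bytes]


def ncd(x: bytes, y: bytes, compress: Compressor = zlib.compress) -> float:
    """Normalized compression distance from three compressed lengths."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: Sequence[bytes],
                    compress: Compressor = zlib.compress) -> List[List[float]]:
    """Symmetric matrix of pairwise NCD values with a zero diagonal."""
    n = len(items)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = ncd(items[i], items[j], compress)
    return dist


def average_linkage(labels: Sequence[str], dist: List[List[float]]):
    """Build a binary tree (nested tuples) by repeatedly merging the two
    clusters with the smallest average pairwise distance.
    Assumption: a simple stand-in for the quartet method, not a copy of it."""
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for p in range(len(ids)):
            for q in range(p + 1, len(ids)):
                members_a = clusters[ids[p]][1]
                members_b = clusters[ids[q]][1]
                avg = sum(dist[i][j] for i in members_a for j in members_b)
                avg /= len(members_a) * len(members_b)
                if best is None or avg < best[0]:
                    best = (avg, ids[p], ids[q])
        _, a, b = best
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    # Toy inputs: two near-identical texts and one block of pseudo-random bytes.
    text_a = b"the quick brown fox jumps over the lazy dog. " * 20
    text_b = b"the quick brown fox leaps over the lazy cat. " * 20
    noise = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(30))
    matrix = distance_matrix([text_a, text_b, noise])
    print(average_linkage(["text_a", "text_b", "noise"], matrix))
```

Passing `bz2.compress` or `lzma.compress` as the `compress` argument is one way to try other compressor families on the same inputs.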
Results: The research presents evidence of successful application in diverse areas such as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains. In genomics, the research provides new evidence for major questions in Mammalian evolution based on whole mitochondrial genomic analysis.
Implications: Because the method needs no domain-specific features or background knowledge, the same procedure can be applied unchanged across application areas, and even to collections that mix objects from completely different domains. Its robustness under the choice of compressor suggests that results do not depend on one particular compression tool. This makes it possible to cluster and analyze data from diverse sources within a single framework.
Link to Article: https://arxiv.org/abs/0312044v2
Authors:
arXiv ID: 0312044v2