Clustering Data Using Compression: A Universal Method for Cross-Domain Analysis

Title: Clustering Data Using Compression: A Universal Method for Cross-Domain Analysis

Research Question: Can a universal method for clustering data be developed that works across different domains without requiring domain-specific features or background knowledge?

Methodology: The researchers proposed a method for clustering data based on compression, built on a similarity measure called the Normalized Compression Distance (NCD). The method does not rely on domain-specific features or background knowledge. It works in two steps:

1. Determine a universal similarity distance: The researchers developed the NCD, which is computed from the lengths of the compressed data files, each compressed on its own and in pairwise concatenation.
2. Apply a hierarchical clustering method: The NCD values are used to build a dendrogram (a binary tree) with a new quartet method, implemented by a fast heuristic.
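
As an illustration of the two steps above, here is a minimal Python sketch. It assumes the usual formulation NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes, and it uses zlib as the compressor. The function names (compressed_len, ncd, cluster) and the sample data are hypothetical. The tree is built with plain average-linkage merging, a simple stand-in for the quartet method, which is not reproduced here.

```python
import zlib
from itertools import combinations


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # the pair compressed in concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def cluster(names, blobs):
    """Return a binary tree (nested tuples) over the named objects.

    Uses average-linkage merging on NCD values; this is a stand-in,
    not the quartet method described in the summary.
    """
    dist = {}
    for i, j in combinations(range(len(blobs)), 2):
        dist[i, j] = dist[j, i] = ncd(blobs[i], blobs[j])

    def linkage(a, b):
        # Mean NCD over all cross-cluster pairs of members.
        pairs = [(i, j) for i in a[1] for j in b[1]]
        return sum(dist[p] for p in pairs) / len(pairs)

    # Each cluster is (subtree, member indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Hypothetical sample data: two near-identical sentences, two digit streams.
en1 = b"the quick brown fox jumps over the lazy dog " * 40
en2 = b"the quick brown fox leaps over the lazy dog " * 40
num1 = b"3141592653589793238462643383279502884197" * 40
num2 = b"2384626433832795028841973141592653589793" * 40

tree = cluster(["en1", "en2", "num1", "num2"], [en1, en2, num1, num2])
print(tree)
```

On this toy input the two sentences should end up in one subtree and the two digit streams in the other, because each pair compresses much better in concatenation than a sentence does with a digit stream. Real compressors only approximate the ideal, so NCD values can fall slightly outside the range 0 to 1.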

Results: The method was implemented and is available as open-source software. To demonstrate its universality and robustness, the researchers applied it to various areas, including genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains. They used different types of compressors, including statistical, dictionary, and block-sorting compressors.
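
To give a feel for how the choice of compressor enters, the sketch below computes the same distance with three compressors from Python's standard library: zlib and lzma (dictionary-based) and bz2 (block sorting). A statistical compressor is not included because the standard library does not provide one. The helper ncd_with and the generated test objects are hypothetical and are not taken from the researchers' software.

```python
import bz2
import lzma
import random
import zlib

# Compressor families available in Python's standard library.
COMPRESSORS = {
    "zlib (dictionary)": lambda d: zlib.compress(d, 9),
    "lzma (dictionary)": lzma.compress,
    "bz2 (block sorting)": lambda d: bz2.compress(d, 9),
}


def ncd_with(compress, x: bytes, y: bytes) -> float:
    """NCD of x and y under the given compress(bytes) -> bytes function."""
    cx, cy, cxy = (len(compress(d)) for d in (x, y, x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical test objects, generated deterministically.
rng = random.Random(0)
vocab = [b"gene", b"virus", b"music", b"digit", b"star", b"novel", b"fox"]
text = b" ".join(rng.choices(vocab, k=2000))
variant = text.replace(b"fox", b"cat", 5)  # same text with up to five edits
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

results = {}
for name, compress in COMPRESSORS.items():
    near = ncd_with(compress, text, variant)  # closely related objects
    far = ncd_with(compress, text, noise)     # unrelated objects
    results[name] = (near, far)
    print(f"{name:22s} near={near:.2f} far={far:.2f}")
```

The exact values differ between compressors, but each one should rank the lightly edited copy as much closer to the original than the random bytes.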

Implications: Because the method needs no domain-specific features or background knowledge, it offers a single approach to clustering that carries over from one domain to another. Its success in such diverse areas suggests that it may have broad applications in data analysis.

Link to Article: https://arxiv.org/abs/0312044v1
Authors:
arXiv ID: 0312044v1