| High-dimensional Proximity Joins (2007) | |||||||||||||||
Abstract | |||||||||||||||
| Many emerging data mining applications require a proximity (similarity) join between points in a high-dimensional domain. We present a new algorithm that utilizes a new data structure, called the ffl-kd tree, for fast spatial proximity joins on high-dimensional points. This data structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence the proposed data structure scales to high-dimensional data. We analyze the cost of the join for the ffl-kd tree and the R-tree family, and show that the ffl-kd tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life datasets, shows that proximity join using the ffl-kd tree is typically 2 to 40 times faster than the R tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ffl-kd tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for highdimensional proximity joins, but do not match the performance of the ffl-kd tree. 1 | |||||||||||||||
Publication details | |||||||||||||||
| |||||||||||||||