Publication View

Record Linkage as DNA Sequence Alignment Problem (2009)

Abstract
Since modern database applications increasingly need to deal with dirty data due to a variety of reasons (e.g., data entry errors, heterogeneous formats, and ambiguous terms), considerable recent efforts have focused on the (record) linkage problem to determine if two entities represented as relational records are approximately the same or not. In this paper, we propose a novel idea of using the popular gene sequence alignment algorithm in Biology – BLAST. Our proposal, termed as the BLASTed linkage, is based on the observations that: (1) both problems are variants of approximate pattern matching, (2) BLAST provides the statistical guarantee of search results in a scalable manner – a greatly lacking feature in many linkage solutions, and (3) by transforming the record linkage problem into the gene sequence alignment problem, one can leverage on a wealth of advanced algorithms, implementations, and tools that have been actively developed for BLAST during the last decade. In translating English alphabets to DNA sequences of A, C, G, and T, we study four variations: (1) default – each alphabet is mapped to nucleotides under 1, 2, and 4-bit coding schemes, (2) weighted – tokens are elongated or shortened proportional to their importance, making important tokens longer in the resultant DNA sequences, (3) hybrid – each token’s lexical meaning as well as its importance are considered at the same time during translation, and (4) multi-bit – tokens are selected for any of 1, 2, and 4-bit coding schemes based on the cumulative distribution functions of their importance. The efficacy of our proposed BLASTed linkage schemes are experimentally validated using both real and synthetic data sets. 1.

Publication details
Download http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.143.7730
Source http://www.vldb.org/conf/2008/workshops/WProc_qdbmud/linkage2.pdf
Contributors CiteSeerX
Repository CiteSeerX - Scientific Literature Digital Library and Search Engine (United States)
Type text
Language English
Relation 10.1.1.23.7142, 10.1.1.55.2700, 10.1.1.112.8784, 10.1.1.39.4336, 10.1.1.1.5288, 10.1.1.15.8644, 10.1.1.89.231, 10.1.1.13.9086, 10.1.1.10.8160, 10.1.1.78.6751, 10.1.1.96.8496