Publication View

New techniques for efficiently discovering frequent patterns (2005)

Abstract
Because of its theoretical and practical importance, the field of frequent pattern mining has been and remain to be one of the most active research area in KDD. In this dissertation, we study three different problems in frequent pattern mining, mining multipledatasets, mining streaming data, and mining large-scale structures from graph datasets. Our study has not only extended the breadth of frequent pattern mining, but also brought new techniques and algorithms into this field. Specifically, our contributions are as follows. 1. Mining Multiple Datasets: We develop a systematic approach to generate efficient query plans for a single mining query across multiple datasets. We also propose methods to simultaneously optimize multiple such queries and utilize the past mining results in a query-intensive KDD environment. Our experimental results have shown a speedup up to two-order of magnitude comparing with the naive methods without these optimizations. 2. Mining Frequent Itemsets over Streaming Data: We propose a new algorithm StreamMining to discover the frequent itemsets over streaming data. In a single pass, StreamMining will guarantee to find a superset of frequent itemsets, but false positive may occur. If the second pass is allowed, StreamMining will be able to remove the false positive and find the exact frequent itemsets. Our detailed evaluation using both synthetic and real datasets has shown our one-pass algorithm is very accurate in practice, and is also very memory efficient. 3. Mining Frequent Large-Scale Structures from Graph Datasets: We develop a new framework to discover the frequent large-scale structures from graph datasets. This framework is derived from a mathematical concept, topological minor. In this framework, we propose a new algorithm TSMiner, which efficiently enumerates all the frequent large-scale structures in a graph dataset, and a new approach called relabeling function to perform constraint mining. We apply our framework to protein structure data and discover meaningful topological structures. Finally, we demonstrate the viability and scalability of the proposed algorithms on both real and synthetic datasets.

Publication details
Download http://rave.ohiolink.edu/etdc/view?acc_num=osu1121795612
Source http://rave.ohiolink.edu/etdc/view?acc_num=osu1121795612
Publisher Ohio State University / OhioLINK
Repository OhioLINK Electronic Thesis and Dissertation Center (United States)
Keywords Frequent Pattern Mining, Graph Mining, Multiple Query Optimization, Mining Data Streams, Knowledgeable Cache
Type text
Language english

Cited publications (31)
Approximating a Collection of Frequent Sets (2004)
Parallel Mining of Association Rules (1999)
Mining Association Rules between Sets of Items in Large Databases (1997)
Detecting Group Differences: Mining Contrast Sets (2001)
On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets (2002)
DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints (2002)
On Monotone Data Mining Languages (2001)
An Overview of Query Optimization in Relational Systems (1970)
Substructure Discovery Using Minimum Description Length and Background Knowledge (1995)
Mining Frequent Patterns in Data Streams at Multiple Time Granularities (2002)