David D. Lewis

This (2008)

Alexander Genkin, David D. Lewis, David Madigan

This paper describes an application of Bayesian logistic regression to text categorization...

Training Algorithms for Linear Text Classifiers (2008)

David D. Lewis, Robert E. Schapire, James P. Callan, Ron Papka

Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used...

Data Extraction as Text Categorization: An Experiment With the MUC-3 Corpus (2008)

David D. Lewis

The data extraction systems studied in the MUC-3 evaluation perform a variety of subtasks in filling out templates. Some of these tasks are quite complex, and seem to require a system to represent...

Automatic Classification of Web Queries Using Very Large Unlabeled Query Logs (2008)

Steven M. Beitzel, Eric C. Jensen, David D. Lewis

Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system must route...

Aqsaqal Enterprises (2008)

Aynur Dayanik, Vladimir Menkov, David D. Lewis, Alexander Genkin, David Madigan

Supervised learning approaches to text classification are in practice often required to work with small and unsystematically collected training sets. The alternative is usually viewed as building...

DIMACS at the TREC 2005 Genomics Track (2008)

Aynur Dayanik, Alex Genkin, Paul Kantor, David D. Lewis, David Madigan

This report describes DIMACS work on the text categorization task of the TREC 2005 Genomics track. Our approach to this task was similar to the triage subtask studied in the TREC 2004 Genomics track....

y (2007)

Raj D. Iyer, David D. Lewis, Robert E. Schapire, Yoram Singer, Amit Singhal

RankBoost is a recently proposed algorithm for learning ranking functions. It is simple to implement and has strong justifications from computational learning theory. We describe the algorithm and...

Natural language processing for information retrieval (2007)

David D. Lewis, Karen Sparck Jones

Bayesian Mixed-Effects Models for Recommender Systems (2007)

Michelle Keim Condliff, David D. Lewis, David Madigan

We propose a Bayesian methodology for recommender systems that incorporates user ratings, user features, and item features in a single unified framework. In principle our approach should address the...

susanae stat. rutgers. edu (2007)

Susana Eyheramendy, David D. Lewis, David Madigan

This paper empirically compares the performance of four probabilistic models for text classification: Poisson, Bernoulli, Multinomial and Negative Binomial. We examine the...

Bayesian Multinomial Logistic Regression for Author Identification (2005)

David Madigan, Alexander Genkin, David D. Lewis, Dmitriy Fradkin

Motivated by high-dimensional applications in authorship attribution, we describe a Bayesian multinomial logistic regression model together with an associated learning algorithm.

Author Identification on the Large Scale (2005)

David Madigan, Alexander Genkin, David D. Lewis, Shlomo Argamon, Dmitriy Fradkin, ...

this paper is on techniques for identifying authors in large collections of textual artifacts (e-mails, communiques, transcribed speech, etc.). Our approach focuses on very high-dimensional,...

Bayesian multinomial logistic regression for author identification (2005)

David Madigan, Alexander Genkin, David D. Lewis

Motivated by high-dimensional applications in authorship attribution, we describe a Bayesian multinomial logistic regression model together with an associated learning algorithm.

RCV1: A new benchmark collection for text categorization research (2004)

David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li, G. Dietterich

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on...

DIMACS at the TREC 2004 Genomics Track (2004)

Aynur Dayanik, Dmitriy Fradkin, Alex Genkin, Paul Kantor, David D. Lewis, David Madigan, ...

DIMACS participated in the text categorization and ad hoc retrieval tasks of the TREC 2004 Genomics track. For the categorization task, we tackled the triage and annotation hierarchy subtasks.

Sparse bayesian classifiers for text categorization (2003)

Susana Eyheramendy, Alexander Genkin, Wen-hua Ju, David D. Lewis, David Madigan

This paper empirically compares the performance of different Bayesian models for text categorization. In particular we examine so-called “sparse” Bayesian models that explicitly favor...

Sparse Bayesian Classifiers for Text Categorization (2003)

Alexander Genkin, David D. Lewis, Susana Eyheramendy, Wen-hua Ju

This paper empirically compares the performance of different Bayesian models for text categorization. In particular we examine so-called "sparse" Bayesian models that explicitly favor...

Applying Support Vector Machines to the TREC-2001 Batch Filtering and Routing Tasks (2001)

David D. Lewis

this paper. Here's the history: 1. Avi Arampatzis wrote (15-August-2001) to the TREC filtering mailing list, worrying that using only the top 1000 docs in the routing evaluation wouldn't be...

Bayesian mixed-effects models for recommender systems (1999)

Michelle Keim Condliff, David D. Lewis, David Madigan

We propose a Bayesian methodology for recommender systems that incorporates user ratings, user features, and item features in a single unified framework. In principle our approach should address the...

AT&T at TREC-7 (1999)

Amit Singhal, John Choi, Donald Hindle, David D. Lewis, Fernando Pereira

This year AT&T participated in the ad-hoc task and the Filtering, SDR, and VLC tracks. Most of our effort for TREC-7 was concentrated on SDR and VLC tracks. On the filtering track, we tested a...

Naive (Bayes) at forty: The independence assumption in information retrieval (1998)

David D. Lewis

The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive...

The TREC-5 Filtering Track (1997)

David D. Lewis

The TREC-5 filtering track, an evaluation of binary text classification systems, was a repeat of the filtering evaluation run in a trial version for TREC-4, with only the data set and participants...

Approximating Matrix Multiplication for Pattern Recognition Tasks (1997)

Edith Cohen, David D. Lewis

Many pattern recognition tasks, including estimation, classification, and the finding of similar objects, make use of linear models. The fundamental operation in such tasks is the computation of the...

Bayesian Information Retrieval: Preliminary Evaluation (1997)

Michelle Keim, David D. Lewis, David Madigan

Given a database of documents and a user's query, how can we locate those documents that meet the user's information needs? Because there is no precise definition of which documents in the...

The TREC-4 Filtering Track (1997)

David D. Lewis

The TREC-4 filtering track was an experiment in the evaluation of binary text classification systems. In contrast to ranking systems, binary text classification systems may need to produce result...

Information retrieval and the statistics of large data sets (1996)

David D. Lewis

Providing content-based access to large quantities of text is a difficult task, given our poor understanding of the formal semantics of human language. The most successful approaches to retrieval,...

Natural Language Processing for Information Retrieval (1996)

David D. Lewis, Karen Sparck Jones

The paper summarizes the essential properties of document retrieval and reviews both conventional practice and research findings, the latter suggesting that simple statistical techniques can be...

Training Algorithms for Linear Text Classifiers (1996)

David D. Lewis, Robert E. Schapire, James P. Callan, Ron Papka

Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used...

Evaluating and optimizing autonomous text classification systems (1995)

David D. Lewis

Text retrieval systems typically produce a ranking of documents and let a user decide how far down that ranking to go. In contrast, programs that filter text streams, software that categorizes...

Active by Accident: Relevance Feedback in Information Retrieval (1995)

David D. Lewis

Relevance feedback is a supervised learning method used to improve the effectiveness of information retrieval systems. It is an active learning technique, and may be the most widespread application...

A Sequential Algorithm for Training Text Classifiers (1994)

David D. Lewis, William A. Gale

The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully...

Fax: An Alternative to SGML (1994)

Kenneth W. Church, William A. Gale, Jonathan I. Helfman, David D. Lewis

We have argued elsewhere (Church and Mercer, 1993) that text is more available than ever before, and that the availability of massive quantities of data has been responsible for much of the recent...

A sequential algorithm for training text classifiers (1994)

David D. Lewis

At ACM SIGIR '94, I compared the effectiveness of uncertainty sampling with that of random sampling and relevance sampling in choosing training data for a text categorization data set [1]....

A Comparison of Two Learning Algorithms for Text Categorization (1994)

David D. Lewis, Marc Ringuette

This paper examines the use of inductive learning to categorize natural language documents into predefined content categories. Categorization of text is of increasing importance in information...

Heterogeneous Uncertainty Sampling for Supervised Learning (1994)

David D. Lewis, Jason Catlett

Uncertainty sampling methods iteratively request class labels for training instances whose classes are uncertain despite the previous labeled instances. These methods can greatly reduce the number of...

Feature Selection and Feature Extraction for Text Categorization (1992)

David D. Lewis

The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets. Good categorization...

Learning in Intelligent Information Retrieval (1991)

David D. Lewis

Information retrieval (IR) systems are used for finding, within a large text database, those documents containing information needed by a user. The complex and poorly understood semantics of...

Evaluating Text Categorization (1991)

David D. Lewis

While certain standard procedures are widely used for evaluating text retrieval systems and algorithms, the same is not true for text categorization. Omission of important data from reports is common...

Term clustering of syntactic phrases (1990)

David D. Lewis, W. Bruce Croft

Term clustering and syntactic phrase formation are methods for transforming natural language text. Both have had only mixed success as strategies for improving the quality of text representations for...

An approach to natural language processing for document retrieval (1987)

W. Bruce Croft, David D. Lewis

Document retrieval systems have been restricted, by the nature of the task, to techniques that can be used with large numbers of documents and broad domains. The most effective techniques that have...