Foster Provost

Submitted to Management Science manuscript MS-00665-2006 Active Feature-Value Acquisition (2009)

Maytal Saar-tsechansky, Prem Melville, Foster Provost

Most induction algorithms for building predictive models take as input training data in the form of feature vectors. Acquiring the values of features may be costly, and simply acquiring all values...

Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers (2009)

Victor S. Sheng, Foster Provost, Panagiotis G. Ipeirotis

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus...

Social network collaborative filtering (2009)

Rong Zheng, Dennis Wilkinson, Foster Provost

This paper demonstrates that "social network collaborative filtering" (SNCF), wherein user-selected like-minded alters are used to make predictions, can rival traditional user-to-user...

Data Acquisition and Cost-effective Predictive Modeling: Targeting Offers for Electronic Commerce (2008)

Foster Provost

Electronic commerce is revolutionizing the way we think about data modeling, by making it possible to integrate the processes of (costly) data acquisition and model induction. The opportunity for...

Journal of Machine Learning Research Submitted 1/2005; Revised 6/2006; forthcoming Classification in Networked Data: A Toolkit (2008)

Sofus A. Macskassy, Foster Provost

This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in...

Network-Based Marketing: Identifying Likely Adopters via Consumer Networks (2008)

Hill, Shawndra, Provost, Foster, Volinsky, Chris

Network-based marketing refers to a collection of marketing techniques that take advantage of links between consumers to increase sales. We concentrate on the consumer networks formed using direct...

A Simple Relational Classifier (2008)

Provost, Foster, Macskassy, Sofus

We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent attributes. We...

Machine Learning from Imbalanced Data Sets 101 (2008)

Provost, Foster

Invited paper for the AAAI'2000 Workshop on Imbalanced Data Sets.

Tree Induction for Probability-based Ranking (2008)

Provost, Foster, Domingos, Pedro

NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research

Discover Interesting Patterns for Investment Decision Making with GLOWER - A Genetic Learner Overlaid With Entropy Reduction (2008)

Dhar, Vasant, Chou, Dashin, Provost, Foster

Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open ended by forcing us to consider...

Telecommunications Network Diagnosis (2008)

Danyluk, Andrea, Provost, Foster, Carr, Brian

The Scrubber 3 system monitors problems in the local loop of the telephone network, making automated decisions on tens of millions of cases a year, many of which lead to automated actions. Scrubber...

Social Network Collaborative Filtering (2008)

Zheng, Rong, Wilkinson, Dennis, Provost, Foster

This paper demonstrates that "social network collaborative filtering" (SNCF), wherein user-selected like-minded alters are used to make predictions, can rival traditional user-to-user collaborative...

Classification in Networked Data: A toolkit and a univariate case study (2008)

Sofus A. Macskassy, Foster Provost

This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in...

Economical Active Feature-value Acquisition through Expected Utility Estimation (2008)

Prem Melville, Foster Provost

In many classification tasks training data have missing feature values that can be acquired at a cost. For building accurate predictive models, acquiring all missing values is often prohibitively...

Efficient Progressive Sampling (2008)

Foster Provost

Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However,...

Submitted to Machine Learning Active Sampling for Class Probability Estimation and Ranking (2008)

Maytal Saar-tsechansky, Foster Provost

In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class...

A toolkit and a univariate case study (2008)

Sofus A. Macskassy, Foster Provost

This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to a collection of networked data sets used in prior machine learning research....

Social Network Collaborative Filtering (2008)

Rong Zheng, Foster Provost, Anindya Ghose

This paper reports on a preliminary empirical study comparing methods for collaborative filtering (CF) using explicit data on consumers’ social networks. To our knowledge it is the first study to...

Viral Marketing: Identifying likely adopters via consumer networks (2008)

Shawndra Hill, Foster Provost, Chris Volinsky

We investigate the hypothesis: those consumers who have communicated with a customer of a particular service have increased likelihood of adopting the service. We survey the diverse literature on...

Maytal Saar-Tsechansky and (2008)

Foster Provost, Maytal Saar-tsechansky

For many supervised learning tasks, the cost of acquiring training data is dominated by the cost of class labeling. In this work, we explore active learning for class probability estimation (CPE)....

Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers (2008)

Sheng, Victor, Provost, Foster, Ipeirotis, Panagiotis G.

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus...

Handling missing values when applying classification models (2008)

Maytal Saar-tsechansky, Foster Provost, Rich Caruana

Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This...

The Gift of Gab: Evidence TelE-Commerce Firms Can Profit from Viral Marketing (2008)

Shawndra Hill, Foster Provost, Chris Volinsky

existing customers influence the rate of product diffusion. The In particular, we observe how the communication networks of present a natural testbed for viral marketing models because...

Suspicion scoring of networked entities based on guilt-by-association, collective inference, and focused data access (2008)

Sofus A. Macskassy, Foster Provost

We describe a guilt-by-association system that can be used to rank networked entities by their suspiciousness. We demonstrate the algorithm on a suite of data sets generated by a terrorist-world...

targeting (2008)

Foster Provost

acquisition and cost-effective predictive modeling:

Forthcoming in Information Systems Research Decision-centric Active Learning of Binary-Outcome Models (2008)

Maytal Saar-tsechansky, Foster Provost

It can be expensive to acquire the data required for businesses to employ data-driven predictive modeling, for example to model consumer preferences to optimize targeting. Prior research has...

Handling Missing Values when Applying Classification Models (2008)

Maytal Saar-tsechansky, Foster Provost, Rich Caruana

Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This...

Abstract (2008)

Rong Zheng, Foster Provost, Anindya Ghose

This paper reports on a preliminary empirical study comparing methods for collaborative filtering (CF) using explicit consumers’ social networks. As user-generated social networks become...

A Brief Survey of Machine Learning Methods for Classification in Networked Data and an Application to Suspicion Scoring (2008)

Sofus Attila Macskassy, Foster Provost

Abstract. This paper surveys work from the field of machine learning on the problem of within-network learning and inference. To give motivation and context to the rest of the survey, we start by...

Appears in User Modeling 2001 Workshop: Machine Learning, Information Retrieval and User Modeling Information Triage using Prospective Criteria (2008)

Sofus A. Macskassy, Haym Hirsh, Foster Provost, Ramesh Sankaranarayanan, Vasant Dhar

Abstract: In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of...

on Knowledge and Data Engineering Towards Intelligent Assistance for a Data Mining Process: An Ontology-based Approach for Cost-sensitive Classification 1 (2008)

Abraham Bernstein, Foster Provost, Shawndra Hill

A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There...

Suspicion scoring of networked entities based on guilt-by-association, collective inference, and (2008)

Sofus A. Macskassy, Foster Provost

We describe a guilt-by-association system that can be used to rank networked entities by their suspiciousness. We demonstrate the algorithm on a suite of data sets generated by a terrorist-world...

Get Another Label? Improving Data Quality and Data Mining (2008)

Sheng, Victor, Provost, Foster, Ipeirotis, Panagiotis

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus...

Abstract: (2007)

Foster Provost, Pedro Domingos

Decision trees are one of the most effective and widely used classification methods. However, many applications require class probability estimates, and probability estimation trees (PETs) have the same...

H1.9.1 Telecommunications Network Diagnosis (2007)

Andrea Pohoreckyj Danyluk, Foster Provost

The Scrubber 3 system monitors problems in the local loop of the telephone network, making automated decisions on tens of millions of cases a year, many of which lead to automated actions. Scrubber...

Abstract (2007)

Foster Provost, David Jensen, Tim Oates

Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the...

Submitted 12/21/02 Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction (2007)

Gary M. Weiss, Foster Provost

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the data and/or the...

Social Network Collaborative Filtering (2007)

Zheng, Rong, Provost, Foster, Ghose, Anindya

This paper reports on a preliminary empirical study comparing methods for collaborative filtering (CF) using explicit data on consumers’ social networks. To our knowledge it is the first study to...

Learning and Inference in Massive Social Networks (2007)

Hill, Shawndra, Provost, Foster, Volinsky, Chris

Researchers and practitioners increasingly are gaining access to data on explicit social networks. For example, telecommunications and technology firms record data on consumer networks (via phone...

Data acquisition and cost-effective predictive modeling: targeting offers for electronic commerce (2007)

Provost, Foster, Melville, Prem, Saar-Tsechansky, Maytal

Electronic commerce is revolutionizing the way we think about data modeling, by making it possible to integrate the processes of (costly) data acquisition and model induction. The opportunity for...

Handling Missing Values when Applying Classification Models (2007)

Saar-Tsechansky, Maytal, Provost, Foster

Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This...

Modeling Complex Networks For (Electronic) Commerce (2007)

Provost, Foster, Sundararajan, Arun

NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research

Classification in Networked Data: A Toolkit and a Univariate Case Study (2007)

Macskassy, Sofus, Provost, Foster

This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in...

Decision-centric Active Learning of Binary-Outcome Models (2007)

Saar-Tsechansky, Maytal, Provost, Foster

It can be expensive to acquire the data required for businesses to employ data-driven predictive modeling, for example to model consumer preferences to optimize targeting. Prior research has...

Modeling Complex Networks For (Electronic) Commerce (2007)

Foster Provost, Arun Sundararajan

Why do networks matter in commerce? What are examples of “large sets of irregularly connected entities” we observe as a consequence of (electronic) commerce? Why are...

Network-Based Marketing: Identifying Likely Adopters via Consumer Networks (2006)

Hill, Shawndra, Provost, Foster, Volinsky, Chris

Network-based marketing refers to a collection of marketing techniques that take advantage of links between consumers to increase sales. We concentrate on the consumer networks formed using direct...

Distribution-based aggregation for relational learning with identifier attributes (2006)

Perlich, Claudia, Provost, Foster

Identifier attributes—very high-dimensional categorical attributes such as particular product ids or people’s names—rarely are incorporated in statistical modeling. However, they can play an...

A Simple Relational Classifier (2006)

Macskassy, Sofus A., Provost, Foster

We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent attributes. We...

Classification in Networked Data: A Toolkit and a Univariate Case Study (2006)

Macskassy, Sofus A., Provost, Foster

This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to networked data used in prior machine learning research. We consider...

Distribution-based aggregation for relational learning with identifier attributes (2006)

Claudia Perlich, Foster Provost

Network-based marketing: Identifying likely adopters via consumer networks (2006)

Shawndra Hill, Foster Provost, Chris Volinsky

Abstract. Network-based marketing refers to a collection of marketing techniques that take advantage of links between consumers to increase sales. We concentrate on the consumer networks formed using...

Classification in networked data: A Toolkit (2006)

Sofus A. Macskassy, Foster Provost, Andrew McCallum

This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in...

Active feature-value acquisition (2006)

Maytal Saar-tsechansky, Prem Melville, Foster Provost

Most induction algorithms for building predictive models take as input training data in the form of feature vectors. Acquiring the values of features may be costly, and simply acquiring all values...

Economical Active Feature-value Acquisition through Expected Utility Estimation (2005)

Melville, Prem, Saar-Tsechansky, Maytal, Mooney, Raymond, Provost, Foster

In many classification tasks training data have missing feature values that can be acquired at a cost. For building accurate predictive models, acquiring all missing values is often prohibitively...

The Gift of Gab: Evidence TelE-Commerce Firms Can Profit from Viral Marketing (2005)

Hill, Shawndra, Provost, Foster, Volinsky, Chris

Viral or buzz marketing takes advantage of communication linkages to propagate positive influence regarding a product or service. TelE-commerce is an ideal domain within which to study viral...

Towards Intelligent Assistance for a Data Mining Process:- (2005)

Provost, Foster, Hill, Shawndra, Bernstein, Abraham

A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There...

ACORA: Distribution-Based Aggregation for Relational Learning from Identifier Attributes (2005)

Perlich, Claudia, Provost, Foster

Feature construction through aggregation plays an essential role in modeling relational domains with one-to-many relationships between tables. One-to-many relationships lead to bags (multisets) of...

ROC Confidence Bands: An Empirical Study (2005)

Macskassy, Sofus, Provost, Foster, Rosset, Saharon

This paper is about constructing confidence bands around an ROC curve such that (1 - \delta)% of the ROC curves traced by data sets of size r will fall completely within the bands. We introduce to...

Viral Marketing: Identifying Likely Adopters Via Consumer Networks (2005)

Hill, Shawndra, Provost, Foster, Volinsky, Chris

We investigate the hypothesis: those consumers who have communicated with a customer of a particular service have increased likelihood of adopting the service. We survey the diverse literature on...

ROC Confidence Bands: An Empirical Evaluation (2005)

Macskassy, Sofus, Provost, Foster, Rosset, Saharon

This paper is about constructing confidence bands around ROC curves. We first introduce to the machine learning community three band-generating methods from the medical field, and evaluate how well...

Suspicion scoring based on guilt-by-association, collective inference, and focused (2005)

Macskassy, Sofus, Provost, Foster

We describe a guilt-by-association system that can be used to rank entities by their suspiciousness. We demonstrate the algorithm on a suite of data sets generated by a terrorist-world simulator...

Pointwise ROC Confidence Bounds: An Empirical Evaluation (2005)

Macskassy, Sofus, Provost, Foster, Rosset, Saharon

This paper is about constructing and evaluating pointwise confidence bounds on an ROC curve. We describe four confidence-bound methods, two from the medical field and two used previously in machine...

Pointwise ROC Confidence Bounds: An Empirical Evaluation (2005)

Sofus A. Macskassy, Foster Provost, Saharon Rosset

This paper is about constructing and evaluating pointwise confidence bounds on an ROC curve. We describe four confidence-bound methods, two from the medical field and two used previously in machine...

ROC Confidence Bands: An Empirical Evaluation (2005)

Sofus A. Macskassy, Foster Provost, Saharon Rosset

This paper is about constructing confidence bands around ROC curves. We first introduce to the machine learning community three band-generating methods from the medical field, and evaluate how well...

and its use for classification of networked data (2005)

Sofus A. Macskassy, Foster Provost

This paper describes NetKit-SRL, or NetKit for short, a toolkit for learning from and classifying networked data. The toolkit is open-source and publicly available. It is modular and built for ease...

Economical Active Feature-value Acquisition through Expected Utility Estimation (2005)

Prem Melville, Foster Provost

In many classification tasks training data have missing feature values that can be acquired at a cost. For building accurate predictive models, acquiring all missing values is often prohibitively...

NetKit-SRL: A Toolkit for Network Learning and Inference (2005)

Sofus A. Macskassy, Foster Provost

This paper describes NetKit-SRL, or NetKit for short, a toolkit for learning from and classifying networked data. The toolkit is open-source and publicly available. It is modular and built for ease...

Active Learning for Decision Making (2004)

Saar-Tsechansky, Maytal, Provost, Foster

This paper addresses focused information acquisition for predictive data mining. As businesses strive to cater to the preferences of individual consumers, they often employ predictive models to...

Active Feature-Value Acquisition for Classifier Induction (2004)

Melville, Prem, Saar-Tsechansky, Maytal, Provost, Foster, Mooney, Raymond

Many induction problems include missing data that can be acquired at a cost. For building accurate predictive models, acquiring complete information for all instances is often expensive or...

Confidence Bands for ROC Curves: Methods and an Empirical Study (2004)

Macskassy, Sofus, Provost, Foster

In this paper we study techniques for generating and evaluating confidence bands on ROC curves. ROC curve evaluation is rapidly becoming a commonly used evaluation metric in machine learning,...

Confidence Bands for ROC Curves (2004)

Macskassy, Sofus, Provost, Foster

In this paper we study techniques for generating and evaluating confidence bands on ROC curves. ROC curve evaluation is rapidly becoming a commonly used evaluation metric in machine learning,...

Simple Models and Classification in Networked Data (2004)

Macskassy, Sofus, Provost, Foster

When entities are linked by explicit relations, classification methods that take advantage of the network can perform substantially better than methods that ignore the network. This paper argues that...

Classification in Networked Data: A Toolkit and a Univariate Case Study (2004)

Macskassy, Sofus, Provost, Foster

This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to a collection of networked data sets used in prior machine learning research....

Active Sampling for Class Probability Estimation and Ranking (2004)

Provost, Foster, Saar-Tsechansky, Maytal

Abstract. In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to...

Active Sampling for Class Probability Estimation and Ranking (2004)

Maytal Saar-Tsechansky, Foster Provost

In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class...

Active Feature-Value Acquisition for Classifier Induction (2004)

Prem Melville, Maytal Saar-Tsechansky, Foster Provost, Raymond Mooney

Many induction problems, such as on-line customer profiling, include missing data that can be acquired at a cost, such as incomplete customer information that can be filled in by an intermediary. For...

Simple Models and Classification in Networked Data (2004)

Sofus A. Macskassy, Foster Provost

When entities are linked by explicit relations, classification methods that take advantage of the network can perform substantially better than methods that ignore the network. This paper argues that...

Distribution-based aggregation for relational learning with identifier attributes (2004)

Claudia Perlich, Foster Provost

Feature construction through aggregation plays an essential role in modeling relational domains with one-to-many relationships between tables. One-to-many relationships lead to bags (multisets) of...

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction (2003)

Weiss, Gary, Provost, Foster

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples...

Aggregation-Based Feature Invention and Relational Concept Classes (2003)

Perlich, Claudia, Provost, Foster

Model induction from relational data requires aggregation of values of attributes of related entities. This paper makes three contributions to the study of relational learning. (1) It presents a...

The Myth of the Double-Blind Review? Author Identification Using Only Citations (2003)

Provost, Foster, Hill, Shawndra

Prior studies have questioned the degree of anonymity of the double-blind review process for scholarly research articles. For example, one study based on a survey of reviewers concluded that authors...

Predicting citation rates for physics papers: Constructing features for an ordered probit model (2003)

Perlich, Claudia, Provost, Foster, Macskassy, Sofus

Gehrke et al. introduce the citation prediction task in their paper "Overview of the KDD Cup 2003" (in this issue). The objective was to predict the change in the number of citations a paper will...

Tree Induction vs. Logistic Regression: A Learning-Curve Analysis (2003)

Perlich, Claudia, Provost, Foster, Simonoff, Jeffrey

Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...

The Relational Vector-space Model (2003)

Bernstein, Abraham, Clearwater, Scott, Provost, Foster

This paper addresses the classification of linked entities. We introduce a relational vector-space (VS) model (in analogy to the VS model used in information retrieval) that abstracts the linked structure,...

Aggregation-Based Feature Invention and Relational (2003)

Perlich, Claudia, Provost, Foster

Due to interest in social and economic networks, relational modeling is attracting increasing attention. The field of relational data mining/learning, which traditionally was dominated by logic-based...

Confidence Bands for ROC Curves (2003)

Macskassy, Sofus A., Provost, Foster, Littman, Michael L.

We address the problem of comparing the performance of classifiers. In this paper we study techniques for generating and evaluating bands on ROC curves. Historically this has been done using...

Tree induction for probability-based ranking (2003)

Foster Provost, Pedro Domingos

Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class...

Tree Induction vs. Logistic Regression: A Learning-curve Analysis (2003)

Claudia Perlich, Foster Provost, Jeffrey S. Simonoff, William Cohen

Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...

A Simple Relational Classifier (2003)

Sofus A. Macskassy, Foster Provost

We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent attributes. We...

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research 19 (2003) 315-354; submitted 12/02, published 10/03 (2003)

Gary M. Weiss, Foster Provost

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples...

Tree Induction vs. Logistic Regression: A Learning-Curve Analysis (2003)

Claudia Perlich, Foster Provost, Jeffrey S. Simonoff, William Cohen

Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...

A simple relational classifier (2003)

Sofus A. Macskassy, Foster Provost

We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent...

on Statistical Relational Learning and its Connections to Other Fields (2003)

Tom Dietterich, David Heckerman, Foster Provost

This workshop is the third in a series of workshops held in conjunction with AAAI and IJCAI. The first workshop was held in July, 2000 at AAAI. Notes from that workshop are available at

Learning when training data are costly: The effect of class distribution on tree induction (2003)

Gary M. Weiss, Foster Provost

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples...

Tree Induction vs. Logistic Regression: Learning Curve Analysis (2003)

Claudia Perlich, Foster Provost, Jeffrey S. Simonoff

Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...

The relational vector-space model and industry classification (2003)

Abraham Bernstein, Scott Clearwater, Foster Provost

This paper addresses the classification of linked entities. We introduce a relational vector-space (VS) model (in analogy to the VS model used in information retrieval) that abstracts the linked...

Relational Learning Problems and Simple Models (2003)

Foster Provost, Claudia Perlich, Sofus A. Macskassy

In recent years, we have seen remarkable advances in algorithms for relational learning, especially statistically based algorithms. These algorithms have been developed in a wide variety of different...

Intelligent Assistance for the Data Mining Process: An Ontology-based Approach (2002)

Bernstein, Abraham, Hill, Shawndra, Provost, Foster

A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There...

Discovering Knowledge from Relational Data Extracted from Business News (2002)

Bernstein, Abraham, Clearwater, Scott, Hill, Shawndra, Perlich, Claudia, Provost, Foster

Thousands of business news stories (including press releases, earnings reports, general business news, etc.) are released each day. Recently, information technology advances have partially automated...

Intelligent Assistance for the Data Mining Process: . . . (2002)

Abraham Bernstein, Foster Provost, Shawndra Hill

A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There...

An intelligent assistant for the knowledge discovery process (2002)

Abraham Bernstein, Foster Provost

A knowledge discovery (KD) process involves preprocessing data, choosing...

Tree Induction vs. Logistic Regression: A Learning-Curve Analysis (2001)

Perlich, Claudia, Provost, Foster, Simonoff, Jeffrey S.

Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...

Robust Classification for Imprecise Environments (2001)

Provost, Foster, Fawcett, Tom

In real-world environments it usually is difficult to specify target operating conditions precisely, for example, target misclassification costs. This uncertainty makes building robust classification...

An Intelligent Assistant for the Knowledge Discovery Process (2001)

Bernstein, Abraham, Provost, Foster

A knowledge discovery (KD) process involves preprocessing data, choosing a data-mining algorithm, and postprocessing the mining results. There are very many choices for each of these stages, and non-trivial...

Active Sampling for Class Probability Estimation and Ranking (2001)

Saar-Tsechansky, Maytal, Provost, Foster

In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class...

Active Learning for Decision-Making (2001)

Maytal Saar-Tsechansky, Foster Provost

This paper addresses focused information acquisition for predictive data mining. As businesses strive to cater to the preferences of individual consumers, they often employ predictive models to...

Active learning for class probability estimation and ranking (2001)

Maytal Saar-Tsechansky, Foster Provost

For many supervised learning tasks it is very costly to produce training data with class labels. Active learning acquires data incrementally, at each stage using the model learned so far to help...

Intelligent Information Triage (2001)

Sofus A. Macskassy, Haym Hirsh, Foster Provost, Ramesh Sankaranarayanan, Vasant Dhar

In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective...

Applications of data mining to electronic commerce (2001)

Ron Kohavi, Foster Provost

Electronic commerce is emerging as the killer domain for data mining technology. Is there support for such a bold statement? Data mining technologies have been around for decades, without moving...

The Effect of Class Distribution on Classifier Learning: An Empirical Study (2001)

Gary M. Weiss, Foster Provost

In this article we analyze the effect of class distribution on classifier learning. We begin by describing the different ways in which class distribution affects learning and how it affects the...

The Effect of Class Distribution on Classifier Learning: An Empirical Study (2001)

Gary M. Weiss, Foster Provost

Many of today's large data sets must be reduced in size before invoking inductive algorithms, due to the costs associated with procuring/processing the data, and because most of these algorithms...

Information Triage using Prospective Criteria (2001)

Sofus A. Macskassy, Haym Hirsh, Foster Provost, Ramesh Sankaranarayanan, Vasant Dhar

In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective...

Active Sampling for Class Probability Estimation and Ranking (2001)

Maytal Saar-Tsechansky, Foster Provost

In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class...

Applications of Data Mining to Electronic Commerce (2000)

Kohavi, Ron, Provost, Foster

Electronic commerce is emerging as the killer domain for data mining technology. The following are five desiderata for success. Seldom are they all present in one data mining application. 1....

Robust Classification for Imprecise Environments (2000)

Provost, Foster, Fawcett, Tom

In real-world environments it usually is difficult to specify target operating conditions precisely, for example, target misclassification costs. This uncertainty makes building robust classification...

Discovering Interesting Patterns for Investment Decision Making with GLOWER - A Genetic Learner Overlaid With Entropy Reduction (2000)

Dhar, Vasant, Chou, Dashin, Provost, Foster

Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open-ended by forcing us to consider...

Variance-based Active Learning (2000)

Saar-Tsechansky, Maytal, Provost, Foster

For many supervised learning tasks, the cost of acquiring training data is dominated by the cost of class labeling. In this work, we explore active learning for class probability estimation (CPE)....

Discovering Interesting Patterns for Investment Decision Making with GLOWER - A Genetic Learner Overlaid With Entropy Reduction (2000)

Vasant Dhar, Dashin Chou, Foster Provost

Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open ended by forcing us to consider...

Robust classification for imprecise environments (2000)

Foster Provost, Tom Fawcett

In real-world environments it usually is difficult to specify target operating conditions precisely, for example, target misclassification costs. This uncertainty makes building robust classification...

Discovering Interesting Patterns for Investment Decision Making with GLOWER – A Genetic Learner Overlaid With Entropy Reduction. Data Mining and Knowledge Discovery 4(4) (2000)

Vasant Dhar, Dashin Chou, Foster Provost

Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open-ended by forcing us to consider...

Activity monitoring: Noticing interesting changes in behavior (1999)

Tom Fawcett, Foster Provost

We introduce a problem class which we term activity monitoring. Such problems involve monitoring the behavior of a large population of entities for interesting events requiring action. We present a...

Rule-space Search for Knowledge-based Discovery (1999)

Foster Provost, John M. Aronis, Bruce G. Buchanan

Because the knowledge discovery process is ill-defined, iterative, and requires intense interaction, algorithm flexibility is crucial. In this paper, we present a straightforward, heuristic...

Efficient Progressive Sampling (1999)

Foster Provost, David Jensen, Tim Oates

Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the...

A Survey of Methods for Scaling Up Inductive Algorithms (1999)

Foster Provost, Venkateswarlu Kolluri

One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work...

Problem Definition, Data Cleaning, and Evaluation: A Classifier Learning Case Study (1999)

Foster Provost, Andrea Pohoreckyj Danyluk

This paper is a case study of this process based on a long-term project addressing the automatic dispatch of technicians to fix faults in the local loop of a telephone network. The bottom line of the...

A Survey of Methods for Scaling Up Inductive Algorithms (1999)

Foster Provost, Venkateswarlu Kolluri, Usama Fayyad

One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. By collecting, categorizing, and summarizing existing work on...

Distributed Data Mining: Scaling up and beyond (1999)

Foster Provost

In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey...

Distributed Fault Tolerant Embeddings of Binary Trees in Hypercubes (1998)

Provost, Foster, Melhem, Rami

This paper presents a distributed algorithm for embedding binary trees in hypercubes. Starting with the root (invoked in some cube node by a host), each node is responsible for determining the...

Classification in Networked Data: A Toolkit and a Univariate Case Study (1998)

Macskassy, Sofus A., Provost, Foster

This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to a collection of networked data sets used in prior machine learning research....

Monitoring Business Activity (1998)

Provost, Foster, Macskassy, Sofus

Under this project, the authors studied and developed technologies to "score" entities to build models that will produce an estimate of the likelihood that an entity exhibits some characteristic. For...

The Case Against Accuracy Estimation for Comparing Induction Algorithms (1998)

Foster Provost

We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and...

Pharmacophore discovery using the inductive logic programming system PROGOL (1998)

Paul Finn, David Page, Ronny Kohavi, Foster Provost

This paper is a case study of a machine-aided knowledge discovery process within the general area of drug design. More specifically, the paper describes a sequence of experiments in which...

On Applied Research in Machine Learning (1998)

Foster Provost, Ron Kohavi

...obstacles that impede their practical application. Most often these obstacles take the form of restrictive simplifying assumptions commonly made in research. Consider as an example the assumption,...

The Case Against Accuracy Estimation for Comparing Induction Algorithms (1998)

Foster Provost, Tom Fawcett, Ron Kohavi

We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and...

Pharmacophore Discovery using the Inductive Logic Programming System Progol (1998)

Paul Finn, David Page, Ron Kohavi, Foster Provost

This paper presents a case study of a machine-aided knowledge discovery process within the general area of drug design. Within drug design, the particular problem of pharmacophore discovery is...

Robust Classification Systems for Imprecise Environments (1998)

Foster Provost, Tom Fawcett

In real-world environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We show that it is...

Pharmacophore Discovery using the Inductive Logic Programming System Progol (1998)

Paul Finn, David Page, Ronny Kohavi, Foster Provost

This paper is a case study of a machine-aided knowledge discovery process within the general area of drug design. More specifically, the paper describes a sequence of experiments in which an...

Machine Learning for the Detection of Oil Spills in Satellite Radar Images (1998)

Miroslav Kubat, Robert C. Holte, Stan Matwin, Ron Kohavi, Foster Provost

During a project examining the use of machine learning techniques for oil spill detection, we encountered several essential questions that we believe deserve the attention of the research community....

Learning in the 'Real World' (1998)

Lorenza Saitta, Filippo Neri, Ronny Kohavi, Foster Provost

In this paper we define and characterize the process of developing a "real-world" Machine Learning application, with its difficulties and relevant issues, distinguishing it from the...

Robust Classification Systems for Imprecise Environments (1998)

Foster Provost, Tom Fawcett

In real-world environments, it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We show that it is...

Adaptive Fraud Detection. Data Mining and Knowledge Discovery (1997)

Tom Fawcett, Foster Provost

One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection,...

Scaling Up Inductive Algorithms: An Overview (1997)

Foster Provost, Venkateswarlu Kolluri

This paper establishes common ground for researchers addressing the challenge of scaling up inductive data mining algorithms to very large databases, and for practitioners who want to understand the...

Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions (1997)

Foster Provost, Tom Fawcett

Applications of inductive learning algorithms to real-world data mining problems have shown repeatedly that using accuracy to compare classifiers is not adequate because the underlying assumptions...

Adaptive Fraud Detection (1997)

Tom Fawcett, Foster Provost

One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection, using a...

A Survey of Methods for Scaling Up Inductive Learning Algorithms (1997)

Foster J. Provost, Venkateswarlu Kolluri

Each year, one of the explicit challenges for the KDD research community is to develop methods that facilitate the use of inductive learning algorithms for mining very large databases. By...

Combining Data Mining and Machine Learning for Effective User Profiling (1996)

Tom Fawcett, Foster Provost

This paper describes the automatic design of methods for detecting fraudulent behavior. Much of the design is accomplished using a series of machine learning methods. In particular, we combine data...

Robust Classification for Imprecise Environments (1989)

Foster Provost, Tom Fawcett

In real-world environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We present a method for...

Rule-space search for knowledge-based discovery (1001)

Foster Provost, John M. Aronis, Bruce G. Buchanan

Because the knowledge discovery process is ill-defined, iterative, and requires intense interaction, algorithm flexibility is crucial. In this paper, we present a straightforward, heuristic...