Active Feature-Value Acquisition, submitted to Management Science, manuscript MS-00665-2006 (2009)
Maytal Saar-Tsechansky, Prem Melville, Foster Provost
Most induction algorithms for building predictive models take as input training data in the form of feature vectors. Acquiring the values of features may be costly, and simply acquiring all values...
Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers (2009)
Victor S. Sheng, Foster Provost, Panagiotis G. Ipeirotis
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus...
Social network collaborative filtering (2009)
Rong Zheng, Dennis Wilkinson, Foster Provost
This paper demonstrates that "social network collaborative filtering" (SNCF), wherein user-selected like-minded alters are used to make predictions, can rival traditional user-to-user...
Electronic commerce is revolutionizing the way we think about data modeling, by making it possible to integrate the processes of (costly) data acquisition and model induction. The opportunity for...
Sofus A. Macskassy, Foster Provost
This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in...
Network-Based Marketing: Identifying Likely Adopters via Consumer Networks (2008)
Hill, Shawndra, Provost, Foster, Volinsky, Chris
Network-based marketing refers to a collection of marketing techniques that take advantage of links between consumers to increase sales. We concentrate on the consumer networks formed using direct...
A Simple Relational Classifier (2008)
Provost, Foster, Macskassy, Sofus
We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent attributes. We...
Machine Learning from Imbalanced Data Sets 101 (2008)
Invited paper for the AAAI'2000 Workshop on Imbalanced Data Sets.
Tree Induction for Probability-based Ranking (2008)
Provost, Foster, Domingos, Pedro
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
Dhar, Vasant, Chou, Dashin, Provost, Foster
Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open ended by forcing us to consider...
Telecommunications Network Diagnosis (2008)
Danyluk, Andrea, Provost, Foster, Carr, Brian
The Scrubber 3 system monitors problems in the local loop of the telephone network, making automated decisions on tens of millions of cases a year, many of which lead to automated actions. Scrubber...
Social Network Collaborative Filtering (2008)
Zheng, Rong, Wilkinson, Dennis, Provost, Foster
This paper demonstrates that "social network collaborative filtering" (SNCF), wherein user-selected like-minded alters are used to make predictions, can rival traditional user-to-user collaborative...
Classification in Networked Data: A toolkit and a univariate case study (2008)
Sofus A. Macskassy, Foster Provost
This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in...
Economical Active Feature-value Acquisition through Expected Utility Estimation (2008)
In many classification tasks training data have missing feature values that can be acquired at a cost. For building accurate predictive models, acquiring all missing values is often prohibitively...
Efficient Progressive Sampling (2008)
Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However,...
Active Sampling for Class Probability Estimation and Ranking, submitted to Machine Learning (2008)
Maytal Saar-Tsechansky, Foster Provost
In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class...
Classification in Networked Data: A toolkit and a univariate case study (2008)
Sofus A. Macskassy, Foster Provost
This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to a collection of networked data sets used in prior machine learning research....
Social Network Collaborative Filtering (2008)
Rong Zheng, Foster Provost, Anindya Ghose
This paper reports on a preliminary empirical study comparing methods for collaborative filtering (CF) using explicit data on consumers’ social networks. To our knowledge it is the first study to...
Viral Marketing: Identifying likely adopters via consumer networks (2008)
Shawndra Hill, Foster Provost, Chris Volinsky
We investigate the hypothesis: those consumers who have communicated with a customer of a particular service have increased likelihood of adopting the service. We survey the diverse literature on...
Maytal Saar-Tsechansky and (2008)
Foster Provost, Maytal Saar-Tsechansky
For many supervised learning tasks, the cost of acquiring training data is dominated by the cost of class labeling. In this work, we explore active learning for class probability estimation (CPE)....
Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers (2008)
Sheng, Victor, Provost, Foster, Ipeirotis, Panagiotis G.
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus...
Handling missing values when applying classification models (2008)
Maytal Saar-Tsechansky, Foster Provost, Rich Caruana
Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This...
The Gift of Gab: Evidence TelE-Commerce Firms Can Profit from Viral Marketing (2008)
Shawndra Hill, Foster Provost, Chris Volinsky
In particular, we observe how the communication networks of existing customers influence the rate of product diffusion. The ... present a natural testbed for viral marketing models because...
Sofus A. Macskassy, Foster Provost
We describe a guilt-by-association system that can be used to rank networked entities by their suspiciousness. We demonstrate the algorithm on a suite of data sets generated by a terrorist-world...
A brief survey of machine learning methods for classification in networked data and an application to suspicion scoring (2008)
Sofus A. Macskassy, Foster Provost
Maytal Saar-Tsechansky, Foster Provost
It can be expensive to acquire the data required for businesses to employ data-driven predictive modeling, for example to model consumer preferences to optimize targeting. Prior research has...
Handling Missing Values when Applying Classification Models (2008)
Maytal Saar-Tsechansky, Foster Provost, Rich Caruana
Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This...
Rong Zheng, Foster Provost, Anindya Ghose
This paper reports on a preliminary empirical study comparing methods for collaborative filtering (CF) using explicit consumers’ social networks. As user-generated social networks become...
Sofus Attila Macskassy, Foster Provost
This paper surveys work from the field of machine learning on the problem of within-network learning and inference. To give motivation and context to the rest of the survey, we start by...
Sofus A. Macskassy, Haym Hirsh, Foster Provost, Ramesh Sankaranarayanan, Vasant Dhar
In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of...
Abraham Bernstein, Foster Provost, Shawndra Hill
A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There...
Suspicion scoring of networked entities based on guilt-by-association, collective inference and ... (2008)
Sofus A. Macskassy, Foster Provost
We describe a guilt-by-association system that can be used to rank networked entities by their suspiciousness. We demonstrate the algorithm on a suite of data sets generated by a terrorist-world...
Get Another Label? Improving Data Quality and Data Mining (2008)
Sheng, Victor, Provost, Foster, Ipeirotis, Panagiotis
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus...
Foster Provost, Pedro Domingos
Decision trees are one of the most effective and widely used classification methods. However, many applications require class probability estimates, and probability estimation trees (PETs) have the same...
Telecommunications Network Diagnosis (2007)
Andrea Pohoreckyj Danyluk, Foster Provost
The Scrubber 3 system monitors problems in the local loop of the telephone network, making automated decisions on tens of millions of cases a year, many of which lead to automated actions. Scrubber...
Foster Provost, David Jensen, Tim Oates
Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the...
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the data and/or the...
Social Network Collaborative Filtering (2007)
Zheng, Rong, Provost, Foster, Ghose, Anindya
This paper reports on a preliminary empirical study comparing methods for collaborative filtering (CF) using explicit data on consumers’ social networks. To our knowledge it is the first study to...
Learning and Inference in Massive Social Networks (2007)
Hill, Shawndra, Provost, Foster, Volinsky, Chris
Researchers and practitioners increasingly are gaining access to data on explicit social networks. For example, telecommunications and technology firms record data on consumer networks (via phone...
Provost, Foster, Melville, Prem, Saar-Tsechansky, Maytal
Electronic commerce is revolutionizing the way we think about data modeling, by making it possible to integrate the processes of (costly) data acquisition and model induction. The opportunity for...
Handling Missing Values when Applying Classification Models (2007)
Saar-Tsechansky, Maytal, Provost, Foster
Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This...
Modeling Complex Networks For (Electronic) Commerce (2007)
Provost, Foster, Sundararajan, Arun
Classification in Networked Data: A Toolkit and a Univariate Case Study (2007)
Macskassy, Sofus, Provost, Foster
This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in...
Decision-centric Active Learning of Binary-Outcome Models (2007)
Saar-Tsechansky, Maytal, Provost, Foster
It can be expensive to acquire the data required for businesses to employ data-driven predictive modeling, for example to model consumer preferences to optimize targeting. Prior research has...
Modeling Complex Networks For (Electronic) Commerce (2007)
Foster Provost, Arun Sundararajan
Why do networks matter in commerce? What are examples of “large sets of irregularly connected entities” we observe as a consequence of (electronic) commerce? Why are...
Network-Based Marketing: Identifying Likely Adopters via Consumer Networks (2006)
Hill, Shawndra, Provost, Foster, Volinsky, Chris
Network-based marketing refers to a collection of marketing techniques that take advantage of links between consumers to increase sales. We concentrate on the consumer networks formed using direct...
Distribution-based aggregation for relational learning with identifier attributes (2006)
Perlich, Claudia, Provost, Foster
Identifier attributes—very high-dimensional categorical attributes such as particular product ids or people’s names—rarely are incorporated in statistical modeling. However, they can play an...
A Simple Relational Classifier (2006)
Macskassy, Sofus A., Provost, Foster
We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent attributes. We...
Classification in Networked Data: A Toolkit and a Univariate Case Study (2006)
Macskassy, Sofus A., Provost, Foster
This paper presents NetKit, a modular toolkit for classification in networked data, and a case study of its application to networked data used in prior machine learning research. We consider...
Distribution-based aggregation for relational learning with identifier attributes (2006)
Claudia Perlich, Foster Provost
Network-based marketing: Identifying likely adopters via consumer networks (2006)
Shawndra Hill, Foster Provost, Chris Volinsky
Network-based marketing refers to a collection of marketing techniques that take advantage of links between consumers to increase sales. We concentrate on the consumer networks formed using...
Classification in networked data: A toolkit and a univariate case study (2006)
Sofus A. Macskassy, Foster Provost, Andrew McCallum
This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in...
Active feature-value acquisition (2006)
Maytal Saar-Tsechansky, Prem Melville, Foster Provost
Most induction algorithms for building predictive models take as input training data in the form of feature vectors. Acquiring the values of features may be costly, and simply acquiring all values...
Economical Active Feature-value Acquisition through Expected Utility Estimation (2005)
Melville, Prem, Saar-Tsechansky, Maytal, Mooney, Raymond, Provost, Foster
In many classification tasks training data have missing feature values that can be acquired at a cost. For building accurate predictive models, acquiring all missing values is often prohibitively...
The Gift of Gab: Evidence TelE-Commerce Firms Can Profit from Viral Marketing (2005)
Hill, Shawndra, Provost, Foster, Volinsky, Chris
Viral or buzz marketing takes advantage of communication linkages to propagate positive influence regarding a product or service. TelE-commerce is an ideal domain within which to study viral...
Towards Intelligent Assistance for a Data Mining Process:- (2005)
Provost, Foster, Hill, Shawndra, Bernstein, Abraham
A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There...
ACORA: Distribution-Based Aggregation for Relational Learning from Identifier Attributes (2005)
Perlich, Claudia, Provost, Foster
Feature construction through aggregation plays an essential role in modeling relational domains with one-to-many relationships between tables. One-to-many relationships lead to bags (multisets) of...
ROC Confidence Bands: An Empirical Study (2005)
Macskassy, Sofus, Provost, Foster, Rosset, Saharon
This paper is about constructing confidence bands around an ROC curve such that (1 - \delta)% of the ROC curves traced by data sets of size r will fall completely within the bands. We introduce to...
Viral Marketing: Identifying Likely Adopters Via Consumer Networks (2005)
Hill, Shawndra, Provost, Foster, Volinsky, Chris
We investigate the hypothesis: those consumers who have communicated with a customer of a particular service have increased likelihood of adopting the service. We survey the diverse literature on...
ROC Confidence Bands: An Empirical Evaluation (2005)
Macskassy, Sofus, Provost, Foster, Rosset, Saharon
This paper is about constructing confidence bands around ROC curves. We first introduce to the machine learning community three band-generating methods from the medical field, and evaluate how well...
Suspicion scoring based on guilt-by-association, collective inference, and focused (2005)
Macskassy, Sofus, Provost, Foster
We describe a guilt-by-association system that can be used to rank entities by their suspiciousness. We demonstrate the algorithm on a suite of data sets generated by a terrorist-world simulator...
Pointwise ROC Confidence Bounds: An Empirical Evaluation (2005)
Macskassy, Sofus, Provost, Foster, Rosset, Saharon
This paper is about constructing and evaluating pointwise confidence bounds on an ROC curve. We describe four confidence-bound methods, two from the medical field and two used previously in machine...
Abraham Bernstein, Foster Provost, Shawndra Hill
Pointwise ROC Confidence Bounds: An Empirical Evaluation (2005)
Sofus A. Macskassy, Foster Provost, Saharon Rosset
This paper is about constructing and evaluating pointwise confidence bounds on an ROC curve. We describe four confidence-bound methods, two from the medical field and two used previously in machine...
ROC Confidence Bands: An Empirical Evaluation (2005)
Sofus A. Macskassy, Foster Provost, Saharon Rosset
This paper is about constructing confidence bands around ROC curves. We first introduce to the machine learning community three band-generating methods from the medical field, and evaluate how well...
and its use for classification of networked data (2005)
Sofus A. Macskassy, Foster Provost
This paper describes NetKit-SRL, or NetKit for short, a toolkit for learning from and classifying networked data. The toolkit is open-source and publicly available. It is modular and built for ease...
NetKit-SRL: A Toolkit for Network Learning and Inference (2005)
Sofus A. Macskassy, Foster Provost
This paper describes NetKit-SRL, or NetKit for short, a toolkit for learning from and classifying networked data. The toolkit is open-source and publicly available. It is modular and built for ease...
Active Learning for Decision Making (2004)
Saar-Tsechansky, Maytal, Provost, Foster
This paper addresses focused information acquisition for predictive data mining. As businesses strive to cater to the preferences of individual consumers, they often employ predictive models to...
Active Feature-Value Acquisition for Classifier Induction (2004)
Melville, Prem, Saar-Tsechansky, Maytal, Provost, Foster, Mooney, Raymond
Many induction problems include missing data that can be acquired at a cost. For building accurate predictive models, acquiring complete information for all instances is often expensive or...
Confidence Bands for ROC Curves: Methods and an Empirical Study (2004)
Macskassy, Sofus, Provost, Foster
In this paper we study techniques for generating and evaluating confidence bands on ROC curves. ROC curve evaluation is rapidly becoming a commonly used evaluation metric in machine learning,...
Confidence Bands for ROC Curves (2004)
Macskassy, Sofus, Provost, Foster
In this paper we study techniques for generating and evaluating confidence bands on ROC curves. ROC curve evaluation is rapidly becoming a commonly used evaluation metric in machine learning,...
Simple Models and Classification in Networked Data (2004)
Macskassy, Sofus, Provost, Foster
When entities are linked by explicit relations, classification methods that take advantage of the network can perform substantially better than methods that ignore the network. This paper argues that...
Classification in Networked Data: A Toolkit and a Univariate Case Study (2004)
Macskassy, Sofus, Provost, Foster
This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to a collection of networked data sets used in prior machine learning research....
Active Sampling for Class Probability Estimation and Ranking (2004)
Provost, Foster, Saar-Tsechansky, Maytal
In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to...
Active Sampling for Class Probability Estimation and Ranking (2004)
Maytal Saar-Tsechansky, Foster Provost
In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class...
Active Feature-Value Acquisition for Classifier Induction (2004)
Prem Melville, Maytal Saar-Tsechansky, Foster Provost, Raymond Mooney
Many induction problems, such as on-line customer profiling, include missing data that can be acquired at a cost, such as incomplete customer information that can be filled in by an intermediary. For...
Simple Models and Classification in Networked Data (2004)
Sofus A. Macskassy, Foster Provost
When entities are linked by explicit relations, classification methods that take advantage of the network can perform substantially better than methods that ignore the network. This paper argues that...
Distribution-based aggregation for relational learning with identifier attributes (2004)
Claudia Perlich, Foster Provost
Feature construction through aggregation plays an essential role in modeling relational domains with one-to-many relationships between tables. One-to-many relationships lead to bags (multisets) of...
Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction (2003)
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples...
Aggregation-Based Feature Invention and Relational Concept Classes (2003)
Perlich, Claudia, Provost, Foster
Model induction from relational data requires aggregation of values of attributes of related entities. This paper makes three contributions to the study of relational learning.(1) It presents a...
The Myth of the Double-Blind Review? Author Identification Using Only Citations (2003)
Provost, Foster, Hill, Shawndra
Prior studies have questioned the degree of anonymity of the double-blind review process for scholarly research articles. For example, one study based on a survey of reviewers concluded that authors...
Perlich, Claudia, Provost, Foster, Macskassy, Sofus
Gehrke et al. introduce the citation prediction task in their paper "Overview of the KDD Cup 2003" (in this issue). The objective was to predict the change in the number of citations a paper will...
Tree Induction vs. Logistic Regression: A Learning-Curve Analysis (2003)
Perlich, Claudia, Provost, Foster, Simonoff, Jeffrey
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...
The Relational Vector-space Model (2003)
Bernstein, Abraham, Clearwater, Scott, Provost, Foster
This paper addresses the classification of linked entities. We introduce a relational vector-space (VS) model (in analogy to the VS model used in information retrieval) that abstracts the linked structure,...
Aggregation-Based Feature Invention and Relational (2003)
Perlich, Claudia, Provost, Foster
Due to interest in social and economic networks, relational modeling is attracting increasing attention. The field of relational data mining/learning, which traditionally was dominated by logic-based...
Confidence Bands for ROC Curves (2003)
Macskassy, Sofus A., Provost, Foster, Littman, Michael L.
We address the problem of comparing the performance of classifiers. In this paper we study techniques for generating and evaluating bands on ROC curves. Historically this has been done using...
Tree induction for probability-based ranking (2003)
Foster Provost, Pedro Domingos
Abstract. Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class...
Tree Induction vs. Logistic Regression: A Learning-curve Analysis (2003)
Claudia Perlich, Foster Provost, Jeffrey S. Simonoff, William Cohen
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...
A Simple Relational Classifier (2003)
Sofus A. Macskassy, Foster Provost
We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent attributes. We...
Learning When Training, Gary M. Weiss, Foster Provost
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples...
Tree Induction vs. Logistic Regression: A Learning-Curve Analysis (2003)
Claudia Perlich, Foster Provost, Jeffrey S. Simonoff, William Cohen
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...
A simple relational classifier (2003)
Sofus A. Macskassy, Foster Provost
Abstract. We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent...
on Statistical Relational Learning and its Connections to Other Fields (2003)
Tom Dietterich, David Heckerman, Foster Provost
This workshop is the third in a series of workshops held in conjunction with AAAI and IJCAI. The first workshop was held in July, 2000 at AAAI. Notes from that workshop are available at
Learning when training data are costly: The effect of class distribution on tree induction (2003)
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples...
Tree Induction vs. Logistic Regression: Learning Curve Analysis (2003)
Claudia Perlich, Foster Provost, Jeffrey S. Simonoff
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...
The relational vector-space model and industry classification (2003)
Abraham Bernstein, Scott Clearwater, Foster Provost
This paper addresses the classification of linked entities. We introduce a relational vector-space (VS) model (in analogy to the VS model used in information retrieval) that abstracts the linked...
Tree Induction vs. Logistic Regression: A Learning-curve Analysis (2003)
Jeffrey S. Simonoff, Claudia Perlich, Foster Provost, ...
Relational Learning Problems and Simple Models (2003)
Foster Provost, Claudia Perlich, Sofus A. Macskassy
In recent years, we have seen remarkable advances in algorithms for relational learning, especially statistically based algorithms. These algorithms have been developed in a wide variety of different...
Intelligent Assistance for the Data Mining Process: An Ontology-based Approach (2002)
Bernstein, Abraham, Hill, Shawndra, Provost, Foster
A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There...
Discovering Knowledge from Relational Data Extracted from Business News (2002)
Bernstein, Abraham, Clearwater, Scott, Hill, Shawndra, Perlich, Claudia, Provost, Foster
Thousands of business news stories (including press releases, earnings reports, general business news, etc.) are released each day. Recently, information technology advances have partially automated...
Intelligent Assistance for the Data Mining Process: . . . (2002)
Abraham Bernstein, Foster Provost, Shawndra Hill
A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There...
An intelligent assistant for the knowledge discovery process (2002)
Abraham Bernstein, Foster Provost
Abstract. A knowledge discovery (KD) process involves preprocessing data, choosing...
Tree Induction vs. Logistic Regression: A Learning-Curve Analysis (2001)
Perlich, Claudia, Provost, Foster, Simonoff, Jeffrey S.
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree...
Tree Induction vs Logistic Regression: A Learning Curve Analysis (2001)
Perlich, Claudia, Provost, Foster, Simonoff, Jeffrey S.
Statistics Working Papers Series
Robust Classification for Imprecise Environments (2001)
In real-world environments it usually is difficult to specify target operating conditions precisely, for example, target misclassification costs. This uncertainty makes building robust classification...
An Intelligent Assistant for the Knowledge Discovery Process (2001)
Bernstein, Abraham, Provost, Foster
A knowledge discovery (KD) process involves preprocessing data, choosing a data-mining algorithm, and postprocessing the mining results. There are very many choices for each of these stages, and non-trivial...
Active Sampling for Class Probability Estimation and Ranking (2001)
Saar-Tsechansky, Maytal, Provost, Foster
In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class...
Active Learning for Decision-Making (2001)
Maytal Saar-tsechansky, Foster Provost
This paper addresses focused information acquisition for predictive data mining. As businesses strive to cater to the preferences of individual consumers, they often employ predictive models to...
Active learning for class probability estimation and ranking (2001)
Maytal Saar-tsechansky, Foster Provost
For many supervised learning tasks it is very costly to produce training data with class labels. Active learning acquires data incrementally, at each stage using the model learned so far to help...
Intelligent Information Triage (2001)
Sofus A. Macskassy, Haym Hirsh, Foster Provost, Ramesh Sankaranarayanan, Vasant Dhar
In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective...
Applications of data mining to electronic commerce (2001)
Electronic commerce is emerging as the killer domain for data mining technology. Is there support for such a bold statement? Data mining technologies have been around for decades, without moving...
The Effect of Class Distribution on Classifier Learning: An Empirical Study (2001)
In this article we analyze the effect of class distribution on classifier learning. We begin by describing the different ways in which class distribution affects learning and how it affects the...
The Effect of Class Distribution on Classifier Learning: An Empirical Study (2001)
Many of today's large data sets must be reduced in size before invoking inductive algorithms, due to the costs associated with procuring/processing the data, and because most of these algorithms...
Information Triage using Prospective Criteria (2001)
Sofus A. Macskassy, Haym Hirsh, Foster Provost, Ramesh Sankaranarayanan, Vasant Dhar
In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective...
Active Sampling for Class Probability Estimation and Ranking (2001)
Maytal Saar-tsechansky, Foster Provost
In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class...
Applications of Data Mining to Electronic Commerce (2000)
Electronic commerce is emerging as the killer domain for data mining technology. The following are five desiderata for success. Seldom are they all present in one data mining application. 1....
Robust Classification for Imprecise Environments (2000)
In real-world environments it usually is difficult to specify target operating conditions precisely, for example, target misclassification costs. This uncertainty makes building robust classification...
Dhar, Vasant, Chou, Dashin, Provost, Foster
Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open-ended by forcing us to consider...
Variance-based Active Learning (2000)
Saar-Tsechansky, Maytal, Provost, Foster
For many supervised learning tasks, the cost of acquiring training data is dominated by the cost of class labeling. In this work, we explore active learning for class probability estimation (CPE)....
Vasant Dhar, Dashin Chou, Foster Provost
Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open ended by forcing us to consider...
Robust classification for imprecise environments (2000)
In real-world environments it usually is difficult to specify target operating conditions precisely, for example, target misclassification costs. This uncertainty makes building robust classification...
Vasant Dhar, Dashin Chou, Foster Provost
Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open-ended by forcing us to consider...
Activity monitoring: Noticing interesting changes in behavior (1999)
We introduce a problem class which we term activity monitoring. Such problems involve monitoring the behavior of a large population of entities for interesting events requiring action. We present a...
Rule-space Search for Knowledge-based Discovery (1999)
Foster Provost, John M. Aronis, Bruce G. Buchanan
Because the knowledge discovery process is ill-defined, iterative, and requires intense interaction, algorithm flexibility is crucial. In this paper, we present a straightforward, heuristic...
Efficient Progressive Sampling (1999)
Foster Provost, David Jensen, Tim Oates
Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the...
A Survey of Methods for Scaling Up Inductive Algorithms (1999)
Foster Provost, Venkateswarlu Kolluri
One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work...
Problem Definition, Data Cleaning, and Evaluation: A Classifier Learning Case Study (1999)
Foster Provost, Andrea Pohoreckyj Danyluk
This paper is a case study of this process based on a long-term project addressing the automatic dispatch of technicians to fix faults in the local loop of a telephone network. The bottom line of the...
A Survey of Methods for Scaling Up Inductive Algorithms (1999)
Foster Provost, Venkateswarlu Kolluri, Usama Fayyad
One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. By collecting, categorizing, and summarizing existing work on...
Distributed Data Mining: Scaling up and beyond (1999)
In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey...
Distributed Fault Tolerant Embeddings of Binary Trees in Hypercubes. (1998)
This paper presents a distributed algorithm for embedding binary trees in hypercubes. Starting with the root (invoked in some cube node by a host), each node is responsible for determining the...
Classification in Networked Data: A Toolkit and a Univariate Case Study (1998)
Macskassy, Sofus A., Provost, Foster
This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to a collection of networked data sets used in prior machine learning research....
Monitoring Business Activity (1998)
Provost, Foster, Macskassy, Sofus
Under this project, the authors studied and developed technologies to "score" entities to build models that will produce an estimate of the likelihood that an entity exhibits some characteristic. For...
The Case Against Accuracy Estimation for Comparing Induction Algorithms (1998)
We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and...
Pharmacophore discovery using the inductive logic programming system PROGOL (1998)
Paul Finn, David Page, Ronny Kohavi, Foster Provost
Abstract. This paper is a case study of a machine aided knowledge discovery process within the general area of drug design. More specifically, the paper describes a sequence of experiments in which...
On Applied Research in Machine Learning (1998)
obstacles that impede their practical application. Most often these obstacles take the form of restrictive simplifying assumptions commonly made in research. Consider as an example the assumption,...
The Case Against Accuracy Estimation for Comparing Induction Algorithms (1998)
Foster Provost, Tom Fawcett, Ron Kohavi
We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and...
Pharmacophore Discovery using the Inductive Logic Programming System Progol (1998)
Paul Finn, David Page, Ron Kohavi, Foster Provost
This paper presents a case study of a machine-aided knowledge discovery process within the general area of drug design. Within drug design, the particular problem of pharmacophore discovery is...
Robust Classification Systems for Imprecise Environments (1998)
In real-world environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We show that it is...
Pharmacophore Discovery using the Inductive Logic Programming System Progol (1998)
Paul Finn, David Page, Ronny Kohavi, Foster Provost
This paper is a case study of a machine aided knowledge discovery process within the general area of drug design. More specifically, the paper describes a sequence of experiments in which an...
Machine Learning for the Detection of Oil Spills in Satellite Radar Images (1998)
Miroslav Kubat, Robert C. Holte, Stan Matwin, Ron Kohavi, Foster Provost
During a project examining the use of machine learning techniques for oil spill detection, we encountered several essential questions that we believe deserve the attention of the research community....
Learning in the 'Real World' (1998)
Lorenza Saitta, Filippo Neri, Ronny Kohavi, Foster Provost
In this paper we define and characterize the process of developing a "real-world" Machine Learning application, with its difficulties and relevant issues, distinguishing it from the...
Robust Classification Systems for Imprecise Environments (1998)
In real-world environments, it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We show that it is...
Adaptive fraud detection. Data Mining and Knowledge Discovery (1997)
One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection,...
Scaling Up inductive algorithms: An overview (1997)
Foster Provost, Venkateswarlu Kolluri
This paper establishes common ground for researchers addressing the challenge of scaling up inductive data mining algorithms to very large databases, and for practitioners who want to understand the...
Applications of inductive learning algorithms to real-world data mining problems have shown repeatedly that using accuracy to compare classifiers is not adequate because the underlying assumptions...
Adaptive Fraud Detection (1997)
One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection, using a...
A Survey of Methods for Scaling Up Inductive Learning Algorithms (1997)
Foster J. Provost, Venkateswarlu Kolluri
Each year, one of the explicit challenges for the KDD research community is to develop methods that facilitate the use of inductive learning algorithms for mining very large databases. By...
Combining Data Mining and Machine Learning for Effective User Profiling (1996)
This paper describes the automatic design of methods for detecting fraudulent behavior. Much of the design is accomplished using a series of machine learning methods. In particular, we combine data...
Robust Classification for Imprecise Environments (1989)
In real-world environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We present a method for...
Rule-space search for knowledge-based discovery (1001)
Foster Provost, John M. Aronis, Bruce G. Buchanan
Because the knowledge discovery process is ill-defined, iterative, and requires intense interaction, algorithm flexibility is crucial. In this paper, we present a straightforward, heuristic...