Efficient Computation of Personal Aggregate Queries on (2009)
Ka Cheung Sia, Junghoo Cho, Yun Chi, Belle L. Tseng
There is an exploding amount of user-generated content on the Web due to the emergence of “Web 2.0” services, such as Blogger, MySpace, Flickr, and del.icio.us. The participation of a large...
On the brink: Searching for drops in sensor data (2009)
Gong Chen, Junghoo Cho, Mark H. Hansen
Sensor networks have been widely used to collect data about the environment. When analyzing data from these systems, people tend to ask exploratory questions—they want to find subsets of data,...
Personalization of web search results as a technique for improving user satisfaction has received notable attention in the research community over the past decade. Much of this work focuses on...
Algorithms, Experimentation (2009)
Ka Cheung Sia, Junghoo Cho, Yun Chi
There is an exploding amount of user-generated content on the Web due to the emergence of “Web 2.0” services, such as Blogger, MySpace, Flickr, and del.icio.us. The participation of a large...
RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee (2008)
Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the unbounded number of pages available on the Web,...
On the brink: Searching for drops in sensor data (2008)
Sensor networks have been widely used to collect data about the environment. When analyzing data from these systems, people tend to ask exploratory questions—they want to find subsets of data,...
In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in...
Page Quality: In Search of an Unbiased Web Ranking (2008)
Junghoo Cho, Robert E. Adams
This research is motivated by the dominance of the Google search engine and the bias that it may introduce to the users’ perception of the Web. According to a recent study, 75% of keyword...
Junghoo Cho, Hector Garcia-molina
What pages should the crawler download? How should the crawler refresh pages? How should the load on the visited Web sites be minimized? How should the crawling process be...
RankMass Crawler: A Crawler with High PageRank Coverage Guarantee (2008)
Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the infinite number of pages available on the Web,...
Microsoft Search Labs and (2008)
Panagiotis G. Ipeirotis, Junghoo Cho, Luis Gravano
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency,...
Understanding Pollution Dynamics in P2P File Sharing (2008)
Uichin Lee, Min Choi, Junghoo Cho, M. Y. Sanadidi, Mario Gerla
Pollution in P2P file sharing occurs when a large number of decoy files are injected into the P2P system. Since peers “serve” each other in the P2P file sharing system, it is obvious that...
Stanford WebBase Components and Applications (2008)
Junghoo Cho, Hector Garcia-molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, Sriram Raghavan, ...
We describe the design and performance of WebBase, a tool for Web research. The system includes a highly customizable crawler, a repository for collected Web pages, an indexer for both text and...
Monitoring RSS Feeds Based on User Browsing Pattern (2008)
RSS has been widely used to disseminate information on the Web over the years. With the help of RSS feed readers, a user may subscribe to the feeds that are published by her favorite blogs, news...
Monitoring RSS Feeds based on User Browsing Pattern (2008)
RSS has been widely used to disseminate information on the Web over the years. With the help of RSS feed readers, a user may subscribe to the feeds that are published by her favorite blogs, news...
Orkut Buyukkokten, Luis Gravano, Junghoo Cho, Hector Garcia-molina, Narayanan Shivakumar
Many information resources on the web are relevant primarily to limited geographical communities. For instance, web sites containing information on restaurants, theaters, and apartment rentals are...
Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, Luis Gravano
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency,...
Modeling and Managing Content Changes in Text Databases (2008)
Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, Luis Gravano
Large amounts of (often valuable) information are stored in web-accessible text databases. "Metasearchers" provide unified interfaces to query multiple such databases at once. For...
Analysis of User Web Traffic with a Focus on Search Activities (2008)
Feng Qiu, Zhenyu Liu, Junghoo Cho
Although search engines are playing an increasingly important role in users' Web access, our understanding is still limited regarding the magnitude of search-engine influence. For example, how...
Crawling for Images on the WWW (2008)
Junghoo Cho, Sougata Mukherjea
Search engines are useful because they allow the user to find information of interest from the World-Wide Web. These engines use a crawler to gather information from Web sites. However, with the...
Crawler-Friendly Web Servers (2007)
Onn Brandman, Junghoo Cho, Hector Garcia-molina, Narayanan Shivakumar
In this paper we study how to make web servers (e.g., Apache) more crawler friendly. Current web servers offer the same interface to crawlers and regular web surfers, even though crawlers and surfers...
Modeling and Managing Changes in Text Databases (2007)
Ipeirotis, Panagiotis, Ntoulas, Alexandros, Cho, Junghoo, Gravano, Luis
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency,...
Sensor-Internet Share and Search–Enabling Collaboration of Citizen Scientists (2007)
Sasank Reddy, Gong Chen, Brian Fulkerson, Sung Jin Kim, Unkyu Park, Nathan Yau, ...
Over the last decade, embedded sensing systems have been successfully deployed in a range of application areas, from education and science to military and industry. These systems are becoming more...
Modeling and Managing Content Changes in Text Databases (2006)
Ipeirotis, Panagiotis G., Ntoulas, Alexandros, Cho, Junghoo, Gravano, Luis
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency,...
Modeling and Managing Changes in Text Databases (2006)
Ipeirotis, Panagiotis G., Ntoulas, Alexandros, Cho, Junghoo, Gravano, Luis
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency,...
Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results (2005)
Pandey, Sandeep, Roy, Sourashis, Olston, Christopher, Cho, Junghoo, Chakrabarti, Soumen
In-degree, PageRank, number of visits and other measures of Web page popularity significantly influence the ranking of search results by modern search engines. The assumption is that popularity is...
Shuffling a stacked deck: the case for partially randomized ranking of search engine results (2005)
Sandeep Pandey, Sourashis Roy, Christopher Olston, Junghoo Cho, Soumen Chakrabarti
In-degree, PageRank, number of visits and other measures of Web page popularity significantly influence the ranking of search results by modern search engines. The assumption is that popularity is...
Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results (2005)
Sandeep Pandey, Sourashis Roy, Christopher Olston, Junghoo Cho, Soumen Chakrabarti
In-degree, PageRank, number of visits and other measures of Web page popularity significantly influence the ranking of search results by modern search engines. The assumption is that popularity is...
The Infocious Web Search Engine: Improving Web Searching through Linguistic Analysis (2005)
Alexandros Ntoulas, Gerald Chao, Junghoo Cho
In this paper we present the Infocious Web search engine [23]. Our goal in creating Infocious is to improve the way people find information on the Web by resolving ambiguities present in natural...
Automatic Identification of User Goals in Web Search (2005)
Uichin Lee, Zhenyu Liu, Junghoo Cho
There have been recent interests in studying the "goal" behind a user's Web query, so that this goal can be used to improve the quality of a search engine's results. Previous...
Cost-Efficient Processing of Min/Max Queries over Distributed Sensors with Uncertainty (2005)
Zhenyu Liu, Ka Cheung Sia, Junghoo Cho
The rapid development in micro-sensors and wireless networks has made large-scale sensor networks possible. However, the wide deployment of such systems is still hindered by their limited energy...
Efficient Monitoring Algorithm for Fast News Alert (2005)
use of XML data to deliver information over the Web. Personal weblogs, news Web sites, and discussion forums are now publishing RSS feeds for their subscribers to retrieve new postings. While the...
A Study On The Evolution Of The Web (2005)
Alexandros Ntoulas, Junghoo Cho, Hyun Kyu Cho, Hyeonsung Cho, Young-jo Cho
In this paper, we study the evolution of the Web from the perspective of a search engine, so that we can get a better understanding on how search engines should cope with the evolving Web. We believe...
Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results (2005)
Sandeep Pandey, Sourashis Roy, Christopher Olston, Junghoo Cho, Soumen Chakrabarti
In-degree, PageRank, number of visits and other measures of Web page popularity significantly influence the ranking of search results by modern search engines. The assumption is that popularity is...
Modeling and managing content changes in text databases (2005)
Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency,...
The infocious web search engine: Improving web searching through linguistic analysis (2005)
Alexandros Ntoulas, Gerald Chao, Junghoo Cho
In this paper we present the Infocious Web search engine [23], which currently indexes more than 2 billion pages collected from the Web. The main goal of Infocious is to enhance the way that people...
Modeling and managing content changes in text databases (2005)
Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, Luis Gravano
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency,...
Analysis of user web traffic with a focus on search activities (2005)
Feng Qiu, Zhenyu Liu, Junghoo Cho
Although search engines are playing an increasingly important role in users’ Web access, our understanding is still limited regarding the magnitude of search-engine influence. For example, how...
Modeling and Managing Content Changes in Text Databases (2004)
Ipeirotis, Panagiotis G., Ntoulas, Alexandros, Cho, Junghoo, Gravano, Luis
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency,...
What’s New on the Web? The Evolution of the Web from a Search Engine Perspective (2004)
Ntoulas, Alexandros, Cho, Junghoo, Olston, Christopher
We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected...
Impact Of Search Engines On Page Popularity (2004)
Recent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In...
Impact of Search Engines on Page Popularity (2004)
Recent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In...
A probabilistic approach to metasearching with adaptive probing (2004)
Zhenyu Liu, Chang Luo, Junghoo Cho, Wesley W. Chu
An ever increasing amount of valuable information is stored in Web databases, “hidden” behind search interfaces. To save the user’s effort in manually exploring each database, metasearchers...
Dpro: A probabilistic approach for hidden web database selection using dynamic probing (2004)
Victor Z. Liu, Richard C. Luo, Junghoo Cho, Wesley W. Chu
An ever increasing amount of valuable information is stored in Web databases, “hidden” behind search interfaces. To save the user’s effort in manually exploring each database, metasearchers...
What’s New on the Web? The Evolution of the Web from a Search Engine Perspective (2004)
Alexandros Ntoulas, Junghoo Cho, Christopher Olston
We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected...
What's New on the Web? The Evolution of the Web from a Search Engine Perspective (2004)
Alexandros Ntoulas, Junghoo Cho, Christopher Olston
In this paper, we seek to get a better insight on how the search engines should cope with the evolving Web, in an attempt to provide users with up-to-date results. In this respect, we have crawled...
Downloading Hidden Web Content (2004)
Alexandros Ntoulas, Petros Zerfos, Junghoo Cho
An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from...
Effective page refresh policies for web crawlers (2003)
Junghoo Cho, Hector Garcia-molina
In this paper we study how we can maintain local copies of remote data sources “fresh,” when the source data is updated autonomously and independently. In particular, we study the problem of Web...
Page Quality: In Search of an Unbiased Web Ranking (2003)
This research is motivated by the dominance of the Google search engine and the bias that it may introduce to the users' perception of the Web. According to a recent study, 75% of keyword...
Page Quality: In Search of an Unbiased Web Ranking (2003)
Junghoo Cho, Sourashis Roy, Robert E. Adams
In a number of recent studies [4, 8] researchers have found that because search engines repeatedly return currently popular pages at the top of search results, popular pages tend to get even more...
Cho, Junghoo, Garcia-Molina, Hector
In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in...
A fast regular expression indexing engine (2002)
In this paper, we describe the design, architecture, and the lessons learned from the implementation of a fast regular expression indexing engine FREE. FREE uses a prebuilt index to identify the text...
Arvind Arasu, Junghoo Cho, Hector Garcia-molina, Andreas Paepcke, Sriram Raghavan
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage,...
Arvind Arasu, Junghoo Cho, Hector Garcia-molina, Andreas Paepcke, Sriram Raghavan
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage,...
The evolution of the web and implications for an incremental crawler (2000)
Junghoo Cho, Hector Garcia-molina
In this paper we study how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically...
Finding replicated web collections (2000)
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-molina
Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we...
Beyond document similarity: Understanding value-based search and browsing technologies (2000)
Andreas Paepcke, Hector Garcia-molina, Gerard Rodriguez-mula, Junghoo Cho
In the face of small, one or two word queries, high volumes of diverse documents on the Web are overwhelming search and ranking technologies that are based on document similarity measures. The...
The evolution of the web and implications for an incremental crawler (2000)
Junghoo Cho, Hector Garcia-molina
In this paper we study how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically...
Finding replicated web collections (2000)
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-molina
Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In...
Crawler-Friendly Web Servers (2000)
Onn Brandman, Junghoo Cho, Hector Garcia-molina, Narayanan Shivakumar
In this paper we study how to make web servers (e.g., Apache) more crawler friendly. Current web servers offer the same interface to crawlers and regular web surfers, even though crawlers and surfers...
Estimating Frequency of Change (2000)
Junghoo Cho, Hector Garcia-molina
Many online data sources are updated autonomously and independently. In this paper, we make the case for estimating the change frequency of the data, to improve web crawlers, web caches and to help...
Finding Replicated Web Collections (2000)
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-molina
Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we...
Synchronizing a database to Improve Freshness (2000)
Junghoo Cho, Hector Garcia-molina
In this paper we study how to refresh a local copy of an autonomous data source to maintain the copy up-to-date. As the size of the data grows, it becomes more difficult to maintain the copy...
Junghoo Cho, Hector Garcia-molina
In this paper, we make the case for estimating the change frequency of data to improve Web crawlers, Web caches and to help data mining. We first identify various scenarios, where different applications...
Exploiting geographical location information of web pages (1999)
Orkut Buyukkokten, Junghoo Cho, Hector Garcia-molina, Luis Gravano, Narayanan Shivakumar
Many information sources on the web are relevant primarily to specific geographical communities. For instance, web sites containing information on restaurants, theatres and apartment rentals are...
Exploiting Geographical Location Information of Web Pages (1999)
Orkut Buyukkokten, Junghoo Cho, Hector Garcia-molina, Luis Gravano, Narayanan Shivakumar
Many information resources on the web are relevant primarily to limited geographical communities. For instance, web sites containing information on restaurants, theaters, and apartment rentals are...
Exploiting Geographical Location Information of Web Pages (1999)
Orkut Buyukkokten, Junghoo Cho, Hector Garcia-molina, Luis Gravano, Narayanan Shivakumar
Many information sources on the web are relevant primarily to specific geographical communities. For instance, web sites containing information on restaurants, theatres and apartment rentals are...
Journal of Visual Languages and Computing (1999) 10, 585–606 (1999)
Sougata Mukherjea, Junghoo Cho
In this paper, we describe the problems in determining the semantics of images on the WWW and the approach of AMORE, a WWW search engine that we have developed. AMORE's techniques can be extended...
Efficient crawling through URL ordering (1998)
Junghoo Cho, Hector Garcia-molina
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more “important” pages first. Obtaining important pages rapidly can be very useful when a...
Efficient Crawling Through URL Ordering (1998)
Junghoo Cho, Hector Garcia-molina, Lawrence Page
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when...