Ricardo Baeza-Yates

Generalizing pagerank: Damping functions for linkbased ranking algorithms (2009)

Ricardo Baeza-yates, Paolo Boldi, Carlos Castillo

This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with distance, so a direct...

1 The Choice of a Damping Function for Propagating Importance in Link-Based Ranking (2009)

Ricardo Baeza-yates, Depto De Tecnología, Paolo Boldi, Carlos Castillo, Cátedra Telefónica, ...

This paper studies a family of link-based algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with the distance, so a direct link...

Zone Country Domain Zone Country Domain (2009)

Carlos Castillo, Ricardo Baeza-yates

Abstract. We show that the Kendall’s τ metric over the ranked lists of commercial partners between European countries, induces a distance function that is very useful for visualizing the European...

���� � A Model for Web Mining Applications – Conceptual Model, Architecture, Implementation and Use Cases (2009)

Departamento De, Ciência Da, Álvaro Rodrigues, Pereira Júnior, Ricardo Baeza-yates, Álvaro Pereira, ...

Web mining is a computation intensive task even after the mining tool itself has been developed. However, most mining software is developed ad-hoc and usually is not scalable nor reused for other...

References (2009)

Ricardo Baeza-yates, Rosie Jones

• evaluation [38] data collections available [26] [69] [24]

Applications of web query mining (2009)

Ricardo Baeza-yates

Abstract. Server logs of search engines store traces of queries submitted by users, which include queries themselves along with Web pages selected in their answers. The same is true in Web site logs...

Departamento de Engenharia Eletr^onica (2009)

Ricardo Baeza-yates, Eduardo Fernandes Barbosa, Minas Gerais, Nivio Ziviani, Minas Gerais

Abstract The objective of this paper is to present an efficient implementation of a recently known index for text databases presented in the literature, when the database is stored on read-only...

Optimal Binary Search Trees with Costs Depending on the Access Paths (2009)

Jayme L Szwarcfiter, Gonzalo Navarro, Jo'isa S. Oliveira, Ricardo Baeza-yates, Walter Cunto

We describe algorithms for constructing optimal binary search trees, in which the access cost to each key depends on the k preceeding keys which were reached in the path to the desired key. Two kinds...

Theory vs. Practice (2008)

Ricardo Baeza-yates

[19] and Navarro et al. [23], and own work and views.

WIM: an Information Mining Model for the Web (2008)

Ricardo Baeza-yates

This paper presents an extended abstract of a model to mine information in applications involving Web and graph analysis, referred to as WIM (Web Information Mining) Model. We demonstrate the model...

ABSTRACT Genealogical Trees on the Web: A Search Engine User Perspective (2008)

Ricardo Baeza-yates

This paper presents an extensive study about the evolution of textual content on the Web, which shows how some new pages are created from scratch while others are created using already existing...

Algorithms, Experimentation (2008)

Ricardo Baeza-yates, Vanessa Murdock

In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching...

DEFINITION (2008)

Djoerd Hiemstra, Ricardo Baeza-yates

Structured text retrieval models provide a formal definition or mathematical framework for querying semistructured textual databases. A textual database contains both content and structure. The...

Oral Contributions FIRST NAME LAST NAME TITLE ABSTRACT (2008)

Sebastian Ahnert, Daniel Amit, Ricardo Baeza-yates

Applying weighted network measures to distance matrices Many approaches to the analysis of weighted networks are not designed for fully connected weighted networks. However, as any distance matrix...

� Query Languages (2008)

Sihem Amer-yahia, Mariano Consens, Ricardo Baeza-yates, Mounia Lalmas

� DB focused on languages, expressiveness and efficient evaluation � IR focused on scoring and relevance metrics � In practice, p a limited set of operations p and simple ranking go a long way...

TITLE: Algorithmic Challenges in Web Search Engines (2008)

Ricardo Baeza-yates, Jon Bentley, Sotiris Nikoletseas

We present the main algorithmic challenges that large Web search engines face today. These challenges are present in all the modules of a Web retrieval system, ranging from the gathering of the data...

AND (2008)

Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-yates, José Luis Marroquín

The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from...

AFaster Algorithm for Approximate String Matching Extended Abstract (2008)

Ricardo Baeza-yates, Gonzalo Navarro

We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a non-deterministic nite automaton built from the pattern and using the text as input....

Understanding Content Reuse on the Web: Static and Dynamic Analyses (2008)

Ricardo Baeza-yates, Álvaro Pereira, Nivio Ziviani

Abstract. In this paper we present static and dynamic studies of duplicate and near-duplicate documents in the Web. The static and dynamic studies involve the analysis of similar content among pages...

Query Sessions (2008)

David Nettleton, Liliana Calderón, Presented Álvaro, R. Pereira, Ricardo Baeza-yates

♦ Problem to be solved: identification of user profiles (informational, navigational or transactional) and quality indicators for query sessions from the diversity of queries made by the users and...

Web Search (2008)

Ricardo Baeza-yates, Carlos Castillo

The World Wide Web has become in a few years into the largest cultural endeavour of all times. The Web is a distributed repository of information without a central point of control, and thus can be...

Web Log Mining in Search Engines (2008)

Ricardo Baeza-yates

Search engine logs not only keep navigation information, but also the queries made by their users. In particular, queries to a search engine follow a power-law distribution, which is far from...

WIM: an Information Mining Model for the Web (2008)

Ricardo Baeza-yates, Álvaro R. Pereira, Nivio Ziviani

This paper presents a model to mine information in applications involving Web and graph analysis, referred to as WIM – Web Information Mining – model. We demonstrate the model characteristics...

Categories and Subject Descriptors: J.4 [Social and Behavioral Sciences]: Economics; H.4.m [Information Systems]: Miscellaneous (2008)

Ricardo Baeza-yates, Carlos Castillo

We report on observations on Web characterization studies that suggest that the amount of Web links among sites under different country-code top-level domains is related to the amount of trade...

Associate Editors (2008)

Masaru Kitsuregawa, Betty Salzberg, Gonzalo Navarro, Ricardo Baeza-yates, Erkki Sutinen, Jorma Tarhio, ...

IntegratingDiverseInformationManagementSystems:ABriefSurvey..................................

Abstract (2008)

Ricardo Baeza-yates, Paolo Boldi, Carlos Castillo

This paper introduces a family of link-based ranking algorithms that propagate page importance through links. The algorithms include a damping function which decreases with distance, thus a direct...

Link Analysis in National Web Domains (2008)

Ricardo Baeza-yates

The Web can be seen as a graph in which every page is a node, and every hyper-link between two pages is an edge. This Web graph forms a scale-free network: a graph in which the distribution of the...

Dipartimento di Scienze dell'Informazione (2008)

Rapporto Interno The, Ricardo Baeza-yates, Paolo Boldi, Carlos Castillo, Ricardo Baeza-yates, Depto De Tecnología, ...

This paper studies a family of link-based algorithms that propagate page importance through links.

The Choice of a Damping Function (2008)

For Propagating Importance, Ricardo Baeza-yates, Depto De Tecnología, Paolo Boldi, Carlos Castillo, Cátedra Telefónica, ...

This paper studies a family of link-based algorithms that propagate page importance through links.

The EC Query Language Applied to Old Manuscripts (2007)

Jesús Vegas, Ricardo Baeza-yates

We show the possibilities of the EC query language in a very structured environment as a catalog of old manuscripts. The EC language can deal with simple queries and with more complex ones, as...

Compressed Pattern Matching Approximate Compressed Pattern Matching Direct Compressed Pattern Matching (2007)

Gonzalo Navarro, Nivio Ziviani, Ricardo Baeza-Yates

We present a fast compression and decompression technique for natural language texts. The novelty is that the exact search can be done on the compressed text directly, using any known sequential...

A New Priority Queue for Simulation of Many Objects (2007)

Ricardo Baeza-yates, Mauricio Marín

During the discrete event simulation of complex systems based on many active/passive objects, the efficiency of the algorithm and data structure used to manage the events of the process is crucial....

Proximity Matching Using Fixed-Queries Trees (2007)

Ricardo Baeza-yates, Walter Cunto, Udi Manber, Sun Wu

. We present a new data structure, called the fixed-queries tree, for the problem of finding all elements of a fixed set that are close, under some distance function, to a query element. Fixedqueries...

Handling Proximity for Text Problems (2007)

Ricardo Baeza-yates, Walter Cunto

Introduction Let P ` X \Theta Y be a set of pairs such that the finite sets X and Y contain n and m nonnegative integers respectively, and d be a nonnegative value. We define the set of close-pairs C...

About the Size of Boyer-Moore Automata (2007)

Véronique Bruyère, Ricardo Baeza-yates, Olivier Delgrange, Rodrigo Scheihing

. We study the size of Boyer-Moore automata introduced in Knuth, Morris & Pratt's famous paper on pattern matching. We experimentally exhibit a finite class of binary patterns, which produce...

A First Step to Formally Evaluate Collaborative Work (2007)

Ricardo Baeza-yates, Jos'e A. Pino

We present an initial attempt to formally evaluate performance measures related to CSCW applications. In particular, we study the relations between the quality of the output, the number of people and...

A Model for Visualizing Large Answers in WWW Retrieval (2007)

Omar Alonso, Ricardo Baeza-yates

In this paper we present a model for visualizing large collections of documents in World Wide Web retrieval, independently of the retrieval system. Our proposal allows to ease the use of...

Analysis Of An Improved Priority Queue For Discrete Event Simulation Of Many Moving Objects (2007)

Ricardo Baeza-yates, Blanco Encalada, Escuela De Computaci'on, Patricio Cordero

During the simulation of physical systems based on many moving objects, the efficiency of the algorithm and data structure used to manage the events of the process is crucial. Both, runtime and space...

Optimized Binary Search and Text Retrieval (2007)

Eduardo Fernandes, Eduardo Fernandes Barbosa, Gonzalo Navarro, Ricardo Baeza-yates, Chris Perleberg, Nivio Ziviani

We present an algorithm that minimizes the expected cost of indirect binary search for data with nonconstant access costs, such as disk data. Indirect binary search means that sorted access to the...

Analysis Of An Improved Priority Queue For Discrete Event Simulation Of Many Moving Objects (2007)

Ricardo Baeza-yates, Mauricio Marín, Blanco Encalada, Escuela De Computaci'on, Patricio Cordero

During the simulation of physical systems based on many moving objects, the efficiency of the algorithm and data structure used to manage the events of the process is crucial. Both runtime and space...

y (2007)

Gonzalo Navarro, Nivio Ziviani, Belo Horizonte Brasil, Belo Horizonte Brasil, Ricardo Baeza-yates

We present a fast compression and decompression technique for natural language texts. The novelty is that the exact search can be done on the compressed text directly, using any known sequential...

Optimal Binary Search Trees with Costs Depending on the Access Paths (2007)

Gonzalo Navarro, Ricardo Baeza-yates, Walter Cunto

We describe algorithms for constructing optimal binary search trees, in which the access cost of a key depends on the k preceding keys which were reached in the path to it. This problem has...

A New Model for Web Crawling (2007)

Carlos Castillo, Ricardo Baeza-yates, Blanco Encalada

Web crawlers have to deal with a lot of challenges at the same time, and some of them contradict each other. They must keep fresh copies of Web pages, so they have to re-visit some of them, but at...

Web Structure, Age and Page Quality (2007)

Ricardo Baeza-yates, Felipe Saint-jean, Carlos Castillo

This paper is aimed at the study of quantitative measures of the relation between Web structure, age, and quality ofWeb pages. Quality is studied from di erent link-based metrics and their...

A New Model for Web Crawling* (2007)

Carlos Castillo, Ricardo Baeza-yates

Web crawlers have to deal with a lot of challenges at the same time, and some of them contradict each other. They must keep fresh copies of Web pages, so they have to re-visit some of them, but at...

IMPLEMENTING CONSISTENT QUERY ANSWERS IN INCONSISTENT DATABASES (2007)

Leopoldo Bertossi D, Leopoldo Bertossi D, Pontificia Universidad, Pontificia Universidad, Católica De Chile, Católica De Chile, ...

It’s strange. The gulls who scorn per-fection for the sake of travel go nowhere, slowly. Those who put aside travel for the sake of perfection go anywhere, instantly. Chiang To my parents and my...

www.dcc.ufmg.br/~nivio (2007)

Gonzalo Navarro, Nivio Ziviani, Ricardo Baeza-yates

www.dcc.uchile.cl/~rbaeza Abstract We present a fast compression and decompression scheme for natural language texts that allows efficient and flexible string matching by searching the compressed...

Optimized Indirect Binary Search and Text Retrieval (Preliminary Version) (2007)

Gonzalo Navarro, Eduardo Fern, Ez Barbosa, Chris Perleberg, Ricardo Baeza-yates, Nivio Ziviani

We present an algorithm that minimizes the expected cost of indirect binary search for data with non-constant access costs (e.g. disk data). The term "indirect " indicates that...

Present and Future of the Informatics Profession Let’s Design Everything Again: Thoughts on Computing and Its Teaching (2007)

Ricardo Baeza-yates

“Man is the only animal to trip over the same stone twice” “El hombre es el único animal capaz de tropezar dos veces en la misma piedra” Spanish popular saying. “Everything is highly...

, and Jos'e L. Marroqu'in (2007)

Gonzalo Navarro, Ricardo Baeza-yates

Abstract. The indexing algorithms and data structures for similarity searching in metric spaces seem to emerge from a great diversity, and different approaches have been proposed and analyzed...

Depto. de Ciencias de la Computaci'on (2007)

Ricardo Baeza-yates, Gonzalo Navarro

We consider the recently proposed XQL language, which is designed to query XML documents by content and structure. We show that an already existing model, namely "Proximal Nodes",...

y (2007)

Gonzalo Navarro, Marden Neubert, Nivio Ziviani, Ricardo Baeza-yates

Abstract. Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval....

Optimized Binary Search and Text Retrieval 1 (2007)

Eduardo Fernandes Barbosa, Gonzalo Navarro, Ricardo Baeza-yates, Chris Perleberg, Nivio Ziviani

Abstract. We present an algorithm that minimizes the expected cost of indirect binary search for data with non-constant access costs, such as disk data. Indirect binary search means that sorted...

Searching in Metric Spaces (2007)

Gonzalo Navarro, Ricardo Baeza-yates

The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from...

Sydney ffl Singapore ffl Tokyo ffl Madrid (2007)

Ricardo Baeza-yates, Berthier Ribeiro-neto

and Associated Companies throughout the World. The rights of the authors of this Work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved....

Analysis of Link Based Ranking for the Web (2007)

Ricardo Baeza-yates, Carlos Castillo

In the last years, several techniques based in link analysis have been proposed and used in search engines to rank Web pages. As links are generated by humans, link based ranking seems to give better...

Challenges on distributed web retrieval (2007)

Ricardo Baeza-yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, Fabrizio Silvestri

In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on...

Applied Text Analytics for Blogs (2007)

Gilad Mishne, Academisch Proefschrift, Gilad Avraham Mishne, Promotor Prof. Dr, Maarten Rijke, Prof. Dr. Ricardo Baeza-yates, ...

ter verkrijging van de graad van doctor aan de Universiteit van Amsterdam op gezag van de Rector Magnificus prof.dr. J.W. Zwemmer ten overstaan van een door het college voor promoties ingestelde...

A study of mobile search queries in Japan (2007)

Ricardo Baeza-yates, Georges Dupret, Javier Velasco

In this paper we study the characteristics of search queries on mobile phones in Japan, comparing them with previous results of generic search queries in Japan and mobile search queries in the USA....

Challenges on distributed web retrieval (2007)

Ricardo Baeza-yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, Fabrizio Silvestri

In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on...

Admission policies for caches of search engine results (2007)

Ricardo Baeza-yates, Flavio Junqueira, Vassilis Plachouras, Hans Friedrich Witschel

Abstract. This paper studies the impact of the tail of the query distribution on caches of Web search engines, and proposes a technique for achieving higher hit ratios compared to traditional...

Link analysis for web spam detection (2007)

Luca Becchetti, Carlos Castillo, Debora Donato, Ricardo Baeza-yates, Stefano Leonardi

We propose link-based techniques for automating the detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The issue of...

Relationship Between Web Links and Trade (2006)

Ricardo Baeza-yates, Carlos Castillo

We report on observations on Web characterization studies that suggest that the amount of Web links among sites under di#erent country-code top-level domains is related to the amount of trade between...

Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms (2006)

Ricardo Baeza-yates, Paolo Boldi, Carlos Castillo

This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with distance, so a direct...

Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection (2006)

Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, Ricardo Baeza-Yates

This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in...

Link-Based Characterization and Detection of Web Spam (2006)

Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, Ricardo Baeza-Yates

We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through...

Crawling the Infinite Web (2005)

Ricardo Baeza-yates, Carlos Castillo

A large amount of the publicly available Web pages is generated dynamically upon request, and contain links to other dynamically generated pages. Many Web sites that are built with dynamic pages can...

Practical Issues of Crawling Large Web Collections (2005)

Carlos Castillo, Ricardo Baeza-yates

During large crawls of the Web, we have observed several anomalies in the implementation of the basic protocols by some Web sites. These anomalies impose costs on the design of a Web crawler and...

WIRE: an Open Source Web Information Retrieval Environment (2005)

Carlos Castillo, Ricardo Baeza-Yates

In this paper, we describe the WIRE (Web Information Retrieval Environment) project and focus on some details of its crawler component. The WIRE crawler is a scalable, highly configurable, high...

Link Analysis in National Web Domains (2005)

Ricardo Baeza-yates, Carlos Castillo

The Web can be seen as a graph in which every page is a node, and every hyper-link between two pages is an edge. This Web graph forms a scale-free network: a graph in which the distribution of the...

Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering (2005)

Ricardo Baeza-yates, Carlos Castillo, Mauricio Marin, Andrea Rodriguez

This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early"...

Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering Ricardo Baeza-Yates (2005)

Rbaeza Dcc Uchile, Ricardo Baeza-yates, Mauricio Marin

This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most ``important'' pages...

Characterization of national Web domains (2005)

Ricardo Baeza-yates, Carlos Castillo, Cátedra Telefónica, Efthimis N. Efthimiadis

During the last few years, several studies on the characterization of the public Web space of various national domains have been published. The pages of a country are an interesting set for studying...

Modeling user search behavior (2005)

Ricardo Baeza-yates, Carlos Hurtado, Marcelo Mendoza, Georges Dupret

Web usage mining is a main research area in Web mining focused on learning about Web users and their interactions with Web sites. Main challenges in Web usage mining are the application of data...

Abstract Dynamics of the Chilean Web structure (2005)

Ricardo Baeza-yates, Barbara Poblete

In this paper we present a large scale study on the evolution of the Web structure of the Chilean domain (.cl) from 2000 to 2004, focusing on the Web site transitions in the structure. This is the...

A website mining model centered on user queries (2005)

Ricardo Baeza-yates, Barbara Poblete

Abstract. We present a model for mining user queries found within the access logs of a website and for relating this information to the website’s overall usage, structure and content. The aim of...

A website mining model centered on user queries (2005)

Ricardo Baeza-yates, Barbara Poblete

Abstract. We present a model for mining user queries found within the access logs of a website and for relating this information to the website’s overall usage, structure and content. The aim of...

Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences (2005)

Ricardo Baeza-yates, Ro Salinger, Santiago Chile

Abstract. This work presents an experimental comparison of intersection algorithms for sorted sequences, including the recent algorithm of Baeza-Yates. This algorithm performs on average less...

Why Spectral Retrieval Works (2005)

Bast, Holger, Majumdar, Debapriyo, Marchionini, Gary, Moffat, Alistair, Tait, John, Baeza-Yates, Ricardo, ...

We introduce the \emph{synonymy graph} as a new angle of looking at spectral retrieval techniques, including latent semantic indexing (LSI) and its many successors. The synonymy graph is defined for...

Scheduling algorithms for Web crawling (2004)

Carlos Castillo, Mauricio Marin, Andrea Rodriguez, Ricardo Baeza-yates

Abstract--- This paper presents a comparative study of strategies for crawling the Web. In particular we show that a combination of breadth-first ordering with the largest sites first is a practical...

Content-based Image Retrieval and Characterization On Specific Web Collections (2004)

Ricardo Baeza-Yates, Javier Ruiz-del-Solar, Rodrigo Verschae, C. Castillo, Carlos Hurtado

One of the challenges in image and video retrieval is the content-based retrieval of images and videos in the web. Less work has been done in this area, mainly due to scalability issues. For this...

Comparing the Characteristics of the Chilean and the Greek Web (2004)

Ricardo Baeza-yates, Carlos Castillo, Efthimis N. Efthimiadis

This article summarizes the results of a comparison between the characteristics of two public Web spaces: the pages under the .GR (Greece) domain, and the pages under the .CL (Chile) domain. We show...

Web Dynamics, Structure, and Page Quality (2004)

Ricardo Baeza-yates, Carlos Castillo, Felipe Saint-jean

Introduction The purpose of a Web search engine is to provide an infrastructure that supports relationships between publishers of content and readers. In this context, as the numbers involved are...

Query recommendation using query logs in search engines (2004)

Ricardo Baeza-yates, Carlos Hurtado, Marcelo Mendoza

Abstract. In this paper we propose a method that, given a query submitted to a search engine, suggests a list of related queries. The related queries are based in previously issued queries, and can...

Crawling the Infinite Web: Five Levels Are Enough (2004)

Ricardo Baeza-yates, Carlos Castillo

A large amount of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites which can create...

Query Usage Mining in Search Engines (2004)

Ricardo Baeza-yates

Search engine logs not only keep navigation information, but also the queries made by their users. In particular, queries to a search engine follow a power-law distribution, which is far from...

Crawling the infinite Web: five levels are enough (2004)

Ricardo Baeza-yates, Carlos Castillo

Abstract. A large amount of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites which can...

Executive Summary (2004)

Ricardo Baeza-yates, Felipe Lalanne, Carlos Castillo, Georges Dupret

This report summarizes the results of a comparison between the characteristics of two public Web spaces: the pages under the.KR (South Korea) domain, and the pages under the.CL (Chile) domain. We...

Alternative implementation techniques for web text visualization (2003)

Omar Alonso, Ricardo Baeza-yates

We present an approach for building text visualizations that avoids using plug-ins or clients based on languages like Java. Instead we propose to make the search engine application more aware of the...

Matchsimile: A Flexible Approximate Matching Tool for Searching Proper Names (2003)

Gonzalo Navarro, Ricardo Baeza-yates, Azevedo Arcoverde

In this paper we present the architecture and algorithms behind Matchsimile, an approximate string matching lookup tool especially designed for human and company names searches against a large...

XML Standards (2003)

Ricardo Baeza-yates, Norbert Fuhr, Xml Standards

• plain XML • XML namespaces • DTDs and XML schema

Matchsimile: A Flexible Approximate Matching Tool for Searching Proper Names (2003)

Gonzalo Navarro, Ricardo Baeza-yates, Azevedo Arcoverde

In this paper we present the architecture and algorithms behind Matchsimile, an approximate string matching lookup tool especially designed for human and company names searches against a large...

Alternative implementation techniques for web text visualization (2003)

Omar Alonso, Ricardo Baeza-yates

We present an approach for building text visualizations that avoids using plug-ins or clients based on languages like Java. Instead we propose to make the search engine application more aware of the...

Lenguajes de consulta para xml: un análisis comparativo (2002)

Delgado Domínguez, Adelaida, Baeza-Yates, Ricardo

A query language for xml should be flexible enough to cover the whole range of information sources that can be labelled by xml, including databases and web documents. In this article we present a...

Web Structure, Dynamics and Page Quality (2002)

Ricardo Baeza-yates, Felipe Saint-jean, Carlos Castillo

Abstract. This paper is aimed at the study of quantitative measures of the relation between Web structure, page recency, and quality of Web pages. Quality is studied using different link-based...

Balancing volume, quality and freshness in web crawling (2002)

Ricardo Baeza-yates, Carlos Castillo

We describe a crawling software designed for high-performance, large-scale information discovery and gathering on the Web. This crawler allows the administrator to seek for a balance between the...

Indexing Methods for Approximate String Matching (2001)

Gonzalo Navarro, Ricardo Baeza-yates, Erkki Sutinen, Jorma Tarhio

Indexing for approximate text searching is a novel problem receiving much attention because of its applications in signal processing, computational biology and text retrieval, to name a few. We...

Searching in metric spaces (2001)

Gonzalo Navarro, Ricardo Baeza-yates

The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from...

Distributed query processing using partitioned inverted files (2001)

Claudine Badue, Ricardo Baeza-yates, Berthier Ribeiro-neto, Nivio Ziviani

In this paper, we study query processing in a distributed text database. The novelty is a real distributed architecture implementation that offers concurrent query service. The distributed system...

Distributed query processing using partitioned inverted files (2001)

Claudine Badue, Berthier Ribeiro-neto, Ricardo Baeza-yates, Nivio Ziviani

In this paper, we study query processing in a distributed text database. The novelty is a real distributed architecture implementation that offers concurrent query service. The distributed system...

New models and algorithms for multidimensional approximate pattern matching (2000)

Ricardo Baeza-yates, Gonzalo Navarro

ABSTRACT: We focus on how to compute the edit distance (or similarity) between two images and the problem of approximate string matching in two dimensions, that is, to find a pattern of size mm in a...

Compression: A key for next-generation text retrieval systems (2000)

Nivio Ziviani, Edleno Silva Moura, Gonzalo Navarro, Ricardo Baeza-yates

In this article we discuss recent methods for compressing the text and the index of text retrieval systems. By compressing both the complete text and the index, the total amount of space is less than...

Binary searching with non-uniform costs and its application to text retrieval (2000)

Gonzalo Navarro, Eduardo Fernandes Barbosa, Ricardo Baeza-yates, Walter Cunto, Nivio Ziviani

We study the problem of minimizing the expected cost of binary searching for data where the access cost is not fixed and depends on the last accessed element, such as data stored in magnetic or...

A hybrid indexing method for approximate string matching (2000)

Gonzalo Navarro, Ricardo Baeza-yates

ABSTRACT: We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting...

XQL and proximal nodes (2000)

Ricardo Baeza-yates, Gonzalo Navarro

Despite that several models to structure text documents and to query on this structure have been proposed in the past, a standard has emerged only relatively recently with the introduction of XML and...

Adding Compression to Block Addressing Inverted Indexes (2000)

Gonzalo Navarro, Marden Neubert, Nivio Ziviani, Ricardo Baeza-yates

Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text...

Faster approximate string matching (1999)

Ricardo Baeza-yates, Gonzalo Navarro

We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a non-deterministic finite automaton built from the pattern and using the text as...

A new indexing method for approximate string matching (1999)

Gonzalo Navarro, Ricardo Baeza-yates

Abstract. We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting...

Very fast and simple approximate string matching (1999)

Gonzalo Navarro, Ricardo Baeza-yates

Abstract We improve the fastest known algorithm for approximate string matching. This algorithm can only be used for low error levels. By using a new algorithm to verify potential matches and a new...

A uni model for similarity searching (1999)

Gonzalo Navarro, Ricardo Baeza-yates, Gonzalo Navarro, Ricardo Baeza-yates

Abstract. The indexing algorithms and data structures for similarity seaching in metric spaces seem to emerge from a great diversity, and different approaches have been proposed and analyzed...

New and faster filters for multiple approximate string matching. Random Structures and Algorithms (1998)

Ricardo Baeza-yates, Gonzalo Navarro

We present three new algorithms for on-line multiple string matching allowing errors. These are extensions of previous algorithms that search for a single pattern. The average running time achieved...

Fast multi-dimensional approximate pattern matching (1998)

Gonzalo Navarro, Ricardo Baeza-yates

Abstract. We address the problem of approximate string matching in d dimensions, that is, to find a pattern of size m

A practical q-gram index for text retrieval allowing errors (1998)

Gonzalo Navarro, Ricardo Baeza-yates

We propose an indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it is able to...

Fast approximate string matching in a dictionary (1998)

Ricardo Baeza-yates, Gonzalo Navarro

A successful technique to search large textual databases allowing errors relies on an online search in the vocabulary of the text. To reduce the time of that online search, we index the vocabulary as...

Improving an Algorithm for Approximate Pattern Matching (1998)

Gonzalo Navarro, Ricardo Baeza-yates

We study a recent algorithm for fast on-line approximate string matching. This is the problem of searching a pattern in a text allowing errors in the pattern or in the text. The algorithm is based on...

Fast Searching on Compressed Text Allowing Errors (1998)

Gonzalo Navarro, Nivio Ziviani, Ricardo Baeza-yates

We present a fast compression and decompression scheme for natural language texts that allows efficient and flexible string matching by searching the compressed text directly. The compression scheme...

A Model and a Visual Query Language for Structured Text (1998)

Ricardo Baeza-Yates, Gonzalo Navarro, Jesús Vegas, Depto De Inform'atica

We present a new model to query document databases by content and structure. The main merits of the model are: it allows rich structure in the documents; the query algebra is intuitive (moreover,...

A practical index for text retrieval allowing errors (1997)

Ricardo Baeza-yates, Gonzalo Navarro

We propose a text indexing technique for approximate pattern matching, which is practical and especially aimed at Information Retrieval (IR). Unlike other indices of this kind, it is able to retrieve...

Block-addressing indices for approximate text retrieval (1997)

Ricardo Baeza-yates, Gonzalo Navarro

Although the issue of approximate text retrieval is gaining importance in the last years, it is currently addressed by only a few indexing schemes. To reduce space requirements, the indices may point...

Proximal Nodes: A Model to Query Document Databases by Content and Structure (1997)

Gonzalo Navarro, Ricardo Baeza-yates

A model to query document databases by both their content and structure is presented. The

Multiple approximate string matching (1997)

Ricardo Baeza-yates, Gonzalo Navarro

Abstract. We present two new algorithms for on-line multiple approximate string matching. These are extensions of previous algorithms that search for a single pattern. The single-pattern version of...

Block-addressing indices for approximate text retrieval (1997)

Ricardo Baeza-yates, Gonzalo Navarro

We focus on indices for approximate word retrieval. This is an issue which is gaining importance in the last years. At the present time only a few indexing schemes, mostly based on inverted lists,...

Block Addressing Indices for Approximate Text Retrieval (1997)

Ricardo Baeza-yates, Gonzalo Navarro

The issue of reducing the space overhead when indexing large text databases is becoming more and more important, as the text collections grow in size. Another subject, which is gaining importance as...

Proximal Nodes: A Model to Query Document Databases by Content and Structure (1997)

Gonzalo Navarro, Ricardo Baeza-yates

this paper is to present a model to structure and query document databases, following the new trend of mixing content and structure in queries. The model is shown to be expressive and efficiently...

Indexing Methods for Approximate Text Retrieval (Extended Abstract) (1997)

Ricardo Baeza-yates, Gonzalo Navarro, Erkki Sutinen, Jorma Tarhio

While the problem of on-line approximate string matching is well studied, only recently the first off-line indexing techniques have emerged. We study the different indexing mechanisms for this...

A class of linear algorithms to process sets of segments (1996)

Gonzalo Navarro, Ricardo Baeza-yates

We address the problem of efficiently performing operations on sets of segments. While current solutions to operate segments focus on single operations (e.g. insertion or searching), we are...

A Fast Heuristic for Approximate String Matching (1996)

Ricardo Baeza-yates, Gonzalo Navarro

This work has been supported in part by FONDECYT grants 1950622 and 1960881. Abstract. We study a fast algorithm for on-line approximate string matching. It is based on a non-deterministic finite...

A Faster Algorithm for Approximate String Matching (1996)

Ricardo Baeza-yates, Gonzalo Navarro

Abstract. We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a non-deterministic finite automaton built from the pattern and using the...

A Faster Algorithm for Approximate String Matching (1996)

Ricardo Baeza-yates

Approximate string matching is one of the main problems in classical string algorithms. Given a text of length n, a pattern of length m, and a maximal number of errors allowed, k, we want to find all...

A Unified View to String Matching Algorithms (1996)

Ricardo Baeza-yates

. We present a unified view to sequential algorithms for many pattern matching problems, using a finite automaton built from the pattern which uses the text as input. We show the limitations of...

An Extended Model for Full Text Databases (1996)

Ricardo Baeza-yates

Query languages of full text retrieval systems are based on several assumptions about the input text. Many systems assume that the input has some basic structure (say, documents and words) which...

A Unified View to String Matching Algorithms (1996)

Ricardo Baeza-yates

We present a unified view to sequential algorithms for many pattern matching problems, using a finite automaton built from the pattern which uses the text as input. We show the limitations of...

Visualization of Large Answers in Text Databases (1996)

Ricardo Baeza-yates

Current user interfaces of full text retrieval systems do not help in the process of filtering the result of a query, usually very large. We address this problem and we propose a visual interface to...

Hierarchies Of Indices For Text Searching (1996)

Ricardo Baeza-yates, Eduardo F. Barbosa, Nivio Ziviani

We present an efficient implementation of a recently known index for text databases, when the database is stored on secondary storage devices such as magnetic or optical disks. The implementation is...

A Faster Algorithm for Approximate String Matching (Extended Abstract) (1996)

Ricardo Baeza-yates, Gonzalo Navarro

Ricardo Baeza-Yates Gonzalo Navarro Department of Computer Science University of Chile Blanco Encalada 2120 - Santiago - Chile frbaeza,gnavarrog@dcc.uchile.cl Abstract We present a new algorithm for...

Expressive power of a new model for structured text databases (1995)

Gonzalo Navarro, Ricardo Baeza-yates

This paper studies the expressivity of a new model for structuring and querying textual databases by both the structure and contents of the text. The key idea of the model is a set-oriented query...

On Boyer-Moore Automata (1994)

Ricardo Baeza-Yates, Christian Choffrut, Gaston H. Gonnet

The notion of Boyer-Moore automaton was introduced by Knuth, Morris and Pratt in their historical paper on fast pattern matching. It leads to an algorithm that requires more preprocessing but is more...

Fast Two Dimensional Pattern Matching (1993)

Ricardo Baeza-yates, Mireille Régnier

An algorithm for searching for a two dimensional m \Theta m pattern in a two dimensional n \Theta n text is presented. It performs on the average less comparisons than the size of the text: n 2 =m...

Improved Bounds for the Expected Behaviour of AVL Trees (1992)

Ricardo Baeza-yates, Gaston H. Gonnet, Nivio Ziviani, Belo Horizonte, Minas Gerais, Minas Gerais

In this paper we improve previous bounds on expected measures of AVL trees by using fringe analysis. A new way of handling larger tree collections that are not closed is presented. An inherent...