Abstract: Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.
Abstract: This paper addresses the efficient processing of
top-k queries in wide-area distributed data
repositories where the index lists for the attribute
values (or text terms) of a query are distributed
across a number of data peers and the
computational costs include network latency,
bandwidth consumption, and local peer work.
We present KLEE, a novel algorithmic
framework for distributed top-k queries,
designed for high performance and flexibility.
KLEE makes a strong case for approximate top-k
algorithms over widely distributed data sources.
It shows how great gains in efficiency can be
enjoyed at low result-quality penalties. Further,
KLEE affords the query-initiating peer the
flexibility to trade-off result quality and expected
performance and to trade-off the number of
communication phases engaged during query
execution versus network bandwidth
performance. We have implemented KLEE and
related algorithms and conducted a
comprehensive performance evaluation. Our
evaluation employed real-world and synthetic
large, web-data collections, and query
benchmarks. Our experimental results show that
KLEE can achieve major performance gains in
terms of network bandwidth, query response
times, and much lighter peer loads, all with small
errors in result precision and other result-quality
measures
Abstract: The promises inherent in users coming together to form data
sharing network communities, bring to the foreground new problems formulated
over such dynamic, ever growing, computing, storage, and networking
infrastructures. A key open challenge is to harness these highly
distributed resources toward the development of an ultra scalable, efficient
search engine. From a technical viewpoint, any acceptable solution
must fully exploit all available resources dictating the removal of any
centralized points of control, which can also readily lead to performance
bottlenecks and reliability/availability problems. Equally importantly,
however, a highly distributed solution can also facilitate pluralism in informing
users about internet content, which is crucial in order to preclude
the formation of information-resource monopolies and the biased visibility
of content from economically-powerful sources. To meet these challenges,
the work described here puts forward MINERVA{\^a}{\"i}¿½{\"i}¿½, a novel search
engine architecture, designed for scalability and efficiency. MINERVA{\^a}{\"i}¿½{\"i}¿½
encompasses a suite of novel algorithms, including algorithms for creating
data networks of interest, placing data on network nodes, load balancing,
top-k algorithms for retrieving data at query time, and replication algorithms
for expediting top-k query processing. We have implemented the
proposed architecture and we report on our extensive experiments with
real-world, web-crawled, and synthetic data and queries, showcasing the
scalability and efficiency traits of MINERVA{\^a}{\"i}¿½{\"i}¿½.
Abstract: The peer-to-peer computing paradigm is an intriguing alternative to Google-style search
engines for querying and ranking Web content. In a network with many thousands or
millions of peers the storage and access load requirements per peer are much lighter
than for a centralized Google-like server farm; thus more powerful techniques from information
retrieval, statistical learning, computational linguistics, and ontological reasoning
can be employed on each peer¢s local search engine for boosting the quality
of search results. In addition, peers can dynamically collaborate on advanced and particularly
difficult queries. Moroever, a peer-to-peer setting is ideally suited to capture
local user behavior, like query logs and click streams, and disseminate and aggregate
this information in the network, at the discretion of the corresponding user, in order to
incorporate richer cognitive models.
This paper gives an overview of ongoing work in the EU Integrated Project DELIS
that aims to develop foundations for a peer-to-peer search engine with Google-or-better
scale, functionality, and quality, which will operate in a completely decentralized and
self-organizing manner. The paper presents the architecture of such a system and the
Minerva prototype testbed, and it discusses various core pieces of the approach: efficient
execution of top-k ranking queries, strategies for query routing when a search request
needs to be forwarded to other peers, maintaining a self-organizing semantic overlay
network, and exploiting and coping with user and community behavior.