Abstract: The “small world” phenomenon, i.e., the fact that the
global social network is strongly connected in the sense
that every two persons are inter-related through a small
chain of friends, has attracted research attention and has
been strongly related to the results of the social
psychologist¢s Stanley Milgram experiments; properties
of social networks and relevant problems also emerge in
peer-to-peersystems and their study can shed light on
important modern network design properties.
In this paper, we have experimentally studied greedy
routing algorithms, i.e., algorithms that route information
using “long-range” connections that function as
shortcuts connecting “distant” network nodes. In
particular, we have implemented greedy routing
algorithms, and techniques from the recent literature in
networks of line and grid topology using parallelization
for increasing efficiency. To the best of our knowledge, no
similar attempt has been made so far
Abstract: Counting items in a distributed system, and estimating the cardinality of multisets in particular,
is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name
containing “spiderman”) where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are:
(1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load
balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy
integration in applications. This article contributes the DHS (Distributed Hash Sketches) method
for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator.
DHSis based on hash sketches for probabilistic counting, but distributes the bits of each counter
across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying
careful attention to fast access and aggregation as well as update costs. The article discusses various
design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and
load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.
Abstract: Information retrieval (IR) in peer-to-peer (P2P) networks,
where the corpus is spread across many loosely coupled
peers, has recently gained importance. In contrast to IR
systems on a centralized server or server farm, P2P IR faces
the additional challenge of either being oblivious to global
corpus statistics or having to compute the global measures
from local statistics at the individual peers in an efficient,
distributed manner. One specific measure of interest is the
global document frequency for different terms, which would
be very beneficial as term-specific weights in the scoring and
ranking of merged search results that have been obtained
from different peers.
This paper presents an efficient solution for the problem
of estimating global document frequencies in a large-scale
P2P network with very high dynamics where peers can join
and leave the network on short notice. In particular, the
developed method takes into account the fact that the lo-
cal document collections of autonomous peers may arbitrar-
ily overlap, so that global counting needs to be duplicate-
insensitive. The method is based on hash sketches as a
technique for compact data synopses. Experimental stud-
ies demonstrate the estimator?s accuracy, scalability, and
ability to cope with high dynamics. Moreover, the benefit
for ranking P2P search results is shown by experiments with
real-world Web data and queries.
Abstract: Information retrieval (IR) in peer-to-peer (P2P) networks,
where the corpus is spread across many loosely coupled
peers, has recently gained importance. In contrast to IR
systems on a centralized server or server farm, P2P IR faces
the additional challenge of either being oblivious to global
corpus statistics or having to compute the global measures
from local statistics at the individual peers in an efficient,
distributed manner. One specific measure of interest is the
global document frequency for different terms, which would
be very beneficial as term-specific weights in the scoring and
ranking of merged search results that have been obtained
from different peers.
This paper presents an efficient solution for the problem
of estimating global document frequencies in a large-scale
P2P network with very high dynamics where peers can join
and leave the network on short notice. In particular, the
developed method takes into account the fact that the lo-
cal document collections of autonomous peers may arbitrar-
ily overlap, so that global counting needs to be duplicate-
insensitive. The method is based on hash sketches as a
technique for compact data synopses. Experimental stud-
ies demonstrate the estimator?s accuracy, scalability, and
ability to cope with high dynamics. Moreover, the benefit
for ranking P2P search results is shown by experiments with
real-world Web data and queries.
Abstract: We address the issue of measuring storage, or query load distribution fairness in peer-to-peer data management systems. Existing metrics may look promising from the point of view of specific peers, while in reality being far from optimal from a global perspective. Thus, first we define the requirements and study the appropriateness of various statistical metrics for measuring load distribution fairness towards these requirements. The metric proposed as most appropriate is the Gini coefficient (G). Second, we develop novel distributed sampling algorithms to compute G on-line, with high precision, efficiently, and scalably. Third, we show how G can readily be utilized on-line by higher-level algorithms which can now know when to best intervene to correct load imbalances. Our analysis and experiments testify for the efficiency and accuracy of these algorithms, permitting the online use of a rich and reliable metric, conveying a global perspective of the distribution.
Abstract: Efficient query processing in traditional database
management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statisticsmanagement still poses major challenges. With the work in this paper we aim to endow peer-to-peer data management over structured
overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge.
To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We show how to construct several types of histograms, such as simple Equi-Width, Average Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation
of these tools for distributed statistical synopses,
and report on a comprehensive experimental performance evaluation, evaluating our contributions in terms of efficiency, accuracy, and scalability.
Abstract: In this work we study how to process complex queries in DHT-based
Peer-to-Peer (P2P) data networks. Queries are made over tuples and relations
and are expressed in a query language, such as SQL. We describe existing
research approaches for query processing in P2P systems, we suggest
improvements and enhancements, and propose a unifying framework that
consists of a modified DHT architecture, data placement and search algorithms,
and provides efficient support for processing a variety of query types, including
queries with one or more attributes, queries with selection operators (involving
equality and range queries), and queries with join operators. To our knowledge,
this is the first work that puts forth a framework providing support for all these
query types.
Abstract: Peer-to-peer sharing systems are becoming
increasingly popular and an exciting new class of
innovative, internet-based data management
systems. In these systems, users contribute their
own resources (processing units and storage
devices) and content (i.e., documents) to the P2P
community. We focus on the management of
content and resources in such systems. Our goal
is to harness all available resources in the P2P
network so that the users can access all available
content efficiently. Efficiency is taken both from
(i) the point of view of the system, in that we
strive to ensure fair load distribution among all
peer nodes, and (ii) from the point of view of the
users, in that we strive to ensure low user-request
response times.
We propose a novel architecture for this new
class of applications, which differs drastically
from what is either found currently in existing
products or proposed in academia. We contribute
and study novel solutions that achieve our goals,
while at the same time addressing the formidable
challenges due to the autonomy of peers, their
heterogeneous processing and storage capacities,
their different content contributions, the huge
system scale, and the highly dynamic system
environment.