Abstract: Counting in general, and estimating the cardinality of (multi-) sets in particular, is highly desirable for a large variety of applications, representing a foundational block for the efficient deployment and access of emerging internet-scale information systems. Examples of such applications range from optimizing query access plans in internet-scale databases, to evaluating the significance (rank/score) of various data items in information retrieval applications. The key constraints that any acceptable solution must satisfy are: (i) efficiency: the number of nodes that need be contacted for counting purposes must be small in order to enjoy small latency and bandwidth requirements; (ii) scalability, seemingly contradicting the efficiency goal: arbitrarily large numbers of nodes nay need to add elements to a (multi-) set, which dictates the need for a highly distributed solution, avoiding server-based scalability, bottleneck, and availability problems; (iii) access and storage load balancing: counting and related overhead chores should be distributed fairly to the nodes of the network; (iv) accuracy: tunable, robust (in the presence of dynamics and failures) and highly accurate cardinality estimation; (v) simplicity and ease of integration: special, solution-specific indexing structures should be avoided. In this paper, first we contribute a highly-distributed, scalable, efficient, and accurate (multi-) set cardinality estimator. Subsequently, we show how to use our solution to build and maintain histograms, which have been a basic building block for query optimization for centralized databases, facilitating their porting into the realm of internet-scale data networks.
Abstract: Wireless sensor networks are comprised of a vast number of devices, situated in an area of interest that self organize in a structureless network, in order to monitor/record/measure an environmental variable or phenomenon and subsequently to disseminate the data to the control center.
Here we present research focused on the development, simulation and evaluation of energy efficient algorithms, our basic goal is to minimize the energy consumption. Despite technology advances, the problem of energy use optimization remains valid since current and emerging hardware solutions fail to solve it.
We aim to reduce communication cost, by introducing novel techniques that facilitate the development of new algorithms. We investigated techniques of distributed adaptation of the operations of a protocol by using information available locally on every node, thus through local choices we improve overall performance. We propose techniques for collecting and exploiting limited local knowledge of the network conditions. In an energy efficient manner, we collect additional information which is used to achieve improvements such as forming energy efficient, low latency and fault tolerant paths to route data. We investigate techniques for managing mobility in networks where movement is a characteristic of the control center as well as the sensors. We examine methods for traversing and covering the network field based on probabilistic movement that uses local criteria to favor certain areas.
The algorithms we develop based on these techniques operate a) at low level managing devices, b) on the routing layer and c) network wide, achieving macroscopic behavior through local interactions. The algorithms are applied in network cases that differ in density, node distribution, available energy and also in fundamentally different models, such as under faults, with incremental node deployment and mobile nodes. In all these settings our techniques achieve significant gains, thus distinguishing their value as tools of algorithmic design.
Abstract: Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.
Abstract: With this work we aim to make a three-fold contribution.
We first address the issue of supporting efficiently queries
over string-attributes involving prefix, suffix, containment,
and equality operators in large-scale data networks. Our
first design decision is to employ distributed hash tables
(DHTs) for the data network?s topology, harnessing their
desirable properties. Our next design decision is to derive
DHT-independent solutions, treating DHT as a black box.
Second, we exploit this infrastructure to develop efficient
content based publish/subscribe systems. The main con-
tribution here are algorithms for the efficient processing of
queries (subscriptions) and events (publications). Specifi-
cally, we show that our subscription processing algorithms
require O(logN) messages for a N-node network, and our
event processing algorithms require O(l ? logN) messages
(with l being the average string length).
Third, we develop algorithms for optimizing the proces-
sing of multi-dimensional events, involving several string at-
tributes. Further to our analysis, we provide simulation-
based experiments showing promising performance results
in terms of number of messages, required bandwidth, load
balancing, and response times.
Abstract: The promises inherent in users coming together to form data
sharing network communities, bring to the foreground new problems formulated
over such dynamic, ever growing, computing, storage, and networking
infrastructures. A key open challenge is to harness these highly
distributed resources toward the development of an ultra scalable, efficient
search engine. From a technical viewpoint, any acceptable solution
must fully exploit all available resources dictating the removal of any
centralized points of control, which can also readily lead to performance
bottlenecks and reliability/availability problems. Equally importantly,
however, a highly distributed solution can also facilitate pluralism in informing
users about internet content, which is crucial in order to preclude
the formation of information-resource monopolies and the biased visibility
of content from economically-powerful sources. To meet these challenges,
the work described here puts forward MINERVA{\^a}{\"i}申{\"i}申, a novel search
engine architecture, designed for scalability and efficiency. MINERVA{\^a}{\"i}申{\"i}申
encompasses a suite of novel algorithms, including algorithms for creating
data networks of interest, placing data on network nodes, load balancing,
top-k algorithms for retrieving data at query time, and replication algorithms
for expediting top-k query processing. We have implemented the
proposed architecture and we report on our extensive experiments with
real-world, web-crawled, and synthetic data and queries, showcasing the
scalability and efficiency traits of MINERVA{\^a}{\"i}申{\"i}申.
Abstract: This paper studies the data gathering problem in wireless networks, where data generated at the nodes has to be collected at a single sink. We investigate the relationship between routing optimality and fair resource management. In particular, we prove that for energy balanced data propagation, Pareto optimal routing and flow maximization are equivalent, and also prove that flow maximization is equivalent to maximizing the network lifetime. We algebraically characterize the network structures in which energy balanced data flows are maximal. Moreover, we algebraically characterize communication links which are not used by an optimal flow. This leads to the characterization of minimal network structures supporting the maximal flows.
We note that energy balance, although implying global optimality, is a local property that can be computed efficiently and in a distributed manner. We suggest online distributed algorithms for energy balance in different optimal network structures and numerically show their stability in particular setting. We remark that although the results obtained in this paper have a direct consequence in energy saving for wireless networks they do not limit themselves to this type of networks neither to energy as a resource. As a matter of fact, the results are much more general and can be used for any type of network and different type of resources.
Abstract: This paper studies the data gathering problem in wireless networks, where data generated at the nodes has to be collected at a single sink. We investigate the relationship between routing optimality and fair resource management. In particular, we prove that for energy-balanced data propagation, Pareto optimal routing and flow maximization are equivalent, and also prove that flow maximization is equivalent to maximizing the network lifetime. We algebraically characterize the network structures in which energy-balanced data flows are maximal. Moreover, we algebraically characterize communication links which are not used by an optimal flow. This leads to the characterization of minimal network structures supporting the maximal flows.
We note that energy-balance, although implying global optimality, is a local property that can be computed efficiently and in a distributed manner. We suggest online distributed algorithms for energy-balance in different optimal network structures and numerically show their stability in particular setting. We remark that although the results obtained in this paper have a direct consequence in energy saving for wireless networks they do not limit themselves to this type of networks neither to energy as a resource. As a matter of fact, the results are much more general and can be used for any type of network and different types of resources.
Abstract: Efficient query processing in traditional database
management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statisticsmanagement still poses major challenges. With the work in this paper we aim to endow peer-to-peer data management over structured
overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge.
To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We show how to construct several types of histograms, such as simple Equi-Width, Average Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation
of these tools for distributed statistical synopses,
and report on a comprehensive experimental performance evaluation, evaluating our contributions in terms of efficiency, accuracy, and scalability.