Aigaion: RACTI / RU1 Technical Report Series (Web Based)

[RACTI-RU1-2010-33] Athanassopoulos, Stavros, Kaklamanis, Christos, Laftsidis, Ilias and Papaioannou, Evi, An Experimental Study of Greedy Routing Algorithms, in: The 2010 International Conference on High Performance Computing & Simulation (HPCS), France, 2010.
Abstract: The “small world” phenomenon, i.e., the fact that the global social network is strongly connected in the sense that every two persons are inter-related through a small chain of friends, has attracted research attention and has been strongly related to the results of the experiments of the social psychologist Stanley Milgram; properties of social networks and relevant problems also emerge in peer-to-peer systems and their study can shed light on important modern network design properties. In this paper, we have experimentally studied greedy routing algorithms, i.e., algorithms that route information using “long-range” connections that function as shortcuts connecting “distant” network nodes. In particular, we have implemented greedy routing algorithms and techniques from the recent literature in networks of line and grid topology, using parallelization for increasing efficiency. To the best of our knowledge, no similar attempt has been made so far.
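Illustrative note: the following is a minimal sketch of greedy routing on a line topology, not the implementation evaluated in the paper. It assumes a Kleinberg-style model in which each node has one long-range contact drawn with probability proportional to 1/distance; that distribution, and the names `build_contacts` and `greedy_route`, are assumptions made for this example only.

```python
import random


def build_contacts(n, rng):
    """Give each node on a line of n nodes one long-range contact.

    Assumption: the contact v of node u is drawn with probability
    proportional to 1/|u - v| (harmonic distribution).
    """
    contacts = []
    for u in range(n):
        others = [v for v in range(n) if v != u]
        weights = [1.0 / abs(u - v) for v in others]
        contacts.append(rng.choices(others, weights=weights, k=1)[0])
    return contacts


def greedy_route(n, contacts, source, target):
    """Forward greedily to the known node closest to the target; return hops."""
    current, hops = source, 0
    while current != target:
        candidates = [contacts[current]]      # long-range shortcut
        if current > 0:
            candidates.append(current - 1)    # left line neighbour
        if current < n - 1:
            candidates.append(current + 1)    # right line neighbour
        # One line neighbour is always strictly closer, so the loop terminates.
        current = min(candidates, key=lambda v: abs(v - target))
        hops += 1
    return hops


if __name__ == "__main__":
    rng = random.Random(0)
    contacts = build_contacts(200, rng)
    print(greedy_route(200, contacts, 0, 199))
```

Because a line neighbour always reduces the distance to the target by one, the hop count never exceeds the line distance between source and target; the shortcuts can only shorten the route.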
[RACTI-RU1-2009-87] Ntarmos, Nikos, Triantafillou, Peter and Weikum, Gerhard, Distributed Hash Sketches: Scalable, Efficient, and Accurate Cardinality Estimation for Distributed Multisets, in: ACM Transactions on Computer Systems, ACM TOCS, 2009.
Abstract: Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name containing “spiderman”) where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are: (1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy integration in applications. This article contributes the DHS (Distributed Hash Sketches) method for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator. DHS is based on hash sketches for probabilistic counting, but distributes the bits of each counter across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying careful attention to fast access and aggregation as well as update costs. The article discusses various design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.
[RACTI-RU1-2006-15] Michel, Sebastian, Bender, Matthias, Triantafillou, Peter and Weikum, Gerhard, Global Document Frequency Estimation in Peer-to-Peer Web Search, in: 9th International Workshop on the Web and Databases (WebDB 2006), pages 62-67, 2006.
Abstract: Information retrieval (IR) in peer-to-peer (P2P) networks, where the corpus is spread across many loosely coupled peers, has recently gained importance. In contrast to IR systems on a centralized server or server farm, P2P IR faces the additional challenge of either being oblivious to global corpus statistics or having to compute the global measures from local statistics at the individual peers in an efficient, distributed manner. One specific measure of interest is the global document frequency for different terms, which would be very beneficial as term-specific weights in the scoring and ranking of merged search results that have been obtained from different peers. This paper presents an efficient solution for the problem of estimating global document frequencies in a large-scale P2P network with very high dynamics where peers can join and leave the network on short notice. In particular, the developed method takes into account the fact that the local document collections of autonomous peers may arbitrarily overlap, so that global counting needs to be duplicate-insensitive. The method is based on hash sketches as a technique for compact data synopses. Experimental studies demonstrate the estimator's accuracy, scalability, and ability to cope with high dynamics. Moreover, the benefit for ranking P2P search results is shown by experiments with real-world Web data and queries.
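Illustrative note: the sketch below shows why a hash sketch gives duplicate-insensitive counting across overlapping peers. It is a single-bitmap Flajolet-Martin style sketch written for this list, not the estimator from the paper; the helper names, the use of SHA-256 as the hash function, and the sample document identifiers are assumptions. A single bitmap gives only a coarse estimate, so practical schemes average many bitmaps.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


def _hash64(item):
    """Map an item to a 64-bit integer (assumed hash: truncated SHA-256)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(items, bits=64):
    """Build a bitmap: each item sets the bit indexed by its lowest 1-bit."""
    bitmap = 0
    for item in items:
        h = _hash64(item)
        rho = (h & -h).bit_length() - 1 if h else bits - 1
        bitmap |= 1 << rho  # setting a bit twice changes nothing
    return bitmap


def merge(a, b):
    """Sketch of a union of collections is the bitwise OR of their sketches."""
    return a | b


def estimate(bitmap):
    """Estimate the distinct count from the position of the lowest zero bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI


if __name__ == "__main__":
    peer_a = ["doc%d" % i for i in range(100)]
    peer_b = ["doc%d" % i for i in range(50, 150)]  # overlaps peer_a
    print(estimate(merge(sketch(peer_a), sketch(peer_b))))
```

Since insertion only ever sets bits, a document held by several peers contributes exactly once to the merged bitmap, which is the property the abstract calls duplicate-insensitive.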
[RACTI-RU1-2006-17] Michel, Sebastian, Bender, Matthias, Triantafillou, Peter and Weikum, Gerhard, IQN Routing: Integrating Quality and Novelty in P2P Querying and Ranking, in: 10th International Conference on Extending Database Technology (EDBT 2006), pages 62-67, 2006.
[RACTI-RU1-2007-32] Pitoura, Theoni and Triantafillou, Peter, Load Distribution Fairness in P2P Data Management Systems, in: the IEEE 23rd International Conference on Data Engineering (ICDE 2007), pages 396-405, 2007.
Abstract: We address the issue of measuring storage, or query load distribution fairness in peer-to-peer data management systems. Existing metrics may look promising from the point of view of specific peers, while in reality being far from optimal from a global perspective. Thus, first we define the requirements and study the appropriateness of various statistical metrics for measuring load distribution fairness towards these requirements. The metric proposed as most appropriate is the Gini coefficient (G). Second, we develop novel distributed sampling algorithms to compute G on-line, with high precision, efficiently, and scalably. Third, we show how G can readily be utilized on-line by higher-level algorithms which can now know when to best intervene to correct load imbalances. Our analysis and experiments testify for the efficiency and accuracy of these algorithms, permitting the online use of a rich and reliable metric, conveying a global perspective of the distribution.
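Illustrative note: for reference, the Gini coefficient of a load vector can be computed centrally as below. This is the standard textbook formula over sorted values, shown only to make the metric concrete; it is not the distributed sampling algorithm the paper develops, and the function name `gini` and the sample loads are assumptions.

```python
def gini(loads):
    """Gini coefficient of non-negative loads: 0 means perfectly even.

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    with x sorted ascending and i counted from 1.
    """
    xs = sorted(loads)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here for empty or all-zero input
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    print(gini([5, 5, 5, 5]))  # evenly loaded peers
    print(gini([0, 0, 0, 1]))  # one peer carries the whole load
```

With n peers, the value ranges from 0 (all peers equally loaded) up to (n - 1)/n (a single peer carries everything), which gives one global number for load distribution fairness.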
[RACTI-RU1-2009-88] Ntarmos, Nikos, Triantafillou, Peter and Weikum, Gerhard, Statistical Structures for Internet-Scale Data Management, in: Statistical Structures for Internet-Scale Data Management, 2009.
Abstract: Efficient query processing in traditional database management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statistics management still poses major challenges. With the work in this paper we aim to endow peer-to-peer data management over structured overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge. To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We show how to construct several types of histograms, such as simple Equi-Width, Average Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation of these tools for distributed statistical synopses, and report on a comprehensive experimental performance evaluation, evaluating our contributions in terms of efficiency, accuracy, and scalability.
[RACTI-RU1-2003-38] Triantafillou, Peter and Pitoura, Theoni, Towards a Unifying Framework for Complex Query Processing over Structured Peer-to-Peer Data Networks, in: Databases, Information Systems, and Peer-to-Peer Computing (DBISP2P 2003), pages 169-183, Springer, Berlin, Germany, 2003.
Abstract: In this work we study how to process complex queries in DHT-based Peer-to-Peer (P2P) data networks. Queries are made over tuples and relations and are expressed in a query language, such as SQL. We describe existing research approaches for query processing in P2P systems, we suggest improvements and enhancements, and propose a unifying framework that consists of a modified DHT architecture, data placement and search algorithms, and provides efficient support for processing a variety of query types, including queries with one or more attributes, queries with selection operators (involving equality and range queries), and queries with join operators. To our knowledge, this is the first work that puts forth a framework providing support for all these query types.
[RACTI-RU1-2003-31] Triantafillou, Peter, Xiruhaki, C., Koubarakis, Manolis and Ntarmos, Nikos, Towards High Performance Peer-to-Peer Content and Resource Sharing Systems, in: Conference on Innovative Data Systems Research (CIDR 2003), Asilomar, CA, 2003.
Abstract: Peer-to-peer sharing systems are becoming increasingly popular and an exciting new class of innovative, internet-based data management systems. In these systems, users contribute their own resources (processing units and storage devices) and content (i.e., documents) to the P2P community. We focus on the management of content and resources in such systems. Our goal is to harness all available resources in the P2P network so that the users can access all available content efficiently. Efficiency is taken both from (i) the point of view of the system, in that we strive to ensure fair load distribution among all peer nodes, and (ii) from the point of view of the users, in that we strive to ensure low user-request response times. We propose a novel architecture for this new class of applications, which differs drastically from what is either found currently in existing products or proposed in academia. We contribute and study novel solutions that achieve our goals, while at the same time addressing the formidable challenges due to the autonomy of peers, their heterogeneous processing and storage capacities, their different content contributions, the huge system scale, and the highly dynamic system environment.