Aigaion: RACTI / RU1 Technical Report Series (Web Based)

[RACTI-RU1-2009-90] Neumann, Thomas, Bender, Matthias, Michel, Sebastian, Schenkel, Ralf, Triantafillou, Peter and Weikum, Gerhard, Distributed top-k aggregation queries at large, in: Distributed and Parallel Databases, DAPD, 2009.
Abstract: Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.
[RACTI-RU1-2006-15] Michel, Sebastian, Bender, Matthias, Triantafillou, Peter and Weikum, Gerhard, Global Document Frequency Estimation in Peer-to-Peer Web Search, in: 9th International Workshop on the Web and Databases (WebDB 2006), pages 62-67, 2006.
Abstract: Information retrieval (IR) in peer-to-peer (P2P) networks, where the corpus is spread across many loosely coupled peers, has recently gained importance. In contrast to IR systems on a centralized server or server farm, P2P IR faces the additional challenge of either being oblivious to global corpus statistics or having to compute the global measures from local statistics at the individual peers in an efficient, distributed manner. One specific measure of interest is the global document frequency for different terms, which would be very beneficial as term-specific weights in the scoring and ranking of merged search results that have been obtained from different peers. This paper presents an efficient solution for the problem of estimating global document frequencies in a large-scale P2P network with very high dynamics where peers can join and leave the network on short notice. In particular, the developed method takes into account the fact that the local document collections of autonomous peers may arbitrarily overlap, so that global counting needs to be duplicate-insensitive. The method is based on hash sketches as a technique for compact data synopses. Experimental studies demonstrate the estimator's accuracy, scalability, and ability to cope with high dynamics. Moreover, the benefit for ranking P2P search results is shown by experiments with real-world Web data and queries.
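Illustration (not taken from the paper): the abstract above relies on hash sketches for duplicate-insensitive counting. The following is a minimal sketch of that general idea in the style of Flajolet-Martin probabilistic counting, assuming each peer summarizes the document ids that contain a term. The function names (`sketch`, `merge`, `estimate`), the number of bitmaps, and the use of SHA-1 are assumptions made for this example; the paper's actual estimator and parameters may differ.

```python
import hashlib

NUM_BITS = 32      # bits kept per bitmap (assumption for this example)
PHI = 0.77351      # correction constant from Flajolet-Martin probabilistic counting


def _lowest_set_bit(value: int) -> int:
    """0-based position of the least-significant 1-bit; NUM_BITS - 1 if value is 0."""
    if value == 0:
        return NUM_BITS - 1
    return (value & -value).bit_length() - 1


def sketch(doc_ids, num_bitmaps=64):
    """Build a hash sketch (a list of integer bitmaps) over a peer's document ids."""
    bitmaps = [0] * num_bitmaps
    for doc_id in doc_ids:
        digest = hashlib.sha1(str(doc_id).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket = h % num_bitmaps                       # which bitmap to update
        rest = (h // num_bitmaps) & ((1 << NUM_BITS) - 1)
        bitmaps[bucket] |= 1 << _lowest_set_bit(rest)  # set one bit per document
    return bitmaps


def merge(a, b):
    """Combine two sketches with bitwise OR; duplicates across peers have no effect."""
    return [x | y for x, y in zip(a, b)]


def estimate(bitmaps):
    """Estimate the number of distinct ids from the runs of low-order 1-bits."""
    m = len(bitmaps)
    total = 0
    for bm in bitmaps:
        r = 0
        while bm & (1 << r):
            r += 1
        total += r
    return m / PHI * 2 ** (total / m)


# Two peers whose local collections overlap on ids 2000..2999.
peer_a = sketch(range(0, 3000))
peer_b = sketch(range(2000, 5000))
combined = merge(peer_a, peer_b)
approx_distinct = estimate(combined)  # approximates the 5000 distinct documents
```

Because merging is a bitwise OR, the combined sketch is identical to the sketch of the union of both collections, so overlapping documents are counted once no matter how many peers hold them.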
[RACTI-RU1-2005-11] Aekaterinidis, Ioannis and Triantafillou, Peter, Internet Scale String Attribute Publish/Subscribe Data Networks, in: ACM 14th Conference on Information and Knowledge Management (CIKM 2005), Bremen, Germany, 2005.
Abstract: With this work we aim to make a three-fold contribution. We first address the issue of efficiently supporting queries over string attributes involving prefix, suffix, containment, and equality operators in large-scale data networks. Our first design decision is to employ distributed hash tables (DHTs) for the data network's topology, harnessing their desirable properties. Our next design decision is to derive DHT-independent solutions, treating the DHT as a black box. Second, we exploit this infrastructure to develop efficient content-based publish/subscribe systems. The main contributions here are algorithms for the efficient processing of queries (subscriptions) and events (publications). Specifically, we show that our subscription processing algorithms require O(log N) messages for an N-node network, and our event processing algorithms require O(l × log N) messages (with l being the average string length). Third, we develop algorithms for optimizing the processing of multi-dimensional events, involving several string attributes. Further to our analysis, we provide simulation-based experiments showing promising performance results in terms of number of messages, required bandwidth, load balancing, and response times.
[RACTI-RU1-2006-17] Michel, Sebastian, Bender, Matthias, Triantafillou, Peter and Weikum, Gerhard, IQN Routing: Integrating Quality and Novelty in P2P Querying and Ranking, in: 10th International Conference on Extending Database Technology (EDBT 2006), pages 62-67, 2006.
Abstract: Information retrieval (IR) in peer-to-peer (P2P) networks, where the corpus is spread across many loosely coupled peers, has recently gained importance. In contrast to IR systems on a centralized server or server farm, P2P IR faces the additional challenge of either being oblivious to global corpus statistics or having to compute the global measures from local statistics at the individual peers in an efficient, distributed manner. One specific measure of interest is the global document frequency for different terms, which would be very beneficial as term-specific weights in the scoring and ranking of merged search results that have been obtained from different peers. This paper presents an efficient solution for the problem of estimating global document frequencies in a large-scale P2P network with very high dynamics where peers can join and leave the network on short notice. In particular, the developed method takes into account the fact that the local document collections of autonomous peers may arbitrarily overlap, so that global counting needs to be duplicate-insensitive. The method is based on hash sketches as a technique for compact data synopses. Experimental studies demonstrate the estimator's accuracy, scalability, and ability to cope with high dynamics. Moreover, the benefit for ranking P2P search results is shown by experiments with real-world Web data and queries.
[RACTI-RU1-2005-12] Michel, Sebastian, Triantafillou, Peter and Weikum, Gerhard, KLEE: A Framework for Distributed Top-K Query Algorithms, in: 31st International Conference on Very Large Data Bases (VLDB 2005), 2005.
Abstract: This paper addresses the efficient processing of top-k queries in wide-area distributed data repositories where the index lists for the attribute values (or text terms) of a query are distributed across a number of data peers and the computational costs include network latency, bandwidth consumption, and local peer work. We present KLEE, a novel algorithmic framework for distributed top-k queries, designed for high performance and flexibility. KLEE makes a strong case for approximate top-k algorithms over widely distributed data sources. It shows how great gains in efficiency can be enjoyed at low result-quality penalties. Further, KLEE affords the query-initiating peer the flexibility to trade off result quality against expected performance, and to trade off the number of communication phases engaged during query execution against network bandwidth performance. We have implemented KLEE and related algorithms and conducted a comprehensive performance evaluation. Our evaluation employed large real-world and synthetic web-data collections and query benchmarks. Our experimental results show that KLEE can achieve major performance gains in terms of network bandwidth, query response times, and much lighter peer loads, all with small errors in result precision and other result-quality measures.
[RACTI-RU1-2005-8] Michel, Sebastian, Triantafillou, Peter and Weikum, Gerhard, MINERVA∞: A Scalable Efficient Peer-to-Peer Search Engine, in: ACM/IFIP/USENIX 6th International Middleware Conference, Middleware 2005, 2005.
Abstract: The promises inherent in users coming together to form data sharing network communities bring to the foreground new problems formulated over such dynamic, ever growing, computing, storage, and networking infrastructures. A key open challenge is to harness these highly distributed resources toward the development of an ultra scalable, efficient search engine. From a technical viewpoint, any acceptable solution must fully exploit all available resources, dictating the removal of any centralized points of control, which can also readily lead to performance bottlenecks and reliability/availability problems. Equally importantly, however, a highly distributed solution can also facilitate pluralism in informing users about internet content, which is crucial in order to preclude the formation of information-resource monopolies and the biased visibility of content from economically-powerful sources. To meet these challenges, the work described here puts forward MINERVA∞, a novel search engine architecture, designed for scalability and efficiency. MINERVA∞ encompasses a suite of novel algorithms, including algorithms for creating data networks of interest, placing data on network nodes, load balancing, top-k algorithms for retrieving data at query time, and replication algorithms for expediting top-k query processing. We have implemented the proposed architecture and we report on our extensive experiments with real-world, web-crawled, and synthetic data and queries, showcasing the scalability and efficiency traits of MINERVA∞.
[RACTI-RU1-2006-8] Bender, Matthias, Michel, Sebastian, Triantafillou, Peter, Weikum, Gerhard and Zimmer, Christian, P2P Content Search: Give the Web Back to the People, in: 5th International Workshop on Peer-to-Peer Systems (IPTPS 2006), 2006.
Abstract: The proliferation of peer-to-peer (P2P) systems has come with various compelling applications including file sharing based on distributed hash tables (DHTs) or other kinds of overlay networks. Searching the content of files (especially Web Search) requires multi-keyword querying with scoring and ranking. Existing approaches have no way of taking into account the correlation between the keywords in the query. This paper presents our solution that incorporates the queries and behavior of the users in the P2P network such that interesting correlations can be inferred.
[RACTI-RU1-2005-7] Michel, Sebastian, Bender, Matthias, Weikum, Gerhard, Zimmer, Christian and Triantafillou, Peter, P2P web search with MINERVA: How do you want to search tomorrow, in: ACM/IFIP/USENIX 6th International Middleware Conference, 2005.
Abstract: MINERVA is a novel approach towards P2P Web search that connects an a-priori unlimited number of peers, each of which maintains a personal local database and a local search facility. Each peer posts a small amount of metadata to a physically distributed directory layered on top of a DHT-based overlay network that is used to efficiently select promising peers from across the peer population that can best locally execute a query. This paper proposes a live demonstration of MINERVA, showcasing the full information lifecycle: crawling web pages, disseminating metadata to a distributed directory, and executing queries online. We additionally invite all visitors to instantly join the network by executing a small piece of software.
[RACTI-RU1-2006-11] Pitoura, Theoni, Ntarmos, Nikos and Triantafillou, Peter, Replication, Load Balancing, and Efficient Range Query processing in DHT Data Networks, in: 10th International Conference on Extending Database Technology (EDBT 2006), 2006.
Abstract: We consider the conflicting problems of ensuring data-access load balancing and efficiently processing range queries on peer-to-peer data networks maintained over Distributed Hash Tables (DHTs). Placing consecutive data values in neighboring peers is frequently used in DHTs since it accelerates range query processing. However, such a placement is highly susceptible to load imbalances, which are preferably handled by replicating data (since replication also introduces fault tolerance benefits). In this paper, we present HotRoD, a DHT-based architecture that deals effectively with this combined problem through the use of a novel locality-preserving hash function, and a tunable data replication mechanism which allows trading off replication costs for fair load distribution. Our detailed experimentation study shows strong gains in both range query processing efficiency and data-access load balancing, with low replication overhead. To our knowledge, this is the first work that concurrently addresses the two conflicting problems using data replication.
[RACTI-RU1-2005-55] Pitoura, Theoni, Ntarmos, Nikos and Triantafillou, Peter, Replication, Load Balancing, and Efficient Range Query Processing in DHTs, in: 4th Hellenic Data Management Symposium (HDMS 2005), pages 131-148, Springer Berlin / Heidelberg, 2005.
Abstract: We consider the conflicting problems of ensuring data-access load balancing and efficiently processing range queries on peer-to-peer data networks maintained over Distributed Hash Tables (DHTs). Placing consecutive data values in neighboring peers is frequently used in DHTs since it accelerates range query processing. However, such a placement is highly susceptible to load imbalances, which are preferably handled by replicating data (since replication also introduces fault tolerance benefits). In this paper, we present HotRoD, a DHT-based architecture that deals effectively with this combined problem through the use of a novel locality-preserving hash function, and a tunable data replication mechanism which allows trading off replication costs for fair load distribution. Our detailed experimentation study shows strong gains in both range query processing efficiency and data-access load balancing, with low replication overhead. To our knowledge, this is the first work that concurrently addresses the two conflicting problems using data replication.
[RACTI-RU1-2011-7] Becchetti, Luca, Chatzigiannakis, Ioannis and Giannakopoulos, Yiannis, Streaming techniques and data aggregation in networks of tiny artefacts, in: Computer Science Review, volume 5, number 1, pages 27-46, 2011.
Abstract: In emerging pervasive scenarios, data is collected by sensing devices in streams that occur at several distributed points of observation. The size of the data typically far exceeds the storage and computational capabilities of the tiny devices that have to collect and process them. A general and challenging task is to allow (some of) the nodes of a pervasive network to collectively perform monitoring of a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all the data at a few central nodes of the network is unfeasible in networks of low capability devices or in the presence of massive data sets. Two main problems arise in this scenario: (i) the intrinsic complexity of maintaining statistics over a data stream whose size greatly exceeds the capabilities of the device that performs the computation; (ii) composing the partial outcomes computed at different points of observation into an accurate, global statistic over a neighbourhood of interest, which entails coping with several problems, last but not least the receipt of duplicate information along multiple paths of diffusion. Streaming techniques have emerged as powerful tools to achieve the general goals described above, in the first place because they assume a computational model in which computational and storage resources are far exceeded by the amount of data on which computation occurs. In this contribution, we review the main streaming techniques and provide a classification of the computational problems and the applications they effectively address, with an emphasis on decentralized scenarios, which are of particular interest in pervasive networks.