Abstract: Collection selection has been a research issue for years. Typically,
in related work, precomputed statistics are employed
in order to estimate the expected result quality of each collection,
and subsequently the collections are ranked accordingly.
Our thesis is that this simple approach is insufficient
for several applications in which the collections typically
overlap. This is the case, for example, for the collections
built by autonomous peers crawling the web. We
argue for the extension of existing quality measures using
estimators of mutual overlap among collections and present
experiments in which this combination outperforms CORI,
a popular approach based on quality estimation. We outline
our prototype implementation of a P2P web search engine,
coined MINERVA, which handles large amounts of
data in a distributed and self-organizing manner. We conduct
experiments which show that taking overlap into account
during collection selection can drastically decrease the
number of collections that have to be contacted in order to
reach a satisfactory level of recall, which is a great step toward
the feasibility of distributed web search.
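The combination argued for above can be pictured as a greedy, overlap-aware re-ranking step on top of a quality-based ranking. The following is a minimal sketch under assumptions, not the paper's actual method: it uses per-collection min-hash synopses as the overlap estimator and discounts each collection's quality score (e.g., a CORI-style score) by its estimated overlap with collections already selected; all names and the discount rule are illustrative.

    import random
    from typing import Dict, List, Set

    def minhash_signature(doc_ids: Set[int], seeds: List[int]) -> List[int]:
        # Compact per-collection synopsis: one minimum hash value per seed.
        return [min(hash((seed, d)) for d in doc_ids) for seed in seeds]

    def estimated_overlap(sig_a: List[int], sig_b: List[int]) -> float:
        # Resemblance (Jaccard) estimate from two min-hash signatures.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    def select_collections(quality: Dict[str, float],
                           signatures: Dict[str, List[int]],
                           budget: int) -> List[str]:
        # Greedy, overlap-aware selection: repeatedly pick the collection whose
        # quality score, discounted by its largest estimated overlap with the
        # collections chosen so far, is highest.
        selected: List[str] = []
        remaining = set(quality)
        while remaining and len(selected) < budget:
            def novelty(c: str) -> float:
                worst = max((estimated_overlap(signatures[c], signatures[s])
                             for s in selected), default=0.0)
                return quality[c] * (1.0 - worst)
            best = max(remaining, key=novelty)
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy usage: three peers, two of which crawled nearly the same pages.
    seeds = list(range(32))
    docs = {"peer_a": set(range(0, 100)),
            "peer_b": set(range(5, 105)),      # heavily overlaps peer_a
            "peer_c": set(range(200, 300))}    # disjoint content
    sigs = {p: minhash_signature(d, seeds) for p, d in docs.items()}
    print(select_collections({"peer_a": 0.9, "peer_b": 0.85, "peer_c": 0.6},
                             sigs, budget=2))

With this discount rule, the heavily overlapping peer is pushed down the ranking even though its raw quality score is high, which is the effect the experiments above measure in terms of recall per contacted collection.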
Abstract: In this paper we examine the problem of searching for some information item in the nodes of a fully
interconnected computer network, where each node contains information relevant to some topic
as well as links to other network nodes that also contain information, not necessarily related to
locally kept information. These links are meant to assist Internet users and mobile software
agents that try to locate specific pieces of information. However, the links do not necessarily point
to nodes containing information of interest to the user or relevant to the aims of the mobile agent.
Thus an element of uncertainty is introduced. For example, when an Internet user or some search
agent lands on a particular network node, they see a set of links that point to information that is,
supposedly, relevant to the current search. Therefore, we can assume that a link points to relevant
information with some unknown probability p that, in general, is related to the number of nodes
in the network (intuitively, as the network grows, this probability tends to zero since adding more
nodes to the network renders some extant links less accurate or obsolete). Consequently, since there
is uncertainty as to whether the links contained in a node's Web page are correct or not, a search
algorithm cannot rely on following the links systematically since it may end up spending too much
time visiting nodes that contain irrelevant information. In this work, we will describe and analyze
a search algorithm that is only allowed to transfer a fixed amount of memory along communication
links as it visits the network nodes. The algorithm is, however, allowed to use one bit of memory at
each node as an "already visited" flag. In this way the algorithm has its memory distributed to the
network nodes, avoiding overloading the network links as it moves from node to node searching for
the information. We work on fully interconnected networks for simplicity and, moreover, because
recent experimental evidence suggests that such networks are a good approximation of the current
structure of the World Wide Web.
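To make the setting concrete, the following is a minimal simulation sketch of such a memory-bounded search on a complete graph. It illustrates the model, not the analyzed algorithm itself: the interpretation that the locally advertised link leads to the target with probability p, and the fallback of moving to an arbitrary unvisited node otherwise, are assumptions made for the example; all names and parameters are illustrative.

    import random

    def search_complete_graph(n: int, target: int, p: float, rng: random.Random) -> int:
        # Searcher state is only its current position (the fixed amount of memory
        # carried along the links); each node stores one "already visited" bit.
        visited = [False] * n
        current = rng.randrange(n)
        visited[current] = True
        steps = 1
        while current != target:
            if rng.random() < p:
                # The link advertised at this node happens to be correct.
                current = target
            else:
                # Stale or irrelevant link: fall back to an arbitrary unvisited
                # node, so no node is ever inspected twice.
                current = rng.choice([v for v in range(n) if not visited[v]])
            visited[current] = True
            steps += 1
        return steps

    # Average number of node visits when link accuracy degrades with size (p ~ 1/n).
    rng = random.Random(0)
    n = 1000
    runs = [search_complete_graph(n, rng.randrange(n), 1.0 / n, rng) for _ in range(200)]
    print(sum(runs) / len(runs))

The per-node visited bits are what keep the simulated searcher from revisiting nodes even though it carries no list of visited nodes itself, which mirrors the distributed-memory idea described above.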