Abstract: Collection selection has been a research issue for years. Typically,
in related work, precomputed statistics are employed
in order to estimate the expected result quality of each collection,
and subsequently the collections are ranked accordingly.
Our thesis is that this simple approach is insufficient
for several applications in which the collections typically
overlap. This is the case, for example, for the collections
built by autonomous peers crawling the web. We
argue for the extension of existing quality measures using
estimators of mutual overlap among collections and present
experiments in which this combination outperforms CORI,
a popular approach based on quality estimation. We outline
our prototype implementation of a P2P web search engine,
coined MINERVA, which handles large amounts of
data in a distributed and self-organizing manner. We conduct
experiments which show that taking overlap into account
during collection selection can drastically decrease the
number of collections that have to be contacted in order to
reach a satisfactory level of recall, which is a great step toward
the feasibility of distributed web search.
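The combination argued for above can be pictured as a greedy, overlap-aware re-ranking step on top of a quality-based ranking. The following is a minimal sketch under assumptions, not the paper's actual method: it uses per-collection min-hash synopses as the overlap estimator and discounts each collection's quality score (e.g., a CORI-style score) by its estimated overlap with collections already selected; all names and the discount rule are illustrative.

    import random
    from typing import Dict, List, Set

    def minhash_signature(doc_ids: Set[int], seeds: List[int]) -> List[int]:
        # Compact per-collection synopsis: one minimum hash value per seed.
        return [min(hash((seed, d)) for d in doc_ids) for seed in seeds]

    def estimated_overlap(sig_a: List[int], sig_b: List[int]) -> float:
        # Resemblance (Jaccard) estimate from two min-hash signatures.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    def select_collections(quality: Dict[str, float],
                           signatures: Dict[str, List[int]],
                           budget: int) -> List[str]:
        # Greedy, overlap-aware selection: repeatedly pick the collection whose
        # quality score, discounted by its largest estimated overlap with the
        # collections chosen so far, is highest.
        selected: List[str] = []
        remaining = set(quality)
        while remaining and len(selected) < budget:
            def novelty(c: str) -> float:
                worst = max((estimated_overlap(signatures[c], signatures[s])
                             for s in selected), default=0.0)
                return quality[c] * (1.0 - worst)
            best = max(remaining, key=novelty)
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy usage: three peers, two of which crawled nearly the same pages.
    seeds = list(range(32))
    docs = {"peer_a": set(range(0, 100)),
            "peer_b": set(range(5, 105)),      # heavily overlaps peer_a
            "peer_c": set(range(200, 300))}    # disjoint content
    sigs = {p: minhash_signature(d, seeds) for p, d in docs.items()}
    print(select_collections({"peer_a": 0.9, "peer_b": 0.85, "peer_c": 0.6},
                             sigs, budget=2))

With this discount rule, the heavily overlapping peer is pushed down the ranking even though its raw quality score is high, which is the effect the experiments above measure in terms of recall per contacted collection.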
Abstract: In this paper we examine the problem of searching for some information item in the nodes of a fully
interconnected computer network, where each node contains information relevant to some topic
as well as links to other network nodes that also contain information, not necessarily related to
locally kept information. These links are meant to assist Internet users and mobile software
agents that try to locate specific pieces of information. However, the links do not necessarily point
to nodes containing information of interest to the user or relevant to the aims of the mobile agent.
Thus an element of uncertainty is introduced. For example, when an Internet user or some search
agent lands on a particular network node, they see a set of links that point to information that is,
supposedly, relevant to the current search. Therefore, we can assume that a link points to relevant
information with some unknown probability p that, in general, is related to the number of nodes
in the network (intuitively, as the network grows, this probability tends to zero since adding more
nodes to the network renders some extant links less accurate or obsolete). Consequently, since there
is uncertainty as to whether the links contained in a node's Web page are correct or not, a search
algorithm cannot rely on following the links systematically since it may end up spending too much
time visiting nodes that contain irrelevant information. In this work, we will describe and analyze
a search algorithm that is only allowed to transfer a fixed amount of memory along communication
links as it visits the network nodes. The algorithm is, however, allowed to use one bit of memory at
each node as an "already visited" flag. In this way the algorithm has its memory distributed to the
network nodes, avoiding overloading the network links as it moves from node to node searching for
the information. We work on fully interconnected networks for simplicity and, moreover, because
recent experimental evidence suggests that such networks are a good approximation of the current
structure of the World Wide Web.
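To make the setting concrete, the following is a minimal simulation sketch of such a memory-bounded search on a complete graph. It illustrates the model, not the analyzed algorithm itself: the interpretation that the locally advertised link leads to the target with probability p, and the fallback of moving to an arbitrary unvisited node otherwise, are assumptions made for the example; all names and parameters are illustrative.

    import random

    def search_complete_graph(n: int, target: int, p: float, rng: random.Random) -> int:
        # Searcher state is only its current position (the fixed amount of memory
        # carried along the links); each node stores one "already visited" bit.
        visited = [False] * n
        current = rng.randrange(n)
        visited[current] = True
        steps = 1
        while current != target:
            if rng.random() < p:
                # The link advertised at this node happens to be correct.
                current = target
            else:
                # Stale or irrelevant link: fall back to an arbitrary unvisited
                # node, so no node is ever inspected twice.
                current = rng.choice([v for v in range(n) if not visited[v]])
            visited[current] = True
            steps += 1
        return steps

    # Average number of node visits when link accuracy degrades with size (p ~ 1/n).
    rng = random.Random(0)
    n = 1000
    runs = [search_complete_graph(n, rng.randrange(n), 1.0 / n, rng) for _ in range(200)]
    print(sum(runs) / len(runs))

The per-node visited bits are what keep the simulated searcher from revisiting nodes even though it carries no list of visited nodes itself, which mirrors the distributed-memory idea described above.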