Abstract | Collection selection has been a research issue for years. Related
work typically employs precomputed statistics
to estimate the expected result quality of each collection
and then ranks the collections accordingly.
Our thesis is that this simple approach is insufficient
for several applications in which the collections typically
overlap. This is the case, for example, for the collections
built by autonomous peers crawling the web. We
argue for the extension of existing quality measures using
estimators of mutual overlap among collections and present
experiments in which this combination outperforms CORI,
a popular approach based on quality estimation. We outline
our prototype implementation of a P2P web search engine,
called MINERVA, which can handle large amounts of
data in a distributed and self-organizing manner. Our
experiments show that taking overlap into account
during collection selection can drastically decrease the
number of collections that must be contacted to
reach a satisfactory level of recall, an important step toward
the feasibility of distributed web search. |