Abstract: This article presents a novel crawling and
clustering method for extracting and processing
cultural data from the web in a fully
automated fashion. Our architecture relies
upon a focused web crawler to download web
documents relevant to culture. The
focused crawler is a web crawler that
searches and processes only those web pages
that are relevant to a particular topic. After downloading the pages, we extract from
each document a number of words for each thematic
cultural area, filtering out documents
with non-cultural content; we then create multidimensional document vectors
comprising the most frequent cultural term
occurrences. We calculate the dissimilarity
between the culture-related document vectors
and, for each cultural theme, we use
cluster analysis to partition the documents
into a number of clusters. Our approach is
validated via a proof-of-concept application
that analyzes hundreds of web pages
spanning different cultural thematic areas.
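
The vector-building and clustering steps described in this abstract can be illustrated with a minimal sketch. The abstract does not name the dissimilarity measure or the clustering algorithm, so cosine dissimilarity and single-linkage agglomerative clustering are assumptions made here for illustration, as are the vocabulary THEME_TERMS, the threshold value, and all function names.

```python
import math
from collections import Counter

# Hypothetical vocabulary for one cultural theme (an assumption, not from the paper).
THEME_TERMS = ["museum", "painting", "sculpture", "gallery"]

def term_vector(text, terms=THEME_TERMS):
    """Count how often each theme term occurs in a document."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in terms]

def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def cluster_documents(docs, threshold=0.5):
    """Drop documents without theme terms, then merge the closest clusters
    (single linkage) until the closest pair is farther apart than threshold.
    Returns clusters as lists of indices into docs."""
    vectors = {i: term_vector(d) for i, d in enumerate(docs)}
    # Filter out documents with no cultural terms at all.
    clusters = [[i] for i, v in vectors.items() if sum(v) > 0]

    def linkage(a, b):
        return min(cosine_dissimilarity(vectors[i], vectors[j])
                   for i in a for j in b)

    while len(clusters) > 1:
        d, x, y = min((linkage(clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters)))
        if d > threshold:
            break
        merged = clusters.pop(y)  # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters

docs = [
    "museum painting painting gallery",
    "painting museum gallery painting",
    "sculpture sculpture sculpture",
    "stock market prices fell",  # no theme terms, so it is filtered out
]
clusters = cluster_documents(docs)
```

In this toy input the first two documents share the same term counts and end up in one cluster, the sculpture document forms its own cluster, and the last document is discarded as non-cultural.
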
Abstract: Collection selection has been a research issue for years. Typically,
in related work, precomputed statistics are employed
in order to estimate the expected result quality of each collection,
and subsequently the collections are ranked accordingly.
Our thesis is that this simple approach is insufficient
for several applications in which the collections typically
overlap. This is the case, for example, for the collections
built by autonomous peers crawling the web. We
argue for the extension of existing quality measures using
estimators of mutual overlap among collections and present
experiments in which this combination outperforms CORI,
a popular approach based on quality estimation. We outline
our prototype implementation of a P2P web search engine,
coined MINERVA, that allows handling large amounts of
data in a distributed and self-organizing manner. We conduct
experiments which show that taking overlap into account
during collection selection can drastically decrease the
number of collections that have to be contacted in order to
reach a satisfactory level of recall, which is a great step toward
the feasibility of distributed web search.
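
The idea of combining quality with mutual overlap can be sketched as a greedy selection loop. This is only an illustration under stated assumptions: exact sets of document IDs stand in for the overlap estimators mentioned in the abstract, the product of quality and novelty is an assumed way of combining the two, and the collection names and scores are made up.

```python
def select_collections(quality, contents, k):
    """Greedily pick up to k collections, scoring each candidate by
    quality times the fraction of its documents not yet covered."""
    covered = set()
    chosen = []
    candidates = set(quality)
    while candidates and len(chosen) < k:
        def score(name):
            docs = contents[name]
            novelty = len(docs - covered) / len(docs) if docs else 0.0
            return quality[name] * novelty
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(candidates), key=score)
        chosen.append(best)
        covered |= contents[best]
        candidates.remove(best)
    return chosen

# Hypothetical quality scores and collection contents.
quality = {"A": 0.9, "B": 0.85, "C": 0.5}
contents = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},  # overlaps heavily with A
    "C": {6, 7, 8},     # lower quality, but entirely new documents
}

quality_only = sorted(quality, key=quality.get, reverse=True)[:2]
overlap_aware = select_collections(quality, contents, 2)
```

With these toy inputs, a quality-only ranking contacts A and B, which mostly return the same documents, whereas the overlap-aware selection contacts A and C and covers more distinct documents with the same number of collections.
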
Abstract: MINERVA is a novel approach towards P2P Web search
that connects an a-priori unlimited number of peers, each of which maintains
a personal local database and a local search facility. Each peer posts
a small amount of metadata to a physically distributed directory layered
on top of a DHT-based overlay network that is used to efficiently select
promising peers from across the peer population that can best locally execute
a query. This paper proposes a live demonstration of MINERVA,
showcasing the full information lifecycle: crawling web pages, disseminating
metadata to a distributed directory, and executing queries online. We
additionally invite all visitors to instantly join the network by executing
a small piece of software.
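
The directory mechanism described above (peers post per-term metadata, and the directory is used to select promising peers for a query) can be sketched as follows. This is a toy stand-in, not MINERVA's implementation: a modulo over a SHA-1 hash replaces a real DHT overlay, the posted metadata is assumed to be a per-term document count, peers are ranked by a simple sum of those counts, and the class, method, and peer names are hypothetical.

```python
import hashlib
from collections import defaultdict

class Directory:
    """Toy directory: each term is assigned to one of num_nodes nodes,
    which stores the metadata that peers posted for that term."""

    def __init__(self, num_nodes):
        # One term -> {peer: doc_count} table per directory node.
        self.nodes = [defaultdict(dict) for _ in range(num_nodes)]

    def _node_for(self, term):
        # Hash the term to find the node responsible for it.
        digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def post(self, peer, term_stats):
        """A peer publishes its per-term document counts."""
        for term, doc_count in term_stats.items():
            self._node_for(term)[term][peer] = doc_count

    def select_peers(self, query_terms, k):
        """Rank peers by summed document counts over the query terms."""
        scores = defaultdict(int)
        for term in query_terms:
            postings = self._node_for(term).get(term, {})
            for peer, doc_count in postings.items():
                scores[peer] += doc_count
        ranked = sorted(scores, key=lambda p: (-scores[p], p))
        return ranked[:k]

directory = Directory(num_nodes=4)
directory.post("peer1", {"opera": 12, "ballet": 3})
directory.post("peer2", {"opera": 2})
directory.post("peer3", {"ballet": 9, "opera": 1})

best_peers = directory.select_peers(["opera", "ballet"], k=2)
```

Here the query is routed only to the two peers whose posted metadata looks most promising; each of them would then execute the query against its own local database.
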