Abstract: This article presents a novel crawling and
clustering method for extracting and processing
cultural data from the web in a fully
automated fashion. Our architecture relies
upon a focused web crawler to download web
documents relevant to culture. The
focused crawler is a web crawler that
searches and processes only those web pages
that are relevant to a particular topic. After downloading the pages, we extract from
each document a number of words for each thematic
cultural area, filtering out the documents
with non-cultural content; we then create multidimensional document vectors
comprising the most frequent cultural term occurrences.
We calculate the dissimilarity
between the culture-related document vectors
and, for each cultural theme, we use
cluster analysis to partition the documents into
a number of clusters. Our approach is
validated via a proof-of-concept application
which analyzes hundreds of web pages
spanning different cultural thematic areas.
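
As an illustration of the vector-building, filtering, dissimilarity, and clustering steps summarized above, the following is a minimal sketch and not the implementation used in this work. The abstract does not name a dissimilarity measure or a clustering algorithm, so cosine dissimilarity and a simple greedy threshold clustering are assumed here; the vocabulary `MUSIC_TERMS`, the sample documents, the function names, and the threshold value are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical vocabulary for one cultural theme (not taken from the article).
MUSIC_TERMS = ["opera", "orchestra", "concert", "composer"]


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]


def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def threshold_clusters(vectors, max_dissimilarity):
    """Greedy clustering (an assumed stand-in for the cluster analysis):
    a vector joins the first cluster whose first member is close enough,
    otherwise it starts a new cluster."""
    clusters = []
    for index, vector in enumerate(vectors):
        for cluster in clusters:
            seed = vectors[cluster[0]]
            if cosine_dissimilarity(seed, vector) <= max_dissimilarity:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


# Hypothetical stand-ins for downloaded web pages.
documents = [
    "The orchestra played a concert; the concert featured a young composer.",
    "A new opera by the composer opens tonight at the opera house.",
    "Stock markets fell sharply on Monday.",
    "The concert hall hosts the orchestra every week.",
]

vectors = [term_vector(doc, MUSIC_TERMS) for doc in documents]

# Filtering step: drop documents that contain no cultural terms.
kept = [(i, v) for i, v in enumerate(vectors) if any(v)]
kept_indices = [i for i, _ in kept]
kept_vectors = [v for _, v in kept]

# Cluster indices refer to positions in kept_vectors.
clusters = threshold_clusters(kept_vectors, max_dissimilarity=0.5)
```

In this toy run the finance page is filtered out, and the two pages about orchestral concerts fall into one cluster while the opera page forms its own. A production pipeline would more plausibly use an established clustering routine in place of the greedy pass shown here.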