Type of publication: | Article |
Entered by: | chita |
Title | An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining |
Bibtex cite ID | RACTI-RU1-2013-37 |
Journal | International Journal of Software Engineering and Knowledge Engineering |
Year published | 2013 |
Month | August |
Volume | 23 |
Number | 6 |
Pages | 869-886 |
Keywords | web crawling, xHTML parser, document vector, cluster analysis, weighted Hamming dissimilarity, similarity measure |
Abstract | This article presents a novel crawling and clustering method for extracting and processing cultural data from the web in a fully automated fashion. Our architecture relies upon a focused web crawler to download web documents relevant to culture. The focused crawler is a web crawler that searches and processes only those web pages that are relevant to a particular topic. After downloading the pages, we extract from each document a number of words for each thematic cultural area, filtering the documents with non-cultural content; we then create multidimensional document vectors comprising the most frequent cultural term occurrences. We calculate the dissimilarity between the cultural-related document vectors and for each cultural theme, we use cluster analysis to partition the documents into a number of clusters. Our approach is validated via a proof-of-concept application which analyzes hundreds of web pages spanning different cultural thematic areas. |
Authors | |
Attachments |
IJSEKE_final.pdf (main file) |
|
Publication ID | 1023 |
|
|
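Illustration | The abstract mentions term-frequency document vectors and a weighted Hamming dissimilarity but does not define either. The sketch below shows one plausible reading only: the vocabulary, the weights, and the formula (the weighted share of vector positions in which two documents differ) are assumptions made for illustration and are not taken from the article.

```python
import re
from collections import Counter


def document_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]


def weighted_hamming(x, y, weights):
    """Weighted share of positions where the two vectors differ.

    Assumed definition: sum of weights at differing positions,
    divided by the sum of all weights (result lies in [0, 1]).
    """
    differing = sum(w for a, b, w in zip(x, y, weights) if a != b)
    return differing / sum(weights)


# Hypothetical cultural vocabulary and per-term weights.
vocabulary = ["museum", "painting", "opera"]
weights = [1.0, 2.0, 1.0]

doc_a = document_vector("The museum shows a painting and another painting.", vocabulary)
doc_b = document_vector("The museum hosts an opera.", vocabulary)
dissimilarity = weighted_hamming(doc_a, doc_b, weights)
```

A pairwise matrix of such dissimilarities could then be passed to a clustering routine; the title indicates the article uses a fuzzy clustering algorithm, which this sketch does not attempt to reproduce.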