research unit 1

Type of publication: Article
Entered by: chita
Title: An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining
Bibtex cite ID: RACTI-RU1-2013-37
Journal: International Journal of Software Engineering and Knowledge Engineering
Year published: 2013
Month: August
Volume: 23
Number: 6
Pages: 869-886
Keywords: web crawling, xHTML parser, document vector, cluster analysis, weighted Hamming dissimilarity, similarity measure
This article presents a novel crawling and clustering method for extracting and processing cultural data from the web in a fully automated fashion. Our architecture relies upon a focused web crawler to download web documents relevant to culture. The focused crawler is a web crawler that searches and processes only those web pages that are relevant to a particular topic. After downloading the pages, we extract from each document a number of words for each thematic cultural area, filtering the documents with non-cultural content; we then create multidimensional document vectors comprising the most frequent cultural term occurrences. We calculate the dissimilarity between the cultural-related document vectors and for each cultural theme, we use cluster analysis to partition the documents into a number of clusters. Our approach is validated via a proof-of-concept application which analyzes hundreds of web pages spanning different cultural thematic areas.
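The document-vector and dissimilarity steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, the term weights, and the particular form of the weighted Hamming dissimilarity (the weighted fraction of vector positions that differ) are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter


def document_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def weighted_hamming(u, v, weights):
    """Weighted fraction of positions where the two vectors differ.

    Assumed definition for illustration; the paper may define it differently.
    """
    total = sum(weights)
    differing = sum(w for a, b, w in zip(u, v, weights) if a != b)
    return differing / total


# Hypothetical cultural terms and per-term weights (not from the paper).
vocabulary = ["museum", "theatre", "music"]
weights = [2.0, 1.0, 1.0]

doc_a = document_vector("museum music music exhibition", vocabulary)
doc_b = document_vector("museum theatre festival", vocabulary)
dissimilarity = weighted_hamming(doc_a, doc_b, weights)
```

A pairwise dissimilarity of this kind could then be fed to a clustering step, one cultural theme at a time, as the abstract outlines.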
Tsekouras, G. E.
Gavalas, Damianos
IJSEKE_final.pdf (main file)
Publication ID: 1023