Aigaion: RACTI / RU1 Technical Report Series (Web Based)

[RACTI-RU1-2013-37] Tsekouras, G. E. and Gavalas, Damianos, An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining, in: International Journal of Software Engineering and Knowledge Engineering, volume 23, number 6, pages 869-886, 2013.
Abstract: This article presents a novel crawling and clustering method for extracting and processing cultural data from the web in a fully automated fashion. Our architecture relies upon a focused web crawler to download web documents relevant to culture. The focused crawler is a web crawler that searches and processes only those web pages that are relevant to a particular topic. After downloading the pages, we extract from each document a number of words for each thematic cultural area, filtering the documents with non-cultural content; we then create multidimensional document vectors comprising the most frequent cultural term occurrences. We calculate the dissimilarity between the cultural-related document vectors and for each cultural theme, we use cluster analysis to partition the documents into a number of clusters. Our approach is validated via a proof-of-concept application which analyzes hundreds of web pages spanning different cultural thematic areas.
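The vector-building and dissimilarity steps described in the abstract above can be illustrated with a minimal sketch. The vocabulary, the function names, and the choice of cosine dissimilarity are all assumptions made for illustration; the abstract names neither the term lists nor the dissimilarity measure, and the paper's fuzzy clustering step is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical cultural vocabulary; the abstract does not list the actual terms.
VOCABULARY = ["museum", "painting", "theatre", "music"]


def document_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def is_cultural(vector, min_terms=1):
    """Filter rule (assumed): keep documents with at least min_terms term hits."""
    return sum(vector) >= min_terms


def cosine_dissimilarity(u, v):
    """1 - cosine similarity; an assumed measure, not necessarily the paper's."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A vector with no term hits is treated as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

The resulting pairwise dissimilarities would then be handed to a clustering routine, once per cultural theme.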
[RACTI-RU1-2005-14] Bender, Matthias, Michel, Sebastian, Triantafillou, Peter, Weikum, Gerhard and Zimmer, Christian, Improving Collection Selection with Overlap Awareness, in: the 28th International ACM SIGIR Conference, 2005.
Abstract: Collection selection has been a research issue for years. Typically, in related work, precomputed statistics are employed in order to estimate the expected result quality of each collection, and subsequently the collections are ranked accordingly. Our thesis is that this simple approach is insufficient for several applications in which the collections typically overlap. This is the case, for example, for the collections built by autonomous peers crawling the web. We argue for the extension of existing quality measures using estimators of mutual overlap among collections and present experiments in which this combination outperforms CORI, a popular approach based on quality estimation. We outline our prototype implementation of a P2P web search engine, coined MINERVA, that allows handling large amounts of data in a distributed and self-organizing manner. We conduct experiments which show that taking overlap into account during collection selection can drastically decrease the number of collections that have to be contacted in order to reach a satisfactory level of recall, which is a great step toward the feasibility of distributed web search.
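To make the idea of overlap-aware selection concrete, here is a minimal greedy sketch. It is an illustration only: explicit sets of document ids stand in for the overlap estimators mentioned in the abstract, the quality scores are taken as given, and the quality-times-novelty scoring rule and the name select_collections are assumptions rather than the authors' method.

```python
def select_collections(collections, quality, k):
    """Greedily pick up to k collections, weighting quality by novelty.

    collections: dict name -> set of document ids (stand-in for overlap estimators)
    quality:     dict name -> estimated result quality of that collection
    """
    covered = set()      # documents already reachable via chosen collections
    chosen = []
    remaining = set(collections)

    def score(name):
        docs = collections[name]
        # Novelty: fraction of this collection not yet covered.
        novelty = len(docs - covered) / len(docs) if docs else 0.0
        return quality[name] * novelty

    while remaining and len(chosen) < k:
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        if score(best) == 0:
            break  # nothing left would contribute new documents
        chosen.append(best)
        covered |= collections[best]
        remaining.remove(best)
    return chosen
```

With collections a = {1, 2, 3}, b = {1, 2, 3}, c = {4, 5} and qualities 0.9, 0.8, 0.5, a quality-only ranking would contact a and b, whereas this sketch contacts a and c because b adds no new documents.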
[RACTI-RU1-2005-7] Michel, Sebastian, Bender, Matthias, Weikum, Gerhard, Zimmer, Christian and Triantafillou, Peter, P2P web search with MINERVA: How do you want to search tomorrow, in: ACM/IFIP/USENIX 6th International Middleware Conference, 2005.
Abstract: MINERVA is a novel approach towards P2P Web search that connects an a-priori unlimited number of peers, each of which maintains a personal local database and a local search facility. Each peer posts a small amount of metadata to a physically distributed directory layered on top of a DHT-based overlay network that is used to efficiently select promising peers from across the peer population that can best locally execute a query. This paper proposes a live demonstration of MINERVA, showcasing the full information lifecycle: crawling web pages, disseminating metadata to a distributed directory, and executing queries online. We additionally invite all visitors to instantly join the network by executing a small piece of software.
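The directory lookup described in the abstract above can be pictured with a toy, in-memory sketch. Everything here is hypothetical: the Directory class, its method names, the per-term document count used as metadata, and the simple hash-modulo placement, which is a simplification of how a real DHT assigns keys to nodes. It is not MINERVA's actual interface.

```python
import hashlib


class Directory:
    """Toy term directory spread over a fixed number of in-memory nodes."""

    def __init__(self, num_nodes):
        # Each node stores: term -> {peer name -> document count}.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, term):
        # Hash-modulo placement; a real DHT would use its own key routing.
        digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def post(self, peer, term, doc_count):
        """A peer publishes a small piece of metadata for one term."""
        self._node_for(term).setdefault(term, {})[peer] = doc_count

    def promising_peers(self, query_terms, limit):
        """Rank peers by summed per-term counts (an assumed scoring rule)."""
        totals = {}
        for term in query_terms:
            for peer, count in self._node_for(term).get(term, {}).items():
                totals[peer] = totals.get(peer, 0) + count
        ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
        return [peer for peer, _ in ranked[:limit]]
```

A query initiator would look up each query term, pick the top-ranked peers, and forward the query to them for local execution.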