CSU HAYWARD
DEPARTMENT OF MATHEMATICS AND
COMPUTER SCIENCE
THESIS PRESENTATION
Monday, March 1, 2004; 1-2pm Sc S105C
Speaker: Sushma Sampathkumar
Web Mining for Topics Defined by Complex and Precise Predicates
Several new techniques have been proposed in recent years for topic-specific web mining. Key among them is focused crawling, which crawls topic-specific portions of the web without having to explore all pages. Most existing research on focused crawling considers a simple topic definition that typically consists of one or more keywords connected by an OR operator. In this project we implement an improved strategy for crawling topic-specific portions of the web using complex and precise predicates. A complex predicate will allow the user to specify a topic precisely using Boolean operators such as "AND", "OR" and "NOT". Our work will concentrate first on defining a framework for this kind of complex topic definition, and second on devising a strategy that crawls the topic-specific portions of the web efficiently and with minimal overhead. Our new crawl strategy will improve the performance of topic-specific web crawling by reducing the number of irrelevant pages crawled.
As part of the research, we have built an improved focused crawling engine called "Eureka" that incorporates our new ideas. In this project report we describe the system design and architecture of Eureka, and we present and compare the results obtained using this engine.
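
To illustrate what a topic defined by a complex predicate might look like, the following is a minimal sketch in Python. It is not Eureka's implementation, which this abstract does not describe. The class names (Term, And, Or, Not), the helper functions tokenize and matches, and the example topic are all hypothetical and chosen only for illustration. The sketch assumes a page is relevant when the Boolean predicate holds over the set of words appearing in its text.

```python
import re
from dataclasses import dataclass
from typing import Set, Union


# Hypothetical predicate node types; none of these names come from Eureka.
@dataclass(frozen=True)
class Term:
    word: str


@dataclass(frozen=True)
class Not:
    operand: "Predicate"


@dataclass(frozen=True)
class And:
    left: "Predicate"
    right: "Predicate"


@dataclass(frozen=True)
class Or:
    left: "Predicate"
    right: "Predicate"


Predicate = Union[Term, Not, And, Or]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(pred: Predicate, words: Set[str]) -> bool:
    """Evaluate a Boolean topic predicate against a page's word set."""
    if isinstance(pred, Term):
        return pred.word.lower() in words
    if isinstance(pred, Not):
        return not matches(pred.operand, words)
    if isinstance(pred, And):
        return matches(pred.left, words) and matches(pred.right, words)
    if isinstance(pred, Or):
        return matches(pred.left, words) or matches(pred.right, words)
    raise TypeError(f"unsupported predicate node: {pred!r}")


# Assumed example topic: jaguar AND (habitat OR wildlife) AND NOT car
topic = And(
    Term("jaguar"),
    And(Or(Term("habitat"), Term("wildlife")), Not(Term("car"))),
)

animal_page = "The jaguar's habitat spans much of the rainforest."
car_page = "The new Jaguar car has a habitat-inspired interior."

animal_relevant = matches(topic, tokenize(animal_page))
car_relevant = matches(topic, tokenize(car_page))
```

With a simple OR-of-keywords topic such as "jaguar OR habitat", both example pages would count as relevant. Under the complex predicate only the first one does, which is the kind of precision the abstract refers to. A crawler could use such a check to decide whether to follow a page's outgoing links, although how Eureka makes that decision is not stated here.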