CSU HAYWARD

DEPARTMENT OF MATHEMATICS AND

COMPUTER SCIENCE

THESIS PRESENTATION

Monday, March 1, 2004; 1-2pm Sc S105C

Speaker: Sushma Sampathkumar

Web Mining for Topics Defined by Complex and Precise Predicates

Several new techniques have been proposed in recent years for topic-specific web mining. A key technique among them is focused crawling, which crawls topic-specific portions of the web without having to explore all pages. Most existing research on focused crawling considers a simple topic definition that typically consists of one or more keywords connected by an OR operator. In this project we implement an improved strategy for crawling topic-specific portions of the web using complex and precise predicates. A complex predicate allows the user to specify a topic precisely using Boolean operators such as "AND", "OR" and "NOT". Our work concentrates first on defining a framework for such complex topic definitions, and second on devising a strategy to crawl the topic-specific portions of the web efficiently and with minimal overhead. Our new crawl strategy aims to improve the performance of topic-specific web crawling by reducing the number of irrelevant pages crawled.
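To illustrate what a topic defined by a complex predicate might look like, the following is a minimal sketch in Python. It is not Eureka's implementation; the class names (Term, Not, And, Or) and the matches function are hypothetical and chosen only for this example, and a page is assumed to be relevant when its text satisfies the Boolean expression over keywords.

```python
import re
from dataclasses import dataclass
from typing import Set, Union


# Hypothetical predicate tree; these names are illustrative only.
@dataclass(frozen=True)
class Term:
    word: str


@dataclass(frozen=True)
class Not:
    operand: "Predicate"


@dataclass(frozen=True)
class And:
    left: "Predicate"
    right: "Predicate"


@dataclass(frozen=True)
class Or:
    left: "Predicate"
    right: "Predicate"


Predicate = Union[Term, Not, And, Or]


def _evaluate(pred: Predicate, words: Set[str]) -> bool:
    """Recursively evaluate a predicate against a set of page words."""
    if isinstance(pred, Term):
        return pred.word.lower() in words
    if isinstance(pred, Not):
        return not _evaluate(pred.operand, words)
    if isinstance(pred, And):
        return _evaluate(pred.left, words) and _evaluate(pred.right, words)
    if isinstance(pred, Or):
        return _evaluate(pred.left, words) or _evaluate(pred.right, words)
    raise TypeError(f"unsupported predicate node: {pred!r}")


def matches(pred: Predicate, page_text: str) -> bool:
    """Return True if the page text satisfies the topic predicate."""
    # Assumption: simple lowercase word tokenization is good enough here.
    words = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    return _evaluate(pred, words)


# Topic: pages about the animal "jaguar", excluding pages about cars.
topic = And(Term("jaguar"), Not(Or(Term("car"), Term("cars"))))
```

With this sketch, matches(topic, "The jaguar is a big cat") accepts the page, while a page mentioning both "jaguar" and "cars" is rejected. A simple OR-of-keywords topic definition could not express that exclusion.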

As part of this research, we have built an improved focused crawling engine called "Eureka" that incorporates our new ideas. In this project report we describe the system design and architecture of Eureka, and we present and compare the results obtained using this engine.