nutch-user mailing list archives

From Suhail Ahmed <ilya...@mac.com>
Subject topic based crawling
Date Sat, 14 May 2005 09:39:06 GMT
Hi,

I am trying to figure out how I can use Nutch to build something
like news.google.com, but only to monitor news about countries on the
State Department's "state sponsored terrorism" list
(http://www.state.gov/s/ct/rls/pgtrpt/2003/31644.htm). I have a URL file
with some 2000 online newspapers. I would like to check whether my
approach is sound, since I suspect it is probably wrong. My first
fetch gets the home pages of the newspapers. I have modified
org.apache.nutch.parse.html.HtmlParser to store only those outlinks
that contain a word from a simple list of nouns related to the topic I am
interested in. Is this right? I am assuming here that by doing so,
the second fetch I perform ends up fetching the actual stories  
related to the links from the home page. It "sort of" works. I say
"sort of" because, unlike news.google, a search on, say, "North
Korea" returns both home pages and, sometimes, the article page itself,
whereas news.google just displays a list of hyperlinks to the actual news
articles. How would I get Nutch search to return the results of only
the second crawl and not the first one? Naturally the second problem
is one of categorizing the actual content. Which parts of Nutch or  
Lucene do I have to work with to categorize (analyze?) the results of  
the second fetch? The third bit is how to determine the timestamp
of the fetched content so I can display the time of publication, as
news.google does. I promise to write up any help provided on the  
Nutch Wiki so others will know how as well. Thanks a lot.
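
To make the outlink filtering concrete, here is a minimal sketch of the
idea in plain Java. It is not the actual change to HtmlParser and it does
not use any Nutch API; the class TopicOutlinkFilter, its methods, and the
example keywords and URLs are all hypothetical, for illustration only. It
assumes an outlink is kept when its URL or anchor text contains any noun
from the list, compared case-insensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: keeps only topic-related outlinks.
class TopicOutlinkFilter {
    private final List<String> keywords = new ArrayList<>();

    TopicOutlinkFilter(List<String> keywords) {
        // Lower-case the nouns once so matching is case-insensitive.
        for (String k : keywords) {
            this.keywords.add(k.toLowerCase(Locale.ROOT));
        }
    }

    // True if the URL or the anchor text mentions any keyword.
    boolean matches(String url, String anchorText) {
        String haystack = (url + " " + anchorText).toLowerCase(Locale.ROOT);
        for (String k : keywords) {
            if (haystack.contains(k)) {
                return true;
            }
        }
        return false;
    }

    // Each outlink is a {url, anchorText} pair; returns the matching ones.
    List<String[]> filter(List<String[]> outlinks) {
        List<String[]> kept = new ArrayList<>();
        for (String[] link : outlinks) {
            if (matches(link[0], link[1])) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        TopicOutlinkFilter f =
            new TopicOutlinkFilter(Arrays.asList("North Korea", "Syria"));
        List<String[]> links = new ArrayList<>();
        links.add(new String[] {"http://example.com/a1", "North Korea talks resume"});
        links.add(new String[] {"http://example.com/sports", "Local team wins"});
        for (String[] link : f.filter(links)) {
            System.out.println(link[0]);
        }
    }
}
```

A plain substring match like this will also keep links where a keyword
appears inside a longer word, and it will miss stories whose headline
does not name the country, which may be part of why the approach only
"sort of" works.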

Suhail
