nutch-user mailing list archives

From "Joe Reger, Jr." <...@joereger.com>
Subject RE: topic based crawling
Date Mon, 16 May 2005 13:59:29 GMT

Hi.  I'm interested in these questions as well.  For me, the main question
is how to focus crawling on a given topic while using the entire web.  If
it's hard-focused to a set of sites I'll miss too much new content.  If it's
open to the web it can be hard to maintain scope.  Thanks, Joe 

-----Original Message-----
From: Suhail Ahmed [mailto:ilyanov@mac.com] 
Sent: Saturday, May 14, 2005 5:39 AM
To: nutch-user@incubator.apache.org
Subject: topic based crawling

Hi,

I am trying to figure out how I can use Nutch to build something like
news.google.com, but only to monitor news about countries on the State
Department's "state sponsored terrorism" list
(http://www.state.gov/s/ct/rls/pgtrpt/2003/31644.htm). I have a URL file with
some 2000 online newspapers. I would like to check my approach, which I feel
is probably wrong.

My first fetch gets the home pages of the newspapers. I have modified
org.apache.nutch.parse.html.HtmlParser to store only those outlinks that
contain one of a simple list of nouns related to the topic I am interested
in. Is this right? I am assuming here that, by doing so, the second fetch I
perform ends up fetching the actual stories linked from the home pages. It
"sort of" works. I say "sort of" because, unlike news.google, a search on,
say, "North Korea" returns both home pages and, sometimes, the article page
itself, whereas news.google just displays a list of hyperlinks to the actual
news articles. How would I get Nutch search to return the results of only
the second crawl and not the first one?

Naturally, the second problem is one of categorizing the actual content.
Which parts of Nutch or Lucene do I have to work with to categorize
(analyze?) the results of the second fetch?

The third bit is how do I determine the timestamp on the content fetched, so
I can display the time of publication as news.google does?

I promise to write up any help provided on the Nutch Wiki so others will
know how as well. Thanks a lot.

Suhail
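
The outlink filtering described in the message above (keep only links that mention a noun from a topic list) could be sketched roughly as follows. This is a standalone illustration and not Nutch code: the TopicLinkFilter class, its Link type, and the example terms and URLs are all hypothetical, and wiring such a check into org.apache.nutch.parse.html.HtmlParser is left out because the message does not show how the parser was modified. The sketch assumes ASCII topic terms and matches whole words against both the anchor text and the URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical helper (not part of Nutch): keeps only the outlinks whose
 * anchor text or URL mentions one of the configured topic terms.
 */
final class TopicLinkFilter {

    /** Minimal stand-in for an outlink: target URL plus anchor text. */
    public static final class Link {
        public final String url;
        public final String anchor;

        public Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    private final List<String> terms = new ArrayList<>();

    public TopicLinkFilter(List<String> topicTerms) {
        for (String term : topicTerms) {
            String normalized = normalize(term);
            if (!normalized.isEmpty()) {
                terms.add(normalized);
            }
        }
    }

    /** Lower-cases and collapses each run of non-alphanumerics to one space. */
    public static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    /** True if any topic term occurs as a whole word (or phrase) in the link. */
    public boolean matches(Link link) {
        // Pad with spaces so that "iran" does not match inside "tirana".
        String haystack = " " + normalize(link.anchor + " " + link.url) + " ";
        for (String term : terms) {
            if (haystack.contains(" " + term + " ")) {
                return true;
            }
        }
        return false;
    }

    /** Returns the matching outlinks in their original order. */
    public List<Link> filter(List<Link> outlinks) {
        List<Link> kept = new ArrayList<>();
        for (Link link : outlinks) {
            if (matches(link)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative terms and URLs only.
        TopicLinkFilter filter =
                new TopicLinkFilter(Arrays.asList("North Korea", "Iran"));
        List<Link> outlinks = Arrays.asList(
                new Link("http://example.com/world/north-korea-talks.html", "Read more"),
                new Link("http://example.com/travel/tirana.html", "Weekend in Tirana"));
        for (Link kept : filter.filter(outlinks)) {
            System.out.println(kept.url);
        }
    }
}
```

Matching on the URL as well as the anchor text is a design choice made here because news sites often put a headline slug in the path while the visible link only says "Read more"; a plain substring match would be simpler but would also let "iran" match "tirana".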

