nutch-user mailing list archives

From Suhail Ahmed <>
Subject Re: topic based crawling
Date Mon, 16 May 2005 19:47:41 GMT
Hello Roger,

I have an initial prototype up and running with Nutch. I haven't yet
reached the point where I have enough data to start running
statistical analysis on my little "rogue watch" corpus. My first
objective is to aggregate as much news as possible about some six or
seven countries. The first thing I have noticed is that a lot of news
sites are essentially reporting, either verbatim or as a rehash, what
comes from just a few primary sources. I haven't quite figured out how
to perform similarity analysis on the articles. Going back to your own
question about hard-coding: what I plan to do is enlarge my system's
vocabulary so that I can widen the set of outlinks I follow, but not
before I figure out how Nutch works in detail.
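
For the similarity question, one common starting point is to compare
articles by their overlapping word "shingles" and score the overlap
with Jaccard similarity. The sketch below is only an illustration of
that idea in plain Python. It is not something Nutch provides, and the
function names and the shingle size k=3 are assumptions made for the
example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Score two articles between 0.0 (nothing shared) and 1.0 (same shingles)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

A verbatim copy of a wire story would score 1.0 against its source,
and a light rehash would land somewhere in between. Where to draw the
"same story" threshold is something that would have to be tuned on the
real corpus.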


On May 16, 2005, at 3:59 PM, Joe Reger, Jr. wrote:

> Hi.  I'm interested in these questions as well.  For me, the main  
> question
> is how to focus crawling on a given topic while using the entire  
> web.  If
> it's hard-focused to a set of sites I'll miss too much new  
> content.  If it's
> open to the web it can be hard to maintain scope.  Thanks, Joe
> -----Original Message-----
> From: Suhail Ahmed []
> Sent: Saturday, May 14, 2005 5:39 AM
> To:
> Subject: topic based crawling
> Hi,
> I am trying to figure out how I can use Nutch to build something like
> but only to monitor news about countries on the State
> Department's "state sponsored terrorism" list (http://
>". I have a URL file  
> with some
> 2000 online newspapers. I would like to confirm the veracity of my  
> approach
> which I feel is probably wrong. My first fetch gets the home pages  
> of the
> newspapers. I have a modified  
> org.apache.nutch.parse.html.HtmlParser to
> store only those outlinks which contain a word from a simple list of nouns
> related to
> the topic I am interested in. Is this right? I am assuming here  
> that by
> doing so, the second fetch I perform ends up fetching the actual  
> stories
> related to the links from the home page. It "sort of" works. I say  
> sort of
> because unlike, performing a search on say "North  
> Korea" returns
> both home pages and sometimes, the article page itself where  
> just displays a hyperlink list to the actual news article. How  
> would I get
> nutch search to return the results of only the second crawl and not  
> the
> first one? Naturally the second problem is one of categorizing the  
> actual
> content. Which parts of Nutch or Lucene do I have to work with to  
> categorize
> (analyze?) the results of the second fetch? The third bit is how do I
> determine the timestamp on the content fetched so I can display the  
> time of
> publication as does. I promise to write up any help  
> provided on
> the Nutch Wiki so others will know how as well. Thanks a lot.
> Suhail
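
The outlink filtering described in the original message (keep only
links that mention a topic noun) can be sketched independently of
Nutch as below. This is plain Python over (url, anchor text) pairs and
does not use the HtmlParser API. The term list and function names are
hypothetical. Only "korea" comes from the search example in the
thread, and the other terms are placeholders.

```python
import re

# Hypothetical topic vocabulary; widening the crawl means adding terms here.
TOPIC_TERMS = {"korea", "sanctions", "embassy"}


def matches_topic(url, anchor, terms=TOPIC_TERMS):
    """True if any topic term appears as a whole word in the URL or anchor text."""
    tokens = set(re.findall(r"[a-z]+", (url + " " + anchor).lower()))
    return not tokens.isdisjoint(terms)


def filter_outlinks(outlinks, terms=TOPIC_TERMS):
    """Keep only the (url, anchor) pairs that mention a topic term."""
    return [(url, anchor) for url, anchor in outlinks
            if matches_topic(url, anchor, terms)]
```

Matching on whole words rather than substrings keeps short terms from
firing inside unrelated words, at the cost of missing plurals and
other inflections unless they are listed explicitly.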
