nutch-user mailing list archives

From "Berlin Brown" <>
Subject Re: Possible public applications with nutch and hadoop
Date Sun, 14 Oct 2007 07:58:55 GMT
Yeah, you're right.  You have to constrain the set of domains you
search, and to be honest, that works pretty well.  The only problem is
that I still get a lot of junk links.  I would say about 30% are valid
or interesting links, while the rest are more or less worthless.  I
suppose it's a matter of studying spam filters and weeding the junk
out, but I have been kind of lazy about doing so.
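For the junk-link problem, even a naive post-crawl filter can raise that 30% figure. A minimal sketch of the idea, where the heuristics, pattern list, and threshold are purely illustrative guesses and not anything Nutch ships with:

```python
# Hypothetical post-crawl junk filter. The patterns below are made-up
# examples of spammy signals; a real list would come from studying
# what the crawl actually drags in.
import re

SPAM_PATTERNS = [
    re.compile(r"\b(casino|viagra|free-money)\b", re.I),  # assumed junk terms
    re.compile(r"[?&](sid|sessionid)="),                  # session-id URLs
]

def looks_like_junk(url: str, title: str) -> bool:
    """Return True if the link trips any of the naive spam heuristics."""
    text = url + " " + title
    return any(p.search(text) for p in SPAM_PATTERNS)

links = [
    ("http://example.org/article/nutch-tips", "Nutch crawl tips"),
    ("http://spam.example/casino-win", "casino free-money!!!"),
]
kept = [(u, t) for u, t in links if not looks_like_junk(u, t)]
# kept retains only the first link
```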

I have already built the kind of site I'm describing, based on a short
list of popular domains and using only the very basic features of
nutch.  You can search above and see what you think.  My last crawl
gathered about 100k links.
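For anyone wanting to reproduce a constrained crawl like this, the usual place to pin Nutch to a short domain list is the URL filter config (conf/crawl-urlfilter.txt for the one-shot crawl command, conf/regex-urlfilter.txt otherwise). The domains below are placeholders, not my actual list:

```
# conf/crawl-urlfilter.txt -- placeholder domains, substitute your own list
# accept only the whitelisted hosts (and their subdomains)
+^http://([a-z0-9]*\.)*example.org/
+^http://([a-z0-9]*\.)*news.example.com/
# reject everything else
-.
```

With seed URLs for those domains in a urls/ directory, a crawl of roughly this size would then be kicked off with something like `bin/nutch crawl urls -dir crawl -depth 3 -topN 50000` (depth and topN here are illustrative, tune to taste).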

On 10/13/07, Pike <> wrote:
> Hi
> > My question: have you built a general site to crawl the internet, and
> > how did you find links that people would be interested in, as opposed
> > to capturing a lot of the junk out there?
> interesting question. are you planning to build a new google ?
> if you plan to crawl without limiting yourself to, e.g., a few
> domains, your indexes will go wild very quickly :-)
> we are using nutch now with an extensive list of
> 'interesting domains' - this list is an editorial effort.
> search results are limited to those domains.
> another application would be to use nutch to crawl
> certain pages, like 'interesting' search results from
> other sites, with a limited depth. this would yield
> 'interesting' indexes.
> yet another application would be to crawl 'interesting'
> rss feeds with a depth of 1. I haven't got that working
> yet (see the recent parse-rss discussion).
> nevertheless, I am interested in the question:
> anyone else having examples of "possible public
> applications with nutch" ?
> $2c,
> *pike
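Since parse-rss was giving people trouble at the time, one workaround for the depth-1 feed idea is to extract the item links from a feed yourself and hand them to Nutch as a seed list. A minimal sketch using only the Python standard library; the feed XML is inlined here purely for illustration (in practice you would fetch it):

```python
# Extract <item><link> URLs from an RSS 2.0 feed, one per line,
# suitable for dropping into a Nutch 'urls' seed file.
import xml.etree.ElementTree as ET

FEED_XML = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Post one</title><link>http://example.org/one</link></item>
  <item><title>Post two</title><link>http://example.org/two</link></item>
</channel></rss>"""

def seed_urls(feed_xml: str) -> list[str]:
    """Return the item link URLs found in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [link.text for link in root.findall("./channel/item/link")]

urls = seed_urls(FEED_XML)
# urls now holds the two item links, ready to write to a seed file
```

A crawl over that seed file with -depth 1 would then index just the linked articles, which is the 'interesting indexes' effect pike describes.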

Berlin Brown
newspirit technologies
