From: "Berlin Brown" <berlin.brown@gmail.com>
To: nutch-user@lucene.apache.org
Date: Sun, 14 Oct 2007 03:58:55 -0400
Subject: Re: Possible public applications with nutch and hadoop

Yeah, you are right. You have to have a constrained set of domains to
search, and to be honest, that works pretty well. The one problem is
that I still get a lot of junk links: I would say about 30% are valid
or interesting, while the rest are more or less worthless. I guess it
is a matter of studying spam filters and weeding the junk out, but I
have been kind of lazy about doing so.

I have already built the kind of site I am describing, based on a
short list of popular domains and using only the very basic parts of
Nutch. My last crawl pulled in about 100k links. You can run a search
and see what you think:

http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled
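In case it helps anyone, the domain whitelist is nothing fancy - just
regex rules in conf/crawl-urlfilter.txt (this is the Nutch 0.9 layout;
the domains below are placeholders, not my real list):

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):

  # accept pages only from the hand-picked domains
  +^http://([a-z0-9]*\.)*example.org/
  +^http://([a-z0-9]*\.)*example.com/

  # reject everything else
  -.

The crawl itself is then the stock one-shot command, something like
(the directory name, depth, and topN here are illustrative, not tuned
values):

  bin/nutch crawl urls -dir crawl.botlist -depth 3 -topN 100000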
On 10/13/07, Pike wrote:
> Hi
>
> > My question: have you built a general site to crawl the internet, and
> > how did you find links that people would be interested in, as opposed
> > to capturing a lot of the junk out there?
>
> Interesting question. Are you planning to build a new Google?
> If you are planning to crawl without any limit to, e.g., a few
> domains, your indexes will go wild very quickly :-)
>
> We are using Nutch now with an extensive list of
> 'interesting domains' - this list is an editorial effort.
> Search results are limited to those domains.
> http://www.labforculture.org/opensearch/custom
>
> Another application would be to use Nutch to crawl
> certain pages, like 'interesting' search results from
> other sites, with a limited depth. This would yield
> 'interesting' indexes.
>
> Yet another application would be to crawl 'interesting'
> RSS feeds with a depth of 1. I haven't got that working
> yet (see the parse-rss discussion these days).
>
> Nevertheless, I am interested in the question: does
> anyone else have examples of "possible public
> applications with Nutch"?
>
> $2c,
> *pike

--
Berlin Brown
http://botspiritcompany.com/botlist/spring/help/about.html
newspirit technologies