From: "Berlin Brown" <berlin.brown@gmail.com>
To: nutch-user@lucene.apache.org
Date: Sun, 14 Oct 2007 03:58:55 -0400
Subject: Re: Possible public applications with nutch and hadoop

Yeah, you are right. You have to have a constrained set of domains to
search, and to be honest, that works pretty well. The one problem is
that I still get a lot of junk links: I would say about 30% are valid
or interesting, while the rest are more or less worthless. I guess it
is a matter of studying spam filters and weeding the junk out, but I
have been kind of lazy about doing so.

I have already built the kind of site I am describing, based on a
short list of popular domains and using only the very basic parts of
Nutch. My last crawl pulled in about 100k links. You can run a search
and see what you think:

http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled
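In case it helps anyone, the domain whitelist is nothing fancy - just
regex rules in conf/crawl-urlfilter.txt (this is the Nutch 0.9 layout;
the domains below are placeholders, not my real list):

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):

  # accept pages only from the hand-picked domains
  +^http://([a-z0-9]*\.)*example.org/
  +^http://([a-z0-9]*\.)*example.com/

  # reject everything else
  -.

The crawl itself is then the stock one-shot command, something like
(the directory name, depth, and topN here are illustrative, not tuned
values):

  bin/nutch crawl urls -dir crawl.botlist -depth 3 -topN 100000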
On 10/13/07, Pike wrote:
> Hi
>
> > My question: have you built a general site to crawl the internet, and
> > how did you find links that people would be interested in, as opposed
> > to capturing a lot of the junk out there?
>
> Interesting question. Are you planning to build a new Google?
> If you are planning to crawl without any limit to, e.g., a few
> domains, your indexes will go wild very quickly :-)
>
> We are using Nutch now with an extensive list of
> 'interesting domains' - this list is an editorial effort.
> Search results are limited to those domains.
> http://www.labforculture.org/opensearch/custom
>
> Another application would be to use Nutch to crawl
> certain pages, like 'interesting' search results from
> other sites, with a limited depth. This would yield
> 'interesting' indexes.
>
> Yet another application would be to crawl 'interesting'
> RSS feeds with a depth of 1. I haven't got that working
> yet (see the parse-rss discussion these days).
>
> Nevertheless, I am interested in the question: does
> anyone else have examples of "possible public
> applications with Nutch"?
>
> $2c,
> *pike

--
Berlin Brown
http://botspiritcompany.com/botlist/spring/help/about.html
newspirit technologies