From: Miles
To: "mahout-user@lucene.apache.org"
Cc: "mahout-user@lucene.apache.org"
Subject: Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date: Fri, 13 Nov 2009 19:48:32 +0000

A very simple way to remove boilerplate (and this is trivial using
MapReduce) is to just remove all duplicate sentences. This does assume
you can extract sentences, i.e. do sentence boundary detection etc.

Miles

On 13 Nov 2009, at 19:06, Ted Dunning wrote:

> This looks like a very nice approach for getting rid of the goo. I often
> advocate using words/phrases/ngrams that are highly predicted by the domain
> name as an alternative for removing boilerplate. That has the advantage
> that it doesn't require training text. In the case of Wikipedia, this is
> not so useful because everything is in the same domain. The domain
> predictor trick will only work if the feature you are using for the input
> is not very content based.
> Thus, this can fail for small domain-focused sites
> or if you use a content-laden URL for the task.
>
> On Fri, Nov 13, 2009 at 10:36 AM, Ken Krugler wrote:
>
>> Hi all,
>>
>> Another issue came up, about cleaning the text.
>>
>> One interested user suggested using nCleaner (see
>> http://www.lrec-conf.org/proceedings/lrec2008/pdf/885_paper.pdf) as a way
>> of tossing boilerplate text that skews text frequency data.
>>
>> Any thoughts on this?
>>
>> Thanks,
>>
>> -- Ken
>>
>> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>>
>>> Might be of interest to all you Mahouts out there...
>>> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>>
>>> Would be cool to get this converted over to our vector format so that we
>>> can cluster, etc.
>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c w e b m i n i n g
>
> --
> Ted Dunning, CTO
> DeepDyve
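
For anyone who wants to try the duplicate-sentence idea from the top of
this message, here is a minimal single-process sketch in Python, written
in a map/reduce shape. It is only one reading of the suggestion: it
assumes "duplicate" means "the same sentence occurs in more than one
document", and it uses a deliberately naive regex as a stand-in for real
sentence boundary detection. All function names and the max_docs
parameter are made up for illustration; none of this is Mahout or Hadoop
API.

```python
import re
from collections import Counter

# Naive sentence splitter: an assumption for illustration only. It splits
# on whitespace that follows '.', '!' or '?'; real sentence boundary
# detection is considerably harder.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty, stripped sentences of a text."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def map_phase(docs):
    """Emit (sentence, doc_id) once per distinct sentence per document."""
    for doc_id, text in docs.items():
        for sentence in set(split_sentences(text)):
            yield sentence, doc_id


def reduce_phase(pairs):
    """Count how many documents each sentence appears in."""
    counts = Counter()
    for sentence, _doc_id in pairs:
        counts[sentence] += 1
    return counts


def remove_duplicate_sentences(docs, max_docs=1):
    """Drop every sentence that occurs in more than max_docs documents."""
    counts = reduce_phase(map_phase(docs))
    return {
        doc_id: [s for s in split_sentences(text) if counts[s] <= max_docs]
        for doc_id, text in docs.items()
    }


if __name__ == "__main__":
    pages = {
        "a": "Home | About. Cats purr loudly. Copyright 2009.",
        "b": "Home | About. Dogs bark. Copyright 2009.",
    }
    for doc_id, sentences in remove_duplicate_sentences(pages).items():
        print(doc_id, sentences)
```

In a real MapReduce job the sentence would be the key, so the counting
falls out of the shuffle; a second pass (or a join against the set of
repeated sentences) would then filter each document. Raising max_docs
keeps sentences that are merely quoted a few times while still dropping
navigation and copyright lines that appear on every page.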