From: Martin Provencher <mprovencher86@gmail.com>
To: Ted Dunning
Cc: user@mahout.apache.org, Dan Brickley, vineet yadav
Date: Fri, 1 Apr 2011 11:57:40 -0400
Subject: Re: Huge classification engine

Dan, I think what you propose makes a lot of sense. I won't try to use Nutch
for now since we already have our own crawler, but in the future I think it
would be a great addition to our solution.

Here's what I think I'll try:

1. Create a job to parse the DBpedia dumps (the three category-related files)
   and extract all the valuable categories.
2. Use these categories to parse the Wikipedia dump and extract keywords for
   each category.
3. Train a classifier (I haven't decided between the Bayesian and SGD
   implementations).
4. Test it with some text extracted from HTML pages to verify the results.

After this first attempt, I could try adding the 'external links' data from
the Wikipedia dump in step 2 to get more keywords per category.

What do you think of this plan?
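
For step 3, here is a rough sketch of what the SGD route might look like with
Mahout's OnlineLogisticRegression. To be clear, it's only a sketch: the
category list, feature count, and the hashing-based encoding are placeholders
I made up, not anything from our pipeline, and the real categories would come
out of step 1.

  import java.util.Arrays;
  import java.util.List;

  import org.apache.mahout.classifier.sgd.L1;
  import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class CategoryTrainerSketch {

    // Placeholder sizes/labels: the real category list would come from step 1.
    private static final int FEATURES = 100000;
    private static final List<String> CATEGORIES =
        Arrays.asList("Sports", "Science", "Music");

    // Logistic regression trained online with SGD, one output per category, L1 prior.
    private final OnlineLogisticRegression model =
        new OnlineLogisticRegression(CATEGORIES.size(), FEATURES, new L1());

    // Hash keywords into a sparse vector (a crude stand-in for a real feature encoder).
    Vector encode(Iterable<String> keywords) {
      Vector v = new RandomAccessSparseVector(FEATURES);
      for (String w : keywords) {
        int slot = (w.toLowerCase().hashCode() & Integer.MAX_VALUE) % FEATURES;
        v.setQuick(slot, v.getQuick(slot) + 1.0);
      }
      return v;
    }

    // Steps 2-3: one online training pass per (keywords, category) pair from Wikipedia.
    void train(Iterable<String> keywords, String category) {
      model.train(CATEGORIES.indexOf(category), encode(keywords));
    }

    // Step 4: classify the keywords extracted from a crawled HTML page.
    String classify(Iterable<String> keywords) {
      return CATEGORIES.get(model.classifyFull(encode(keywords)).maxValueIndex());
    }
  }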
Dan, for the DMoz categories, I'm not sure how to plug them in since I can't
see an example of their dump (their link is down). I'll check that when I
download the full dump.

Thanks all for your useful answers.

Regards,
Martin

On Fri, Apr 1, 2011 at 11:19 AM, Ted Dunning wrote:

> Bixo is another option. http://bixolabs.com/
>
>
> On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley wrote:
>
>> On 1 April 2011 10:00, vineet yadav wrote:
>> > Hi,
>> > I suggest you use MapReduce with a crawler architecture for crawling
>> > the local file system, since parsing HTML pages adds significant
>> > overhead.
>>
>> Apache Nutch being the obvious choice there - http://nutch.apache.org/
>>
>> I'd love to see some recipes documented that show Nutch and Mahout
>> combined. For example: crawling some site(s), classifying the pages, and
>> having the results available in Lucene/Solr for search and other apps.
>> http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
>> start for the Nutch side, but I'm unsure of the hooks / workflow for
>> Mahout integration.
>>
>> Regarding training data for categorisation that targets Wikipedia
>> categories, you can always pull in the textual content of *external*
>> links referenced from Wikipedia. For this kind of app you can probably
>> use the extractions from the DBpedia project; see the various download
>> files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
>> 'external links' file, perhaps 'homepages' and others too). The category
>> information is also extracted there; see the "article categories",
>> "category labels", and "categories (skos)" downloads. The latter gives
>> some hierarchy, which might be useful for filtering out noise like
>> admin categories or those that are absurdly detailed or general.
>>
>> Another source of indicative text is to cross-reference these
>> categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
>> an investigation of that using Pig, which I should either finish or
>> write up. But Wikipedia's 'external links' plus the category hierarchy
>> info should be a good place to start, I'd guess.
>>
>> cheers,
>>
>> Dan