From: Martin Provencher <mprovencher86@gmail.com>
To: Ted Dunning
Cc: user@mahout.apache.org, Dan Brickley, vineet yadav
Date: Fri, 1 Apr 2011 11:57:40 -0400
Subject: Re: Huge classification engine

Dan, I think what you propose makes a lot of sense. I won't try to use Nutch
for now since we already have our own crawler, but in the future I think it
would be a great addition to our solution.

Here's what I think I'll try:

1. Create a job to parse the DBpedia dumps (the three category-related files)
   and extract all the valuable categories.
2. Use these categories to parse the Wikipedia dump and extract keywords for
   each category.
3. Train a classifier (I haven't decided between the Bayesian and SGD
   implementations).
4. Test it with some text extracted from HTML pages to verify the results.

After this first attempt, I could try adding the 'external links' data from
the Wikipedia dump in step 2 to get more keywords per category.

What do you think of this plan?
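
For step 3, here is a rough sketch of what the SGD route might look like with
Mahout's OnlineLogisticRegression. To be clear, it's only a sketch: the
category list, feature count, and the hashing-based encoding are placeholders
I made up, not anything from our pipeline, and the real categories would come
out of step 1.

  import java.util.Arrays;
  import java.util.List;

  import org.apache.mahout.classifier.sgd.L1;
  import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class CategoryTrainerSketch {

    // Placeholder sizes/labels: the real category list would come from step 1.
    private static final int FEATURES = 100000;
    private static final List<String> CATEGORIES =
        Arrays.asList("Sports", "Science", "Music");

    // Logistic regression trained online with SGD, one output per category, L1 prior.
    private final OnlineLogisticRegression model =
        new OnlineLogisticRegression(CATEGORIES.size(), FEATURES, new L1());

    // Hash keywords into a sparse vector (a crude stand-in for a real feature encoder).
    Vector encode(Iterable<String> keywords) {
      Vector v = new RandomAccessSparseVector(FEATURES);
      for (String w : keywords) {
        int slot = (w.toLowerCase().hashCode() & Integer.MAX_VALUE) % FEATURES;
        v.setQuick(slot, v.getQuick(slot) + 1.0);
      }
      return v;
    }

    // Steps 2-3: one online training pass per (keywords, category) pair from Wikipedia.
    void train(Iterable<String> keywords, String category) {
      model.train(CATEGORIES.indexOf(category), encode(keywords));
    }

    // Step 4: classify the keywords extracted from a crawled HTML page.
    String classify(Iterable<String> keywords) {
      return CATEGORIES.get(model.classifyFull(encode(keywords)).maxValueIndex());
    }
  }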
Dan, for the DMoz categories, I'm not sure how to plug them in since I can't
see an example of their dump (their link is down). I'll check that when I
download the full dump.

Thanks all for your useful answers.

Regards,
Martin

On Fri, Apr 1, 2011 at 11:19 AM, Ted Dunning wrote:

> Bixo is another option. http://bixolabs.com/
>
>
> On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley wrote:
>
>> On 1 April 2011 10:00, vineet yadav wrote:
>> > Hi,
>> > I suggest you use MapReduce with a crawler architecture for crawling
>> > the local file system, since parsing HTML pages adds significant
>> > overhead.
>>
>> Apache Nutch being the obvious choice there - http://nutch.apache.org/
>>
>> I'd love to see some recipes documented that show Nutch and Mahout
>> combined. For example: crawling some site(s), classifying the pages, and
>> having the results available in Lucene/Solr for search and other apps.
>> http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
>> start for the Nutch side, but I'm unsure of the hooks / workflow for
>> Mahout integration.
>>
>> Regarding training data for categorisation that targets Wikipedia
>> categories, you can always pull in the textual content of *external*
>> links referenced from Wikipedia. For this kind of app you can probably
>> use the extractions from the DBpedia project; see the various download
>> files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
>> 'external links' file, perhaps 'homepages' and others too). The category
>> information is also extracted there; see the "article categories",
>> "category labels", and "categories (skos)" downloads. The latter gives
>> some hierarchy, which might be useful for filtering out noise like
>> admin categories or those that are absurdly detailed or general.
>>
>> Another source of indicative text is to cross-reference these
>> categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
>> an investigation of that using Pig, which I should either finish or
>> write up. But Wikipedia's 'external links' plus the category hierarchy
>> info should be a good place to start, I'd guess.
>>
>> cheers,
>>
>> Dan