mahout-user mailing list archives

From Martin Provencher <>
Subject Re: Huge classification engine
Date Fri, 01 Apr 2011 15:57:40 GMT
Dan, I think what you propose makes a lot of sense. I won't try to use Nutch
for now since we already have our crawler. But in the future, I think it
will be a great thing to add to our solution.

Here is what I think I'll try:
1. Create a job to parse the DBPedia dumps (the 3 on categories) and extract
all the valuable categories.
2. Use these categories to parse the Wikipedia dump to extract keywords for
those categories.
3. Train an algorithm (I haven't decided between Bayesian and SGD)
4. Test it with some text extracted from HTML pages to verify
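
A minimal sketch of step 1, assuming the DBpedia "article categories" dump is
in N-Triples format (one triple per line) and that category URIs start with
http://dbpedia.org/resource/Category:. The function names and the
min_articles threshold are hypothetical, and "valuable" is taken here to mean
"has at least a given number of articles", which is only one possible
definition:

```python
import re
from collections import Counter, defaultdict

# Matches an N-Triples line whose subject, predicate and object are all URIs.
# Lines with literal objects (e.g. labels) or comments do not match.
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")

# Assumed URI layout for DBpedia categories.
CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"


def categories_by_article(lines):
    """Map each article URI to the set of category names it belongs to."""
    mapping = defaultdict(set)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match is None:
            continue  # skip comments, blank lines and literal-valued triples
        subject, _predicate, obj = match.groups()
        if obj.startswith(CATEGORY_PREFIX):
            mapping[subject].add(obj[len(CATEGORY_PREFIX):])
    return mapping


def frequent_categories(mapping, min_articles):
    """Return, sorted by name, the categories with at least min_articles articles."""
    counts = Counter(cat for cats in mapping.values() for cat in cats)
    return sorted(cat for cat, n in counts.items() if n >= min_articles)
```

On the full dump the same per-line logic would fit in a map-reduce job, with
the parsing in the mapper and the per-category counting in the reducer.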

After this first attempt, I could add the external links from the Wikipedia
dump in step 2 to get more keywords per category. What do you think of this?

Dan, for the DMoz categories, I'm not sure how to plug them in since I can't
see an example of their dump (their link is down). I'll check that when I
download the full dump.

Thanks all for your useful answers



On Fri, Apr 1, 2011 at 11:19 AM, Ted Dunning <> wrote:

> Bixo is another option.
> On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <> wrote:
>> On 1 April 2011 10:00, vineet yadav <> wrote:
>> > Hi,
>> > I suggest using MapReduce with a crawler architecture for crawling the
>> > local file system, since parsing HTML pages adds overhead and
>> > delays.
>> Apache Nutch being the obvious choice there -
>> I'd love to see some recipes documented that show Nutch and Mahout
>> combined. An example scenario: crawling some site(s), classifying, and
>> having the results available in Lucene/Solr for search and other apps.
>> looks like a good
>> start for the Nutch side, but I'm unsure of the hooks / workflow for
>> Mahout integration.
>> Regarding training data for categorisation that targets Wikipedia
>> categories, you can always pull in the textual content of *external*
>> links referenced from Wikipedia. For this kind of app you can probably
>> use the extractions from the DBpedia project, see the various download
>> files at (you'll want at least the
>> 'external links' file, perhaps 'homepages' and others too). Also the
>> category information is extracted there, see: "article categories",
>> "category labels", and "categories (skos)" downloads. The latter gives
>> some hierarchy, which might be useful for filtering out noise like
>> admin categories or those that are absurdly detailed or general.
>> Another source of indicative text is to cross-reference these
>> categories to DMoz ( via common URLs. I started
>> an investigation of that using Pig, which I should either finish or
>> write up. But Wikipedia's 'external links' plus using the category
>> hierarchy info should be a good place to start, I'd guess.
>> cheers,
>> Dan
