mahout-user mailing list archives

From: Dmitriy Lyubimov <dlie...@gmail.com>
Subject: Re: Huge classification engine
Date: Fri, 01 Apr 2011 19:52:24 GMT
Thanks a bunch, Julien!

On Fri, Apr 1, 2011 at 12:49 PM, Julien Nioche
<lists.digitalpebble@gmail.com> wrote:
> Dmitriy,
>
> Have a look at Behemoth (https://github.com/jnioche/behemoth). It can be
> used as a bridge between Nutch and Mahout. I've written a module for Mahout
> which generates Mahout vectors from a Behemoth sequence file. There is also
> an IO module which can convert Nutch segments into Behemoth sequence files.
> The combination of the two should do the trick, and you can also use text
> analysis components such as UIMA or GATE to generate additional attributes
> beyond simple tokens.
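>
> For a quick sanity check on what the conversion produced, something like
> the following should dump the keys of the sequence file (an untested
> sketch, and the class name is mine; it only assumes standard Hadoop
> Writable keys and values, which is what these files contain):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.util.ReflectionUtils;
>
> public class DumpSeqFileKeys {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     Path path = new Path(args[0]);
>     FileSystem fs = FileSystem.get(conf);
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
>     try {
>       // the key/value classes are recorded in the file header
>       Writable key = (Writable) ReflectionUtils.newInstance(
>           reader.getKeyClass(), conf);
>       Writable value = (Writable) ReflectionUtils.newInstance(
>           reader.getValueClass(), conf);
>       while (reader.next(key, value)) {
>         System.out.println(key); // typically the document URL/id
>       }
>     } finally {
>       reader.close();
>     }
>   }
> }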
>
> HTH
>
> Julien
>
> On 1 April 2011 19:52, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
>> Yes, that's the problem: how to run the Mahout vectorizers on Nutch
>> data. If anybody knows of a step-by-step howto, please do suggest it. I
>> looked into it briefly and did not find a good solution immediately, so
>> we are using a crawler other than Nutch, one with a more clearly
>> documented API for document archiving than what is immediately
>> available in the Nutch documentation.
>>
>>
>> On Fri, Apr 1, 2011 at 1:24 AM, Dan Brickley <danbri@danbri.org> wrote:
>> > On 1 April 2011 10:00, vineet yadav <vineet.yadav.iiit@gmail.com> wrote:
>> >> Hi,
>> >> I suggest using MapReduce with a crawler architecture for crawling the
>> >> local file system, since parsing HTML pages adds significant overhead.
>> >
>> > Apache Nutch being the obvious choice there - http://nutch.apache.org/
>> >
>> > I'd love to see some recipes documented that show Nutch and Mahout
>> > combined. For an example scenario: crawling some site(s), classifying
>> > them, and making the results available in Lucene/Solr for search and
>> > other apps.
>> > http://wiki.apache.org/nutch/NutchHadoopTutorial looks like a good
>> > start for the Nutch side, but I'm unsure of the hooks / workflow for
>> > Mahout integration.
>> >
>> > Regarding training data for categorisation that targets Wikipedia
>> > categories, you can always pull in the textual content of *external*
>> > links referenced from Wikipedia. For this kind of app you can probably
>> > use the extractions from the DBpedia project, see the various download
>> > files at http://wiki.dbpedia.org/Downloads36 (you'll want at least the
>> > 'external links' file, perhaps 'homepages' and others too). Also the
>> > category information is extracted there, see: "article categories",
>> > "category labels", and "categories (skos)" downloads. The latter gives
>> > some hierarchy, which might be useful for filtering out noise like
>> > admin categories or those that are absurdly detailed or general.
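>> >
>> > Those DBpedia dumps are plain N-Triples, so pulling out (article,
>> > category) pairs is only a few lines of code. A rough, untested sketch
>> > (class name is mine; it assumes one <subject> <predicate> <object> .
>> > triple per line with URI objects, as in the 'article categories' file):
>> >
>> > import java.io.BufferedReader;
>> > import java.io.FileReader;
>> >
>> > public class ArticleCategoryPairs {
>> >   public static void main(String[] args) throws Exception {
>> >     BufferedReader in = new BufferedReader(new FileReader(args[0]));
>> >     String line;
>> >     while ((line = in.readLine()) != null) {
>> >       if (line.startsWith("#")) continue;  // skip comments
>> >       String[] parts = line.split("\\s+");
>> >       if (parts.length < 4) continue;      // not a full triple
>> >       // strip the angle brackets from the URIs
>> >       String article = parts[0].substring(1, parts[0].length() - 1);
>> >       String category = parts[2].substring(1, parts[2].length() - 1);
>> >       System.out.println(article + "\t" + category);
>> >     }
>> >     in.close();
>> >   }
>> > }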
>> >
>> > Another source of indicative text is to cross-reference these
>> > categories to DMoz (http://rdf.dmoz.org/) via common URLs. I started
>> > an investigation of that using Pig, which I should either finish or
>> > write up. But Wikipedia's 'external links' plus the category
>> > hierarchy info should be a good place to start, I'd guess.
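>> >
>> > The join idea itself is simple; in plain Java it looks something like
>> > this (a toy, in-memory version of what the Pig job would do; the TSV
>> > input layout of (url, label) pairs on each side is made up):
>> >
>> > import java.io.BufferedReader;
>> > import java.io.FileReader;
>> > import java.util.HashMap;
>> > import java.util.Map;
>> >
>> > public class JoinByUrl {
>> >   // args[0]: TSV of (url, wikipedia category)
>> >   // args[1]: TSV of (url, dmoz topic)
>> >   public static void main(String[] args) throws Exception {
>> >     Map<String, String> wikiByUrl = new HashMap<String, String>();
>> >     BufferedReader wiki = new BufferedReader(new FileReader(args[0]));
>> >     String line;
>> >     while ((line = wiki.readLine()) != null) {
>> >       String[] f = line.split("\t", 2);
>> >       if (f.length == 2) wikiByUrl.put(f[0], f[1]);
>> >     }
>> >     wiki.close();
>> >     BufferedReader dmoz = new BufferedReader(new FileReader(args[1]));
>> >     while ((line = dmoz.readLine()) != null) {
>> >       String[] f = line.split("\t", 2);
>> >       if (f.length == 2) {
>> >         String cat = wikiByUrl.get(f[0]);
>> >         if (cat != null)  // URL present in both taxonomies
>> >           System.out.println(f[0] + "\t" + cat + "\t" + f[1]);
>> >       }
>> >     }
>> >     dmoz.close();
>> >   }
>> > }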
>> >
>> > cheers,
>> >
>> > Dan
>> >
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
