mahout-user mailing list archives

From Marc Hofer <m...@marc-hofer.de>
Subject Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing
Date Mon, 30 Nov 2009 13:55:28 GMT

> Hi guys,
>   
Hi Julien,
> Why not using Behemoth to deploy your UIMA application on Hadoop? (
> http://code.google.com/p/behemoth-pebble/)
>   
Behemoth uses HDFS for input and output. So far we have integrated Heritrix 
in combination with the HBase writer ( 
http://code.google.com/p/hbase-writer/ ) and our whole architecture 
focuses on HBase. It would be nice if Behemoth supported HBase in 
the future.

> Behemoth is meant to do exactly what you described and has already an
> adapter for Nutch & WARC archives. It can take a UIMA pear deploy it on a
> Hadoop cluster and extract some of the UIMA-generated annotations + store
> them at a neutral format which could then be used to generate vectors for
> Mahout. The purpose of Behemoth is to facilitate the deployment of NLP
> components for large scale processing and act as a bridge between common
> inputs (e.g. Nutch, WARC) and other projects (Mahout, Tika) etc...
>   
On facilitating the deployment of NLP components, you are 
perfectly right.
> If we had a mechanism for generating Mahout vectors from Behemoth
> annotations we would be able to use other NLP frameworks such as GATE as
> well. Doing something like this is on the roadmap for Behemoth anyway but it
> sounds like what you are planning to do would be a perfect match.
>
> Any thoughts on this?
>
> Julien
>
>   
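The annotation-to-vector step you describe could look roughly like the 
sketch below: collect the covered text of token annotations, map each 
distinct term to a dimension via a dictionary, and count frequencies into 
a sparse vector. All class and method names here are illustrative 
(stdlib-only, standing in for Mahout's sparse vector types), not actual 
Behemoth or Mahout API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turning token annotations (as a Behemoth-style
// document might expose their covered text) into a sparse term-frequency
// vector that could later be handed to Mahout.
public class AnnotationVectorizer {

    // Maps each distinct term to a stable dimension index.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Builds a sparse vector (dimension index -> term frequency)
    // from a list of token strings.
    public Map<Integer, Double> vectorize(List<String> tokens) {
        Map<Integer, Double> vector = new LinkedHashMap<>();
        for (String token : tokens) {
            String term = token.toLowerCase();
            // Assign the next free dimension to unseen terms.
            int dim = dictionary.computeIfAbsent(term, t -> dictionary.size());
            // Accumulate the term frequency for this dimension.
            vector.merge(dim, 1.0, Double::sum);
        }
        return vector;
    }
}
```

A neutral intermediate format like this (term dictionary plus sparse 
counts) would let any NLP framework (UIMA, GATE) feed the same Mahout 
vectorization path.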

Marc
