mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julien Nioche <>
Subject Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing
Date Mon, 30 Nov 2009 11:23:28 GMT
Hi guys,

Why not using Behemoth to deploy your UIMA application on Hadoop? (

Behemoth is meant to do exactly what you described and has already an
adapter for Nutch & WARC archives. It can take a UIMA pear deploy it on a
Hadoop cluster and extract some of the UIMA-generated annotations + store
them at a neutral format which could then be used to generate vectors for
Mahout. The purpose of Behemoth is to facilitate the deployment of NLP
components for large scale processing and act as a bridge between common
inputs (e.g. Nutch, WARC) and other projects (Mahout, Tika) etc...

If we had a mechanism for generating Mahout vectors from Behemoth
annotations we would be able to use other NLP frameworks such as GATE as
well. Doing something like this is on the roadmap for Behemoth anyway but it
sounds like what you are planning to do would be a perfect match.

Any thoughts on this?


DigitalPebble Ltd

2009/11/28 Marc Hofer <>

> Hello everybody,
> having already presented the draft of our architecture, I would like now to
> discuss the second layer more in detail. As mentioned before we have chosen
> UIMA for this layer. The main aggregate currently consists of the Whitespace
> Tokenizer Annotator, the Snowball Annotator (Stemming) and a list-based
> StopwordFilter. Before processing this aggregate in a map-only job in
> Hadoop, we want to filter all HTML tags and forward only this preprocessed
> data to the aggregate. The reason for this is that it is difficult to change
> the document during processing in UIMA and it is impractical to work all the
> time on documents containing HTML tags.
> Furthermore we are planning to add the Tagger Annotator, which implements a
> Hidden Markov Model tagger. Here we aren't sure, which tokens with their
> corresponding part of speech tags to delete or not and so using them for the
> feature extraction. One purpose could be to use at the very beginning only
> substantives and verbs.
> We are very interested in your comments and remarks and it would be nice to
> hear from you.
> Cheers,
> Marc

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message