uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Scaling using Hadoop
Date Wed, 05 Oct 2011 20:43:14 GMT
We use Hadoop with UIMA.  Here's the "fit", in one case:

1) UIMA runs as the map step; we put the UIMA pipeline into the mapper.  Hadoop's
Mapper has a configure method (called setup in the newer mapreduce API) where you
can put the creation and setup of the UIMA pipeline, similar to UIMA's initialize.
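
For illustration only, here is roughly what that looks like with the newer
mapreduce API (setup() rather than the old configure()).  The class name
UimaMapper, the descriptor name MyPipeline.xml, and the key/value types are
placeholders, not our actual code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.cas.CAS;
import org.apache.uima.util.XMLInputSource;

public class UimaMapper extends Mapper<LongWritable, Text, Text, Text> {

  private AnalysisEngine ae;  // the UIMA pipeline, built once per task
  private CAS cas;            // reused across map() calls

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    try {
      // Build the pipeline here, much like UIMA's initialize.
      // "MyPipeline.xml" is a placeholder for your aggregate descriptor,
      // e.g. shipped in the job jar or via the distributed cache.
      AnalysisEngineDescription desc = UIMAFramework.getXMLParser()
          .parseAnalysisEngineDescription(new XMLInputSource("MyPipeline.xml"));
      ae = UIMAFramework.produceAnalysisEngine(desc);
      cas = ae.newCAS();
    } catch (Exception e) {
      throw new IOException("could not create UIMA pipeline", e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (ae != null) {
      ae.destroy();
    }
  }

  // map() is sketched under steps 3 and 4 below.
}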

2) Write a Hadoop RecordReader that reads input from Hadoop's input splits and
produces the items that will go into individual CASes.  These are the input to the
map step.
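
A sketch of such an InputFormat, assuming the simplest possible layout (one
document per input line), in which case we can just delegate to Hadoop's stock
LineRecordReader; a real corpus would likely need its own reader, and the name
DocumentInputFormat is made up:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// One record per input line; each line becomes the text of one CAS.
public class DocumentInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    return new LineRecordReader();
  }
}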

3) The map step takes the input (a string, say), puts it into a CAS, and then
calls the process() method on the engine it set up and initialized in step 1
(steps 3 and 4 are sketched together below).

4) When the process method returns, the CAS has all the results; iterate through
it, extract whatever you want, put those values into your Hadoop output object,
and emit it.
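
Continuing the hypothetical UimaMapper from step 1, steps 3 and 4 together might
look like this; the output choice (annotation type name as key, covered text as
value) is just an example:

  // Inside the UimaMapper sketched in step 1; also needs
  // import org.apache.uima.cas.text.AnnotationFS;
  // import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      // Step 3: put the input string into the CAS and run the pipeline.
      cas.reset();
      cas.setDocumentText(value.toString());
      ae.process(cas);

      // Step 4: the CAS now has all the results; pull out whatever you
      // want and emit it.  Here: (annotation type name, covered text).
      for (AnnotationFS a : cas.getAnnotationIndex()) {
        context.write(new Text(a.getType().getName()),
                      new Text(a.getCoveredText()));
      }
    } catch (AnalysisEngineProcessException e) {
      throw new IOException("UIMA processing failed", e);
    }
  }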

5) The reduce step can take all of these output objects (which can be sorted as
you wish) and do whatever you want with them. 
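
What the reducer does is entirely application-specific; as one made-up example
that matches the map output sketched above, counting annotations per type:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AnnotationCountReducer extends Reducer<Text, Text, Text, LongWritable> {
  @Override
  protected void reduce(Text type, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long count = 0;
    for (Text ignored : values) {  // one value per annotation of this type
      count++;
    }
    context.write(type, new LongWritable(count));
  }
}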

We usually replicate our data 2x in the Hadoop Distributed File System (HDFS), so
that big runs don't fail due to single disk-drive failures.
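
For completeness, a hypothetical driver that wires the sketched classes together;
setting dfs.replication per job is just one way to get the 2x replication (it's
usually set cluster-wide in hdfs-site.xml), and all class names here are the
placeholders introduced above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UimaJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 2);       // 2x replication for job output

    Job job = new Job(conf, "uima-pipeline"); // Hadoop 0.20/1.x style
    job.setJarByClass(UimaJobDriver.class);
    job.setInputFormatClass(DocumentInputFormat.class);
    job.setMapperClass(UimaMapper.class);
    job.setReducerClass(AnnotationCountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}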

HTH. -Marshall

On 10/5/2011 2:24 PM, Greg Holmberg wrote:
> On Tue, 27 Sep 2011 01:06:02 -0700, Thilo Götz <twgoetz@gmx.de> wrote:
>
>> On 26/09/11 22:31, Greg Holmberg wrote:
>>>
>>> This is what I'm doing.  I use JavaSpaces (producer/consumer queue), but I'm
>>> sure you can get the same effect with UIMA AS and ActiveMQ.
>>
>> Or Hadoop.
>
> Thilo, could you expand on this?  Exactly how do you use Hadoop to scale UIMA?
>
> What storage do you use under Hadoop (HDFS, Hbase, Hive, etc), and what is
> your final storage destination for the CAS data?
>
> Are you doing on-demand, streaming, or batch processing of documents?
>
> What are your key/value pairs?  URLs?  What's your map step, what's your
> reduce step?
>
> How do you partition?  Do you find the system is load balanced?  What level of
> efficiency do you get?  What level of CPU utilization?
>
> Do you do just document (UIMA) analysis in Hadoop, or also collection
> (multi-doc) analytics?
>
> The fit between UIMA and Hadoop isn't obvious to me.  Just trying to figure it
> out.
>
> Thanks,
>
>
> Greg Holmberg
>
