From: Thilo Götz <twgoetz@gmx.de>
Date: Thu, 06 Oct 2011 07:43:25 +0200
To: user@uima.apache.org
Subject: Re: Scaling using Hadoop
On 05/10/11 22:43, Marshall Schor wrote:
> We use Hadoop with UIMA. Here's the "fit", in one case:
>
> 1) UIMA runs as the map step; we put the UIMA pipeline into the mapper. Hadoop
> has a configure (?) method where you can put the creation and setup of the
> UIMA pipeline, similar to UIMA's initialize.
>
> 2) Write a Hadoop record reader that reads input from Hadoop's "splits" and
> creates the things that would go into individual CASes. These are the input
> to the map step.
>
> 3) The map step takes the input (a string, say), puts it into a CAS, and then
> calls the process() method on the engine it set up and initialized in step 1.
>
> 4) When the process method returns, the CAS has all the results: iterate
> through it, extract whatever you want, put those values into your Hadoop
> output object, and output it.
>
> 5) The reduce step can take all of these output objects (which can be sorted
> as you wish) and do whatever you want with them.

That basically sums it up. We (and that's a different "we" than Marshall's)
use Hadoop only for batch processing, but since that's the only processing
we're currently doing, that works out well. We normally use HDFS as the
underlying storage.

--Thilo

>
> We usually replicate our data 2x in the Hadoop Distributed File System, so
> that big runs don't fail due to single disk-drive failures.
>
> HTH. -Marshall
>
> On 10/5/2011 2:24 PM, Greg Holmberg wrote:
>> On Tue, 27 Sep 2011 01:06:02 -0700, Thilo Götz wrote:
>>
>>> On 26/09/11 22:31, Greg Holmberg wrote:
>>>>
>>>> This is what I'm doing. I use JavaSpaces (producer/consumer queue), but
>>>> I'm sure you can get the same effect with UIMA AS and ActiveMQ.
>>>
>>> Or Hadoop.
>>
>> Thilo, could you expand on this? Exactly how do you use Hadoop to scale UIMA?
>>
>> What storage do you use under Hadoop (HDFS, HBase, Hive, etc.), and what is
>> your final storage destination for the CAS data?
>>
>> Are you doing on-demand, streaming, or batch processing of documents?
>>
>> What are your key/value pairs? URLs? What's your map step, what's your
>> reduce step?
>>
>> How do you partition? Do you find the system is load balanced? What level of
>> efficiency do you get? What level of CPU utilization?
>>
>> Do you do just document-level (UIMA) analysis in Hadoop, or also collection
>> (multi-doc) analytics?
>>
>> The fit between UIMA and Hadoop isn't obvious to me. Just trying to figure
>> it out.
>>
>> Thanks,
>>
>>
>> Greg Holmberg
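[A sketch of the dataflow Marshall describes above. This is NOT real Hadoop or
UIMA code: a real mapper would build a UIMA AnalysisEngine once in Hadoop's
setup/configure method and call its process() per CAS. Every name below
(record_reader, FakePipeline, map_step, reduce_step) is invented for
illustration; only the shape of steps 1-5 is taken from the thread.]

```python
# Framework-free toy of the five steps: record reader -> map -> reduce.
from collections import defaultdict


def record_reader(raw_input):
    """Step 2: turn a raw input 'split' into per-document strings."""
    return [doc for doc in raw_input.split("\n\n") if doc.strip()]


class FakePipeline:
    """Stand-in for a UIMA analysis engine (steps 1 and 3).

    In real Hadoop+UIMA, this would be built once per mapper, not per
    document, because engine initialization is expensive.
    """

    def process(self, cas):
        # Pretend "annotation": record one token annotation per word.
        cas["annotations"] = cas["document_text"].split()


def map_step(pipeline, doc_text):
    """Steps 3-4: wrap the input in a CAS-like dict, run the pipeline,
    then pull results back out as (key, value) pairs for the reducer."""
    cas = {"document_text": doc_text}
    pipeline.process(cas)
    for token in cas["annotations"]:
        yield (token.lower(), 1)


def reduce_step(pairs):
    """Step 5: aggregate the mapper outputs (here: token frequencies)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


if __name__ == "__main__":
    raw = "UIMA runs in the mapper\n\nHadoop drives the mapper"
    pipeline = FakePipeline()  # built once, as in step 1
    pairs = [p for doc in record_reader(raw)
             for p in map_step(pipeline, doc)]
    print(reduce_step(pairs))  # e.g. token counts with 'the': 2, 'mapper': 2
```

The point of the toy is the shape, not the analysis: Hadoop owns the outer
loop (splitting, scheduling, shuffling the key/value pairs to reducers), while
the UIMA engine only ever sees one CAS at a time inside map_step.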