From: Thilo Götz <twgoetz@gmx.de>
Date: Thu, 06 Oct 2011 07:43:25 +0200
To: user@uima.apache.org
Subject: Re: Scaling using Hadoop
On 05/10/11 22:43, Marshall Schor wrote:
> We use Hadoop with UIMA. Here's the "fit", in one case:
>
> 1) UIMA runs as the map step; we put the UIMA pipeline into the mapper. Hadoop
> has a configure (?) method where you can put the creation and setup of the
> UIMA pipeline, similar to UIMA's initialize.
>
> 2) Write a Hadoop record reader that reads input from Hadoop's "splits" and
> creates the things that would go into individual CASes. These are the input
> to the map step.
>
> 3) The map step takes the input (a string, say), puts it into a CAS, and then
> calls the process() method on the engine it set up and initialized in step 1.
>
> 4) When the process method returns, the CAS has all the results: iterate
> through it, extract whatever you want, put those values into your Hadoop
> output object, and output it.
>
> 5) The reduce step can take all of these output objects (which can be sorted
> as you wish) and do whatever you want with them.

That basically sums it up. We (and that's a different "we" than Marshall's)
use Hadoop only for batch processing, but since that's the only processing
we're currently doing, that works out well. We normally use HDFS as the
underlying storage.

--Thilo

>
> We usually replicate our data 2x in the Hadoop Distributed File System, so
> that big runs don't fail due to single disk-drive failures.
>
> HTH. -Marshall
>
> On 10/5/2011 2:24 PM, Greg Holmberg wrote:
>> On Tue, 27 Sep 2011 01:06:02 -0700, Thilo Götz wrote:
>>
>>> On 26/09/11 22:31, Greg Holmberg wrote:
>>>>
>>>> This is what I'm doing. I use JavaSpaces (producer/consumer queue), but
>>>> I'm sure you can get the same effect with UIMA AS and ActiveMQ.
>>>
>>> Or Hadoop.
>>
>> Thilo, could you expand on this? Exactly how do you use Hadoop to scale UIMA?
>>
>> What storage do you use under Hadoop (HDFS, HBase, Hive, etc.), and what is
>> your final storage destination for the CAS data?
>>
>> Are you doing on-demand, streaming, or batch processing of documents?
>>
>> What are your key/value pairs? URLs? What's your map step, what's your
>> reduce step?
>>
>> How do you partition? Do you find the system is load balanced? What level of
>> efficiency do you get? What level of CPU utilization?
>>
>> Do you do just document-level (UIMA) analysis in Hadoop, or also collection
>> (multi-doc) analytics?
>>
>> The fit between UIMA and Hadoop isn't obvious to me. Just trying to figure
>> it out.
>>
>> Thanks,
>>
>>
>> Greg Holmberg
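[A sketch of the dataflow Marshall describes above. This is NOT real Hadoop or
UIMA code: a real mapper would build a UIMA AnalysisEngine once in Hadoop's
setup/configure method and call its process() per CAS. Every name below
(record_reader, FakePipeline, map_step, reduce_step) is invented for
illustration; only the shape of steps 1-5 is taken from the thread.]

```python
# Framework-free toy of the five steps: record reader -> map -> reduce.
from collections import defaultdict


def record_reader(raw_input):
    """Step 2: turn a raw input 'split' into per-document strings."""
    return [doc for doc in raw_input.split("\n\n") if doc.strip()]


class FakePipeline:
    """Stand-in for a UIMA analysis engine (steps 1 and 3).

    In real Hadoop+UIMA, this would be built once per mapper, not per
    document, because engine initialization is expensive.
    """

    def process(self, cas):
        # Pretend "annotation": record one token annotation per word.
        cas["annotations"] = cas["document_text"].split()


def map_step(pipeline, doc_text):
    """Steps 3-4: wrap the input in a CAS-like dict, run the pipeline,
    then pull results back out as (key, value) pairs for the reducer."""
    cas = {"document_text": doc_text}
    pipeline.process(cas)
    for token in cas["annotations"]:
        yield (token.lower(), 1)


def reduce_step(pairs):
    """Step 5: aggregate the mapper outputs (here: token frequencies)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


if __name__ == "__main__":
    raw = "UIMA runs in the mapper\n\nHadoop drives the mapper"
    pipeline = FakePipeline()  # built once, as in step 1
    pairs = [p for doc in record_reader(raw)
             for p in map_step(pipeline, doc)]
    print(reduce_step(pairs))  # e.g. token counts with 'the': 2, 'mapper': 2
```

The point of the toy is the shape, not the analysis: Hadoop owns the outer
loop (splitting, scheduling, shuffling the key/value pairs to reducers), while
the UIMA engine only ever sees one CAS at a time inside map_step.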