Subject: Re: Using uima pipeline as an API
From: Richard Eckart de Castilho
Date: Thu, 18 Jul 2013 09:54:54 +0200
To: user@uima.apache.org

> I have this particular requirement for an API that we wrap over a Uima
> pipeline.
>
> public List<String> analyse(String inputFolderPath, String modelName);
>
> This method is supposed to accept a collection of files (residing in the
> inputFolderPath), run the files (as CAS) through a pipeline of UIMA AEs, and
> return the results (one String per CAS).
>
> To return the strings, I will need to somehow access the CAS after the AEs
> have finished their job and transform/extract whatever is inside the CAS into
> the string that I will return to the caller of this method.
>
> But if I run the AEs using a SimplePipeline.runPipeline()
> How can I get hold of the CAS that are coming out of the AEs?
> Do I attach a CAS Consumer at the tail of the pipeline and read the CAS
> contents at that point? Then I append each result to the List<String> that I
> constructed at the beginning.

You should take a look at the JCasIterable (cf. [1] - example in Groovy, but
JCasIterable is a Java class and works nicely in Java too; I just have no
example in Java).

JCasIterable basically allows you to iterate over the CASes produced by your
pipeline.
In such a loop, you can extract and collect the data you need from the CASes,
e.g. putting it into a List<String> and returning it. Mind that you should
*not* try to keep hold of full CASes or FeatureStructures (including
Annotations and stuff). You need to copy the data out of the CAS; otherwise
it will be corrupted.

> If so, is this scalable?

Well… up to a point, but not in general.

> If I have thousands of files in the inputFolderPath, and if the strings are
> very large, would I run out of memory soon?
> Is there a more scalable way to do this?

You could write your strings to a file and then return an implementation of
List<String> which directly accesses the file. Depending on how much you want
to scale, you'll have to look into different solutions. The easiest would be
to buy more memory; the most complex would probably be porting your stuff to
some kind of cluster. The latter will most likely require a change of API,
possibly even of the whole processing paradigm. List<String> most probably
won't do then ;)

Cheers,

-- Richard

[1] http://code.google.com/p/dkpro-core-asl/wiki/GroovyRecipies#OpenNLP_Part-of-speech_tagging_pipeline_using_JCasIterable_and_c=
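
To illustrate the "List backed by a file" idea mentioned above, here is a
minimal sketch using only the JDK. The class name FileBackedStringList and its
write() factory are made up for this example and are not part of UIMA or
uimaFIT. It assumes the results are written once and only read afterwards, and
that each individual string still fits into memory.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical read-only List<String> whose elements live in a temp file.
 * Only one byte offset per element is kept in memory.
 */
final class FileBackedStringList extends AbstractList<String> {
    private final Path file;
    // offsets.get(i) is the start of element i; the last entry is the file length.
    private final List<Long> offsets;

    private FileBackedStringList(Path file, List<Long> offsets) {
        this.file = file;
        this.offsets = offsets;
    }

    /** Streams the given results to a temp file and returns a list view on it. */
    static FileBackedStringList write(Iterable<String> results) {
        try {
            Path file = Files.createTempFile("results", ".bin");
            file.toFile().deleteOnExit();
            List<Long> offsets = new ArrayList<>();
            long position = 0;
            try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file))) {
                for (String result : results) {
                    byte[] bytes = result.getBytes(StandardCharsets.UTF_8);
                    offsets.add(position);
                    out.write(bytes);
                    position += bytes.length;
                }
            }
            offsets.add(position); // end marker, so every element has a known length
            return new FileBackedStringList(file, offsets);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String get(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("Index: " + index);
        }
        long start = offsets.get(index);
        byte[] bytes = new byte[(int) (offsets.get(index + 1) - start)];
        // Read just this element's bytes from the file.
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(start);
            raf.readFully(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public int size() {
        return offsets.size() - 1;
    }
}
```

In the analyse() method, the Iterable passed to write() would be whatever
produces one extracted String per CAS, so that the strings go to disk as they
are produced instead of piling up in an ArrayList. Reopening the file on every
get() keeps the sketch short; a real implementation would probably keep the
file open and offer a way to close it.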