Message-ID: <4D6E1FEE.9020808@gmail.com>
Date: Wed, 02 Mar 2011 11:46:06 +0100
From: Jörn Kottmann
To: user@uima.apache.org
Subject: Re: How to process structured input with UIMA?
In-Reply-To: <20110302101445.274300@gmx.net>

On 3/2/11 11:14 AM, Andreas Kahl wrote:
> Mainly I am concerned with the latter:
> Those metadata-records would come in as XML with dozens of fields containing relatively short texts (most less than 255 chars). We need to perform NLP (tokenization, stemming ...) and some simpler manipulations like reading 3 fields and constructing a 4th from that.
> It would be very desirable to use one Framework for both tasks (in fact we would use the pipeline to enrich the Metadata-Records with the long texts).
>

You could parse the XML and then construct a short text which contains the field contents, together with annotations that mark the existing structure. This new text, with its annotations, is placed in a new view. Afterwards you can perform your processing within these annotation bounds.

I am not sure how you construct the 4th field, but if you can do that directly after the XML parsing, it could be part of the constructed text.
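To make the idea a bit more concrete, here is a rough sketch of the "flatten and remember the bounds" step. It uses only the JDK so it runs without UIMA on the classpath. The class name FieldTextBuilder, the FieldSpan and Result types, and the assumption that each field is a direct child element of the record root are all made up for illustration. In a real annotator you would presumably put the resulting text into a new view (for example with JCas.createView and setDocumentText) and add one annotation per span, using a type from your own type system.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: flattens one metadata record into a text plus field spans. */
final class FieldTextBuilder {

    /** Position of one XML field inside the constructed text (end is exclusive). */
    static final class FieldSpan {
        final String name;
        final int begin;
        final int end;

        FieldSpan(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }
    }

    /** The constructed text and the spans that mark the original structure. */
    static final class Result {
        final String text;
        final List<FieldSpan> spans;

        Result(String text, List<FieldSpan> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** Assumes every field is a direct child element of the record's root element. */
    static Result build(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse record XML", e);
        }

        StringBuilder text = new StringBuilder();
        List<FieldSpan> spans = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and other non-element nodes
            }
            String value = child.getTextContent().trim();
            int begin = text.length();
            text.append(value);
            // Each span would become one annotation in the new view.
            spans.add(new FieldSpan(child.getNodeName(), begin, text.length()));
            text.append('\n'); // separator so that fields do not run together
        }
        return new Result(text.toString(), spans);
    }
}
```

A tokenizer or stemmer can then run over the whole text once, and later components can restrict themselves to the tokens that fall inside a given span. A derived field could be appended in the same loop as one more value with its own span.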
With UIMA-AS you should be able to nicely scale the analysis to a few machines.

Hope that helps,
Jörn