Subject: Re: How to process structured input with UIMA?
From: Anuj Kumar
To: user@uima.apache.org
Cc: Andreas Kahl
Date: Wed, 2 Mar 2011 16:01:02 +0530

Hi Andreas,

You can define a type in your type system with features for the fields that are present in your metadata, and in the annotator implementation you can read the record XML and populate those features. Once the data for the corresponding fields is in the CAS, you can do language detection, linguistic normalization, entity extraction etc. and structure the content accordingly. If you need to merge two fields and store the result as a new field, then define that field in the type as well.
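Something along these lines might work as a starting point. This is only an untested sketch: the type example.MetadataRecord, its features author/title/normalizedTitle and the XML element names are placeholders you would declare in your own type system descriptor, and it assumes your collection reader puts the raw record XML into the document text.

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.cas.Type;
import org.apache.uima.jcas.JCas;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/**
 * Sketch: parses one metadata record (assumed to arrive as raw XML in the
 * CAS document text) and stores its fields as features of a custom type.
 */
public class MetadataRecordAnnotator extends JCasAnnotator_ImplBase {

  // Placeholder type; declare it in your type system descriptor with
  // string features "author", "title" and "normalizedTitle".
  private static final String RECORD_TYPE = "example.MetadataRecord";

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    try {
      // Parse the record XML out of the document text.
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document xml = builder.parse(
          new InputSource(new StringReader(jcas.getDocumentText())));

      String author = textOf(xml, "author");   // element names are examples
      String title = textOf(xml, "title");

      // One feature structure per record, with one feature per field.
      CAS cas = jcas.getCas();
      Type recordType = cas.getTypeSystem().getType(RECORD_TYPE);
      FeatureStructure record = cas.createFS(recordType);
      record.setStringValue(recordType.getFeatureByBaseName("author"), author);
      record.setStringValue(recordType.getFeatureByBaseName("title"), title);
      // A field derived from other fields is just another feature.
      record.setStringValue(recordType.getFeatureByBaseName("normalizedTitle"),
          title == null ? null : title.trim().toLowerCase());
      cas.addFsToIndexes(record);
    } catch (Exception e) {
      throw new AnalysisEngineProcessException(e);
    }
  }

  /** Text content of the first element with the given tag name, or null. */
  private static String textOf(Document xml, String tag) {
    NodeList nodes = xml.getElementsByTagName(tag);
    return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
  }
}

Downstream engines can then retrieve the indexed feature structure and run language detection etc. on the individual field values.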
Hope it helps.

Regards,
Anuj

On Wed, Mar 2, 2011 at 3:44 PM, Andreas Kahl wrote:

> Hello everyone,
>
> I am currently evaluating UIMA as a possible unified document processing
> framework for our data. On the one hand we need to process large,
> unstructured texts (language detection, linguistic normalization, entity
> extraction etc.); on the other hand we have millions of structured
> metadata records to process.
>
> Mainly I am concerned with the latter:
> Those metadata records would come in as XML with dozens of fields
> containing relatively short texts (most less than 255 chars). We need to
> perform NLP (tokenization, stemming ...) and some simpler manipulations,
> like reading 3 fields and constructing a 4th from them.
> It would be very desirable to use one framework for both tasks (in fact we
> would use the pipeline to enrich the metadata records with the long texts).
>
> Reading the documentation I can imagine three different ways to process
> structured (XML) documents:
> 1. Find some way to add multiple text fields to one CAS, so a
> CAS processor (analysis engine) can access several of those fields at a
> time and manipulate them (not cas.setDocumentText() - as I understand it,
> that would imply having only one input field). Is there an interface in
> the Collection Processing Engine to map XML fields to CAS fields?
> 2. Or am I better off using multiple CAS views? (But the XML fields are
> not different representations of the same content; they contain disjoint
> categories like author or title.)
> 3. Is there possibly some smart way to generate multiple sub-CASes, each
> containing one field?
> In cases 2 and 3 I am unsure whether my analysis engines would still be
> able to access multiple fields at once.
>
> Which of these are feasible at all, and which would you recommend?
>
> Thanks for any hints or experiences.
>
> Andreas
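A minimal sketch of the CAS-views idea raised in point 2 of the quoted message, assuming an upstream component has already created one view per field (the view names "author" and "title" are illustrative) and that the engine's descriptor declares those sofas as input capabilities:

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CASException;
import org.apache.uima.jcas.JCas;

/**
 * Sketch for option 2: each metadata field lives in its own named CAS view
 * (Sofa), and an analysis engine can still see several fields at once by
 * fetching the views it needs. View names are illustrative.
 */
public class FieldViewAnnotator extends JCasAnnotator_ImplBase {

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    try {
      // Assumes an upstream component created these views, e.g. via
      // jcas.createView("author").setDocumentText(authorText).
      JCas authorView = jcas.getView("author");
      JCas titleView = jcas.getView("title");

      String author = authorView.getDocumentText();
      String title = titleView.getDocumentText();

      // Both fields are visible here, so cross-field logic is possible;
      // the combined result goes into a new view of its own.
      jcas.createView("authorTitle").setDocumentText(author + " : " + title);
    } catch (CASException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }
}

One consideration when choosing between views and a single feature structure per record: annotations attach to the text of a particular view, so if downstream engines need to mark token or entity spans inside each field, views give them text to anchor to, whereas plain string features do not carry offsets.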