uima-user mailing list archives

From "Andreas Kahl" <Andreas_K...@gmx.net>
Subject Re: How to process structured input with UIMA?
Date Wed, 02 Mar 2011 14:25:29 GMT
Anuj and Jan,

Thank you very much for your tips. I think I will try the annotation approach:
Use a CollectionProcessingEngine to iterate over all the documents in my input XML.
Instantiate a CAS with the input XML as its text.
Then run an Annotator that converts all XML tags into annotations (I think I am going to set
annotation.setBegin() and .setEnd() to something generic like 0).
Based on that, I'm going to build up my pipeline.
I'll keep you posted as soon as I have some results.

Best Regards
Andreas


 
-------- Original Message --------
> Date: Wed, 02 Mar 2011 11:46:06 +0100
> From: "Jörn Kottmann" <kottmann@gmail.com>
> To: user@uima.apache.org
> Subject: Re: How to process structured input with UIMA?

> On 3/2/11 11:14 AM, Andreas Kahl wrote:
> > Mainly I am concerned with the latter:
> > Those metadata-records would come in as XML with dozens of fields
> > containing relatively short texts (most less than 255 chars). We need to
> > perform NLP (tokenization, stemming ...) and some simpler manipulations
> > like reading 3 fields and constructing a 4th from that.
> > It would be very desirable to use one Framework for both tasks (in fact
> > we would use the pipeline to enrich the Metadata-Records with the long
> > texts).
> >
> 
> You could take the XML, parse it, and then construct a short text which
> contains the content together with annotations to mark the existing
> structure. This new text with the annotations will be placed in a new view.
> Afterward you can perform your processing within these annotation bounds.
>
> Not sure how you construct the 4th field, but if you can do that directly
> after the XML parsing, it could be part of the constructed text.
> 
> With UIMA-AS you should be able to nicely scale the analysis to a few 
> machines.
> 
> Hope that helps,
> Jörn
> 
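
The approach Jörn describes above (parse the XML, build one text from the field contents, and mark each field's position with an annotation) might look roughly like the sketch below. It is a framework-neutral illustration in plain Python and does not use the UIMA API: `Span`, `flatten_record`, the `derive` hook, and the sample record are all made up for this example. It assumes flat records whose field names are unique.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for an annotation: field name plus offsets."""
    name: str
    begin: int  # inclusive offset into the flattened text
    end: int    # exclusive offset into the flattened text


def flatten_record(xml_string, derive=None):
    """Return (text, spans) for one metadata record.

    Each child element of the record root contributes one line of text,
    and a Span records where that field's content sits in the text.
    `derive` is an optional hook (an assumption of this sketch) that maps
    the parsed fields to extra fields computed right after parsing.
    """
    root = ET.fromstring(xml_string)
    # Assumes field names are unique; repeated tags would overwrite each other.
    fields = {child.tag: (child.text or "").strip() for child in root}
    if derive is not None:
        fields.update(derive(fields))

    parts, spans, offset = [], [], 0
    for name, value in fields.items():
        spans.append(Span(name, offset, offset + len(value)))
        parts.append(value)
        offset += len(value) + 1  # +1 for the newline separator
    return "\n".join(parts), spans


record = """<record>
  <title>Faust</title>
  <author>Goethe</author>
  <year>1808</year>
</record>"""

# Build a 4th field from three existing ones directly after parsing.
text, spans = flatten_record(
    record,
    derive=lambda f: {"citation": f"{f['author']}: {f['title']} ({f['year']})"},
)

for span in spans:
    # Later steps (tokenization, stemming) would work inside these bounds.
    print(span.name, repr(text[span.begin:span.end]))
```

In a UIMA pipeline the flattened text would presumably become the document text of the new view and each `Span` an annotation over it. The offsets here are Python string indices, so they may need adjusting for a framework that counts positions differently.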
