uima-user mailing list archives

From Anuj Kumar <anujs...@gmail.com>
Subject Re: How to process structured input with UIMA?
Date Wed, 02 Mar 2011 15:01:15 GMT
Sounds good.. All the best!

- Anuj

On Wed, Mar 2, 2011 at 7:55 PM, Andreas Kahl <Andreas_Kahl@gmx.net> wrote:

> Anuj and Jan,
>
> Thank you very much for your tips. I think I will try the annotation approach:
> Use a CollectionProcessingEngine to iterate over all the docs in my input XML.
> Instantiate a CAS with the input XML as text.
> Then run an Annotator that converts all XML tags into Annotations (I think I
> am going to set annotation.setBegin() and .setEnd() to something generic
> like 0).
> Based on that I'm going to build up my Pipeline.
> I'll keep you posted as soon as I have some results.
>
> Best Regards
> Andreas
>
>
>
> -------- Original Message --------
> > Date: Wed, 02 Mar 2011 11:46:06 +0100
> > From: "Jörn Kottmann" <kottmann@gmail.com>
> > To: user@uima.apache.org
> > Subject: Re: How to process structured input with UIMA?
>
> > On 3/2/11 11:14 AM, Andreas Kahl wrote:
> > > Mainly I am concerned with the latter:
> > > Those metadata-records would come in as XML with dozens of fields
> > > containing relatively short texts (most less than 255 chars). We need to
> > > perform NLP (tokenization, stemming ...) and some simpler manipulations
> > > like reading 3 fields and constructing a 4th from that.
> > > It would be very desirable to use one Framework for both tasks (in fact
> > > we would use the pipeline to enrich the Metadata-Records with the long
> > > texts).
> > >
> >
> > You could take the XML, parse it, and then construct a short text which
> > contains the content together with annotations to mark the existing
> > structure. This new text with the annotations will be placed in a new view.
> > Afterward you can perform your processing within these annotation bounds.
> >
> > Not sure how you construct the 4th field, but if you can do that
> > directly after the XML parsing, it could be part of the constructed text.
> >
> > With UIMA-AS you should be able to nicely scale the analysis to a few
> > machines.
> >
> > Hope that helps,
> > Jörn
> >
>
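
A minimal sketch of the approach Jörn describes, using only the Java standard library: parse one record, concatenate the field contents into a plain text, and remember each field's character offsets. The class names (FieldTextBuilder, Span, Result) and the sample record are hypothetical and not part of UIMA. In a real pipeline the text would presumably become the document text of a new view and each Span would become an annotation, with real begin/end offsets rather than a generic 0, so that later annotators can work within those bounds.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FieldTextBuilder {

    /** Hypothetical stand-in for a UIMA annotation: field name plus character offsets. */
    public static final class Span {
        public final String field;
        public final int begin;
        public final int end;

        Span(String field, int begin, int end) {
            this.field = field;
            this.begin = begin;
            this.end = end;
        }
    }

    /** The constructed text together with the spans that mark the original structure. */
    public static final class Result {
        public final String text;
        public final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** Builds one text from the direct child elements of the record's root element. */
    public static Result build(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();

        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments between fields
            }
            int begin = text.length();
            text.append(child.getTextContent().trim());
            spans.add(new Span(child.getNodeName(), begin, text.length()));
            text.append('\n'); // separator, so that tokens never run across fields
        }
        return new Result(text.toString(), spans);
    }

    public static void main(String[] args) throws Exception {
        Result r = build("<record><title>Faust</title><author>Goethe</author></record>");
        for (Span s : r.spans) {
            System.out.println(s.field + " [" + s.begin + "," + s.end + "): "
                    + r.text.substring(s.begin, s.end));
        }
    }
}
```

A derived 4th field could be appended in the same loop right after parsing, which would make it part of the constructed text with a span of its own.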
