uima-user mailing list archives

From Eddie Epstein <eaepst...@gmail.com>
Subject Re: Restrictions on sofa data array
Date Tue, 27 Apr 2010 14:17:19 GMT

On Mon, Apr 26, 2010 at 9:56 AM, Klaus Rothenhäusler <rothenha@gmail.com> wrote:
> That's the way I'm using UIMA right now. However, as practically all
> downstream annotators work on tokens, I would find it much more
> intuitive if I could assign annotations as indices into an array of
> tokens. This is especially true for annotations spanning several
> tokens where the input document contains additional markup. In this
> case using the begin and end offsets of the first and last token the
> annotation spans may include unwanted markup. It is clear to me that I
> could define a view containing only the plain text but I'd rather work
> on a string of tokens which for downstream processors I'd consider
> just as much unstructured data as a string of characters is for a
> tokenizer. Having the tokens stored in the data array would have the
> benefit of efficient random access instead of having to iterate over
> an annotation index.

Is the string of tokens essentially the same as a detagged XML document?
Creating a view where the Sofa is detagged text is a common scenario.
It may be useful to keep, in the detagged text view, a cross reference between
the tokens in this view and the same tokens in the original plain text view.
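
As a rough illustration of the cross reference idea, here is a minimal
sketch in plain Java. It does not use the UIMA API, and the names
DetagSketch, Detagged, detag, originalBegin and originalEnd are made up for
this example. It assumes tags are simply anything between '<' and '>' and
ignores entities, comments and CDATA. In a real pipeline the detagged
string would be the Sofa of a second view and the offset map would be
stored in whatever form suits the type system.

```java
import java.util.ArrayList;
import java.util.List;

public class DetagSketch {

    /** Detagged text plus, for each detagged character, its offset in the original. */
    static final class Detagged {
        final String text;
        final int[] originalOffset;

        Detagged(String text, int[] originalOffset) {
            this.text = text;
            this.originalOffset = originalOffset;
        }

        /** Maps an inclusive begin offset in the detagged text to the original text. */
        int originalBegin(int begin) {
            return originalOffset[begin];
        }

        /** Maps an exclusive end offset (must be > 0) in the detagged text to the original text. */
        int originalEnd(int end) {
            if (end <= 0) {
                throw new IllegalArgumentException("end must be > 0");
            }
            // Map the last covered character, then step one past it.
            return originalOffset[end - 1] + 1;
        }
    }

    /** Naive detagger: drops everything between '<' and '>' and records offsets. */
    static Detagged detag(String original) {
        StringBuilder plain = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                plain.append(c);
                offsets.add(i);
            }
        }
        int[] map = new int[offsets.size()];
        for (int i = 0; i < map.length; i++) {
            map[i] = offsets.get(i);
        }
        return new Detagged(plain.toString(), map);
    }

    public static void main(String[] args) {
        String original = "The <b>quick</b> fox";
        Detagged d = detag(original);
        System.out.println(d.text);
        // Token "quick" in the detagged view, mapped back to the original text.
        int begin = d.originalBegin(4);
        int end = d.originalEnd(9);
        System.out.println(original.substring(begin, end));
    }
}
```

Annotations created on the detagged view never cover markup, and the map
gives a way back to the original offsets when they are needed. Mapping a
span of several tokens back may still cover markup in the original, which
is the effect described in the quoted message.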

Does this fit with your scenario?
