Hi,
On Mon, Apr 26, 2010 at 9:56 AM, Klaus Rothenhäusler <rothenha@gmail.com> wrote:
> That's the way I'm using UIMA right now. However, as practically all
> downstream annotators work on tokens, I would find it much more
> intuitive if I could assign annotations as indices into an array of
> tokens. This is especially true for annotations spanning several
> tokens where the input document contains additional markup. In this
> case using the begin and end offsets of the first and last token the
> annotation spans may include unwanted markup. It is clear to me that I
> could define a view containing only the plain text but I'd rather work
> on a string of tokens which for downstream processors I'd consider
> just as much unstructured data as a string of characters is for a
> tokenizer. Having the tokens stored in the data array would have the
> benefit of efficient random access instead of having to iterate over
> an annotation index.
Is the string of tokens essentially the same as a detagged XML document?
Creating a view where the Sofa is detagged text is a common scenario.
It may be useful to keep a cross reference between the tokens in the detagged
text view and the same tokens in the original plain text view.
Does this fit with your scenario?
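
To make the cross-reference idea concrete, here is a minimal sketch in plain
Java. It deliberately uses no UIMA API. The class and method names
(DetagSketch, detag, tokenize, toOriginal) are hypothetical, and it assumes
that markup is simply anything between '<' and '>' and that tokens are
separated by whitespace. A real annotator would store the mapping in the CAS
rather than in an int array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of UIMA.
class DetagSketch {

    /** Detagged text plus, for each of its characters, that character's offset in the original. */
    static final class Detagged {
        final String text;
        final int[] originalOffset;

        Detagged(String text, int[] originalOffset) {
            this.text = text;
            this.originalOffset = originalOffset;
        }
    }

    /** Drops everything between '<' and '>' (assumed markup) and records where each kept character came from. */
    static Detagged detag(String original) {
        StringBuilder plain = new StringBuilder();
        int[] map = new int[original.length()];
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                map[plain.length()] = i;
                plain.append(c);
            }
        }
        return new Detagged(plain.toString(), Arrays.copyOf(map, plain.length()));
    }

    /** Whitespace tokenizer; each token is {begin, end} with an exclusive end, in detagged offsets. */
    static List<int[]> tokenize(String text) {
        List<int[]> tokens = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean ws = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;
            } else if (ws && start >= 0) {
                tokens.add(new int[] {start, i});
                start = -1;
            }
        }
        return tokens;
    }

    /** Maps a token span in the detagged text back to {begin, end} offsets in the original text. */
    static int[] toOriginal(Detagged d, int[] token) {
        int begin = d.originalOffset[token[0]];
        int end = d.originalOffset[token[1] - 1] + 1;
        return new int[] {begin, end};
    }

    public static void main(String[] args) {
        String original = "<p>New <b>York</b> City</p>";
        Detagged d = detag(original);
        for (int[] token : tokenize(d.text)) {
            int[] span = toOriginal(d, token);
            System.out.println(d.text.substring(token[0], token[1])
                    + " -> original [" + span[0] + ", " + span[1] + ")");
        }
    }
}
```

With this input the detagged text is "New York City", and the token "York"
maps back to offsets 10 to 14 of the original. Downstream annotators can then
work on the token list by index, and the map is only consulted when a result
has to be projected back onto the original document.
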
Regards,
Eddie