On Tue, Apr 27, 2010 at 10:59 AM, Thilo Goetz <twgoetz@gmx.de> wrote:
> My understanding is that he wants the tokens as primitives,
> not the characters. Annotation offsets could then be token
> offsets, not character offsets. That's perfectly reasonable
> for some tasks. We usually create annotations with the start
> offset being the start of some token, and the end offset the
> end of some token. Then it's hard to find the tokens that
> are "covered" by the annotation, which is why we have
> subiterators, which are not super efficient. And so on.
> I like the idea, but I have no idea how compatible it is with
> UIMA's idea of views and sofas.
A StringArrayFS can be used as Sofa data. Moreover, a new
annotation type derived from AnnotationBase could carry begin
and end indices that point into the StringArray, and with JCas
its cover class could provide a getCoveredText() method or
other convenience methods.
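
To make that concrete, here is a rough sketch in plain Java of what a
token-offset annotation might look like. It deliberately uses no UIMA
classes: TokenSpan, getCoveredTokens() and the single-space join are
all hypothetical names and choices made for illustration, and the
String[] merely stands in for a StringArray used as Sofa data.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the UIMA API.
final class TokenSpan {
    private final String[] tokens; // stands in for the StringArray Sofa data
    private final int begin;       // index of the first covered token (inclusive)
    private final int end;         // index one past the last covered token (exclusive)

    TokenSpan(String[] tokens, int begin, int end) {
        if (begin < 0 || end < begin || end > tokens.length) {
            throw new IllegalArgumentException("invalid token span: " + begin + ".." + end);
        }
        this.tokens = tokens;
        this.begin = begin;
        this.end = end;
    }

    // With token offsets, the covered tokens are a direct slice of the array.
    List<String> getCoveredTokens() {
        return Arrays.asList(tokens).subList(begin, end);
    }

    // Assumption: tokens are joined with a single space; the original
    // whitespace between tokens is not recoverable from the array alone.
    String getCoveredText() {
        return String.join(" ", getCoveredTokens());
    }

    public static void main(String[] args) {
        String[] tokens = {"UIMA", "supports", "multiple", "views", "."};
        TokenSpan span = new TokenSpan(tokens, 2, 4);
        System.out.println(span.getCoveredTokens());
        System.out.println(span.getCoveredText());
    }
}
```

The point of the sketch is only that finding the covered tokens becomes
an index range lookup rather than a subiterator walk. A real type would
be declared in the type system and generated as a JCas cover class.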
Thanks for explaining the scenario!
Eddie