uima-user mailing list archives

From Klaus Rothenhäusler <rothe...@gmail.com>
Subject Re: Restrictions on sofa data array
Date Mon, 26 Apr 2010 13:56:41 GMT
Hi,
 
> On 4/26/2010 5:20 AM, Klaus Rothenhäusler wrote:
> > Hello,
> > could somebody explain to me why it is not possible to assign a data
> > array of other than primitive type or at least strings to a sofa? 
> 
> I may not be quite understanding the question - but you can set the
> Subject Of Analysis (SOFA) to be the value of a string; see
> http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.aas.setting_sofa_data

Sorry, what I meant was an array of strings but ideally an array of
feature structures.

> > For
> > me it would be much more "natural" to have downstream annotators
> > produce annotations on a view containing an array of some token type
> > after a tokenizer AE has processed an input text.
> >   
> 
> Annotators have access to two kinds of input data: "unstructured" and
> "structured".  The "unstructured" is the Subject Of Analysis, and as
> such, is thought to be a text or binary data.  The "structured" is all
> the feature structures in the CAS, computed by previous annotators.
> 
> I think, in your use-case, the array of some token type is usually
> considered "structured" data (it's in the CAS already as feature
> structures of type "Token", for instance).  Let's suppose that the Token
> feature structure includes "begin" and "end" (integer) features that
> identify where the token starts (assuming here, that the input in the
> SOFA is a linear string of characters - I make this assumption because
> in the more general case, it could be something else, for instance, a 2
> or 3 dimensional array representing pixels in 2 or 3 dimensional space). 
> 
> In this case, the usual practice is to have a downstream annotator take
> the strings represented by these offsets and do further processing with
> them (e.g., assign parts-of-speech, etc.).  The way an annotator would
> access these would be to access the Token feature structure, and use its
> features as data, and/or to use the begin / end offsets into the SOFA to
> access parts of that.
> 
> Does this clarify things?  If not, please ask more questions.

That's the way I'm using UIMA right now. However, as practically all
downstream annotators work on tokens, I would find it much more
intuitive if I could assign annotations as indices into an array of
tokens. This is especially true for annotations spanning several
tokens where the input document contains additional markup. In this
case, a span running from the begin offset of the first token to the
end offset of the last token may include unwanted markup. It is clear
to me that I could define a view containing only the plain text, but
I'd rather work on a sequence of tokens, which for downstream
processors I'd consider just as much unstructured data as a string of
characters is for a tokenizer. Having the tokens stored in the data
array would have the
benefit of efficient random access instead of having to iterate over
an annotation index.
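
To make the markup problem concrete, here is a minimal sketch in plain
Java. It does not use the UIMA API: the Token class, the method names
and the sample text are hypothetical stand-ins chosen only for
illustration. It compares a character-offset span over three tokens
with the same span addressed by token index.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetSpanDemo {
    /** Hypothetical stand-in for a Token annotation: character offsets into the text. */
    static final class Token {
        final int begin;
        final int end;
        Token(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Text covered when a span runs from the first token's begin to the last token's end. */
    static String coveredByOffsets(String text, List<Token> tokens, int first, int last) {
        return text.substring(tokens.get(first).begin, tokens.get(last).end);
    }

    /** Text obtained by addressing tokens by index (last is inclusive) and joining them. */
    static String coveredByTokenIndices(String text, List<Token> tokens, int first, int last) {
        StringBuilder sb = new StringBuilder();
        for (int i = first; i <= last; i++) {
            if (i > first) sb.append(' ');
            Token t = tokens.get(i);
            sb.append(text, t.begin, t.end);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample document with inline markup between the tokens (made-up example).
        String text = "New <b>York</b> City";
        List<Token> tokens = Arrays.asList(new Token(0, 3), new Token(7, 11), new Token(16, 20));
        System.out.println(coveredByOffsets(text, tokens, 0, 2));
        System.out.println(coveredByTokenIndices(text, tokens, 0, 2));
    }
}
```

The offset-based span returns the text with the <b> tags still in it,
while the index-based span yields only the three tokens.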

My solution at the moment is to define a tokenArray of type FSArray
with elements of type Token as a feature of a Doc annotation. For
downstream annotators I define an additional TokenAnnotation type
derived from Annotation which has a begin_index and end_index feature
that point into this array. I find this solution awkward, though,
because in my opinion the token array has no place in the type
system. The token sequence is really just another view of a document.
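
The shape of this workaround can be sketched in plain Java as follows.
This only models the data layout and is not UIMA or JCas code: the
class names mirror the types described above, the camel-case field
names are my own, and treating the end index as inclusive is an
assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenIndexSketch {
    /** Simplified token; in the real type system this would be a feature structure. */
    static final class Token {
        final String text;
        Token(String text) { this.text = text; }
    }

    /** Models the Doc annotation whose tokenArray feature holds all tokens in order. */
    static final class Doc {
        final Token[] tokenArray;
        Doc(Token... tokens) { this.tokenArray = tokens; }
    }

    /** Models an annotation that points into Doc.tokenArray instead of into the text. */
    static final class TokenAnnotation {
        final int beginIndex;
        final int endIndex; // assumed inclusive for this sketch

        TokenAnnotation(int beginIndex, int endIndex) {
            this.beginIndex = beginIndex;
            this.endIndex = endIndex;
        }

        /** Direct array access to the covered tokens; no annotation index is iterated. */
        List<Token> covered(Doc doc) {
            List<Token> out = new ArrayList<>();
            for (int i = beginIndex; i <= endIndex; i++) {
                out.add(doc.tokenArray[i]);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc(new Token("New"), new Token("York"), new Token("City"),
                          new Token("is"), new Token("big"));
        TokenAnnotation name = new TokenAnnotation(0, 2);
        for (Token t : name.covered(doc)) {
            System.out.println(t.text);
        }
    }
}
```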

I hope this makes my point a bit clearer.

Thanks for your reply
--Klaus


