uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Restrictions on sofa data array
Date Mon, 26 Apr 2010 12:13:30 GMT


On 4/26/2010 5:20 AM, Klaus Rothenhäusler wrote:
> Hello,
> could somebody explain to me why it is not possible to assign a data
> array of anything other than a primitive type, or at least strings, to a sofa?

I may not have fully understood the question - but you can set the
Subject Of Analysis (SOFA) to be the value of a string; see
http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.aas.setting_sofa_data

> For
> me it would be much more "natural" to have downstream annotators
> produce annotations on a view containing an array of some token type
> after a tokenizer AE has processed an input text.
>   

Annotators have access to two kinds of input data: "unstructured" and
"structured".  The "unstructured" data is the Subject Of Analysis, and as
such is assumed to be text or binary data.  The "structured" data is all
the feature structures in the CAS, computed by previous annotators.

I think, in your use-case, the array of some token type is usually
considered "structured" data (it's in the CAS already as feature
structures of type "Token", for instance).  Let's suppose that the Token
feature structure includes "begin" and "end" (integer) features that
identify where the token starts and ends.  (I'm assuming here that the
input in the SOFA is a linear string of characters; in the more general
case it could be something else, for instance a 2- or 3-dimensional
array representing pixels in 2- or 3-dimensional space.)

In this case, the usual practice is to have a downstream annotator take
the strings identified by these offsets and do further processing with
them (e.g., assign parts-of-speech, etc.).  An annotator would do this
by accessing the Token feature structures and using their features as
data, and/or by using the begin / end offsets to read the corresponding
parts of the SOFA.
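
As a rough illustration of that pattern, here is a minimal sketch in plain
Python.  It is only a model of the idea, not the UIMA API: the Token class,
the tokenize / covered_text / tag helpers, and the "NUM" / "WORD" labels are
all hypothetical names invented for this example, and the "end" offset is
assumed to point one past the last character of the token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical stand-in for a Token feature structure."""
    begin: int      # offset of the token's first character in the sofa
    end: int        # offset one past the token's last character (assumption)
    pos: str = ""   # filled in later by a downstream annotator


def tokenize(sofa: str) -> List[Token]:
    """Upstream 'tokenizer': stores offsets only, never copies of the text."""
    tokens = []
    start = None
    for i, ch in enumerate(sofa):
        if ch.isspace():
            if start is not None:
                tokens.append(Token(start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        tokens.append(Token(start, len(sofa)))
    return tokens


def covered_text(sofa: str, token: Token) -> str:
    """Use the begin / end offsets to read the token's text from the sofa."""
    return sofa[token.begin:token.end]


def tag(sofa: str, tokens: List[Token]) -> None:
    """Downstream 'annotator': reads the existing tokens and adds a feature."""
    for token in tokens:
        text = covered_text(sofa, token)
        token.pos = "NUM" if text.isdigit() else "WORD"


sofa = "UIMA 2 rocks"        # the unstructured input (the sofa)
tokens = tokenize(sofa)      # structured data produced upstream
tag(sofa, tokens)            # structured data enriched downstream
```

The point of the sketch is that the sofa string itself never changes: the
tokenizer adds structured data that points into it, and the tagger works
from that structured data plus the original string.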

Does this clarify things?  If not, please ask more questions.

-Marshall
> Thanks
> --Klaus Rothenhäusler
