uima-user mailing list archives

From David Dearing <ddear...@stottlerhenke.com>
Subject How to tokenize during Annotator initialization?
Date Tue, 18 Aug 2009 19:15:30 GMT
Hi everyone,

I'm just getting started with UIMA and have poked through the docs and
the sandbox, but still have some questions on best/recommended practices.

A simple example of my question is with stop word processing of text.
Processing is broken up into Tokenizer -> Stemmer -> StopWordAnnotator.

The tokenizer and stemmer are straightforward.  We can create our own or
swap in modules such as the sandbox WhitespaceTokenizer or
SnowballAnnotator (stemming).

My concern is that during initialize(...) of the StopWordAnnotator I
load a resource file that contains the list of stop words.  These stop
words need to be tokenized and stemmed as well (probably in the same
manner as the previous steps, but perhaps configurable).

What is the best practice for doing this?  Should I specify an aggregate
analysis engine that runs over the stop word list within the
initialize() method?  That seems a bit strange (and might become quite
complicated as later annotators have more complex processing), but I
haven't yet seen examples for this type of complex, resource-based
annotator.
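
To make the requirement concrete, here is a minimal sketch in plain Java.
It deliberately uses no UIMA API: StopWordFilter, toyNormalize and the
"normalizer" function are hypothetical names that stand in for whatever
Tokenizer -> Stemmer chain the pipeline uses.  The only point it shows is
that the stop word list and the document text go through the same
normalization, once at load time and once per document.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch, not UIMA API: the normalizer stands in for the
// Tokenizer -> Stemmer steps that run before the stop word check.
class StopWordFilter {
    private final Function<String, List<String>> normalizer;
    private final Set<String> stopWords = new HashSet<>();

    // Called once, the way an annotator would do it in initialize():
    // each raw line of the stop word resource is pushed through the
    // same normalizer that is later applied to document text.
    StopWordFilter(List<String> rawStopWordLines,
                   Function<String, List<String>> normalizer) {
        this.normalizer = normalizer;
        for (String line : rawStopWordLines) {
            stopWords.addAll(normalizer.apply(line));
        }
    }

    // Normalizes the text and returns the tokens that are not stop words.
    List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : normalizer.apply(text)) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Toy normalizer (an assumption for illustration only): whitespace
    // tokenization, lowercasing, and dropping a trailing "s" from words
    // longer than three characters as a placeholder for real stemming.
    static List<String> toyNormalize(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String t = raw.toLowerCase(Locale.ROOT);
            if (t.length() > 3 && t.endsWith("s")) {
                t = t.substring(0, t.length() - 1);
            }
            out.add(t);
        }
        return out;
    }
}
```

With this shape, swapping the tokenizer or stemmer only means passing a
different normalizer; the open question is how to get the equivalent
behavior from UIMA components rather than a plain function.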

Thanks for taking the time to read/help!
Dave
