uima-user mailing list archives

From Michael Tanenblatt <sloth...@park-slope.net>
Subject Re: How to tokenize during Annotator initialization?
Date Tue, 18 Aug 2009 19:49:32 GMT
You can look at the way ConceptMapper tokenizes its dictionaries,
which are external resources and are tokenized when they are loaded.
The source is in the sandbox.
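
To illustrate the general idea of normalizing a resource at load time, here is a minimal sketch in plain Java with no UIMA dependency. It is not ConceptMapper's code or any UIMA API: StopWordSet and simpleNormalizer are hypothetical names, and the "stemmer" is a toy placeholder. The assumption is that whatever tokenizing/stemming step the pipeline uses can be passed in as a function, so the stop word list and the document tokens are normalized the same way.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: holds stop words that were normalized when loaded.
final class StopWordSet {
    private final Set<String> stopWords = new HashSet<>();

    // 'lines' stands in for the contents of the stop word resource file.
    // 'normalizer' turns raw text into normalized tokens; it should be the
    // same tokenize + stem step that is applied to the document text.
    StopWordSet(Iterable<String> lines, Function<String, List<String>> normalizer) {
        for (String line : lines) {
            stopWords.addAll(normalizer.apply(line));
        }
    }

    // Expects a token that has already been through the same normalizer.
    boolean isStopWord(String normalizedToken) {
        return stopWords.contains(normalizedToken);
    }

    // Toy placeholder normalizer (an assumption, not a real stemmer):
    // split on whitespace, lowercase, and strip one trailing "s" from
    // tokens longer than three characters.
    static List<String> simpleNormalizer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // blank line in the resource
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.length() > 3 && token.endsWith("s")) {
                token = token.substring(0, token.length() - 1);
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

In an annotator, the set would be built once during initialization and then consulted for each stemmed token during processing. Keeping the normalizer as a parameter is what makes the step configurable.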

On Aug 18, 2009, at 3:15 PM, David Dearing  
<ddearing@stottlerhenke.com> wrote:

> Hi everyone,
>
> I'm just getting started with UIMA and have poked through the docs and
> the sandbox, but still have some questions on best/recommended practices.
>
> A simple example of my question is with stop word processing of text.
> Processing is broken up into Tokenizer -> Stemmer -> StopWordAnnotator.
>
> The tokenizer and stemmer are straightforward.  We can create our own or
> swap in modules such as the sandbox WhitespaceTokenizer or
> SnowballAnnotator (stemming).
>
> My concern is that during initialize(...) of the StopWordAnnotator I
> load a resource file that contains the list of stop words.  These stop
> words need to be tokenized and stemmed as well (probably in the same
> manner as the previous steps, but perhaps configurable).
>
> What is the best practice for doing this?  Specifying an aggregate
> analysis engine that runs over the stop word list within the
> initialize() method?  That seems a bit strange (and would maybe be quite
> complicated as later annotators have more complex processing), but I
> haven't yet seen examples for this type of complex, resource-based
> annotator.
>
> Thanks for taking the time to read/help!
> Dave
