Message-ID: <4A8AFDD2.5070904@stottlerhenke.com>
Date: Tue, 18 Aug 2009 12:15:30 -0700
From: David Dearing
To: uima-user@incubator.apache.org
Subject: How to tokenize during Annotator initialization?

Hi everyone,

I'm just getting started with UIMA and have poked through the docs and the sandbox, but I still have some questions about best/recommended practices.

A simple example of my question is stop word processing of text. Processing is broken up into Tokenizer -> Stemmer -> StopWordAnnotator. The tokenizer and stemmer are straightforward: we can create our own or swap in modules such as the sandbox WhitespaceTokenizer or SnowballAnnotator (stemming).

My concern is that during initialize(...) of the StopWordAnnotator I load a resource file that contains the list of stop words. These stop words need to be tokenized and stemmed as well (probably in the same manner as the previous steps, but perhaps configurable).

What is the best practice for doing this? Should I specify an aggregate analysis engine that runs over the stop word list within the initialize() method? That seems a bit strange (and might get quite complicated as later annotators have more complex processing), but I haven't yet seen examples of this type of complex, resource-based annotator.

Thanks for taking the time to read/help!

Dave
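
To make the requirement above concrete, here is a minimal sketch in plain Java that uses no UIMA classes at all. It only illustrates why the stop word list has to go through the same tokenize/stem steps as the document text; it is not a recommended UIMA design. Everything in it is an assumption made for illustration: the class name StopWordSketch, the helpers tokenize, stem, loadStopWords and removeStopWords, and the toy "strip a trailing s" stemmer, which merely stands in for a real stemmer such as Snowball.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; none of these names come from UIMA.
final class StopWordSketch {

    // Stand-in for the tokenizer step: lowercase and split on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Toy stand-in for the stemmer step: drop one trailing "s" from
    // tokens longer than three characters. A real stemmer does far more.
    static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // What initialize() would need to do: push every line of the stop word
    // resource through the same tokenize + stem steps as the documents.
    static Set<String> loadStopWords(List<String> rawLines) {
        Set<String> stems = new HashSet<>();
        for (String line : rawLines) {
            for (String token : tokenize(line)) {
                stems.add(stem(token));
            }
        }
        return stems;
    }

    // What process() would need to do: compare document stems against
    // the normalized stop word stems and keep the rest.
    static List<String> removeStopWords(String document, Set<String> stopStems) {
        List<String> kept = new ArrayList<>();
        for (String token : tokenize(document)) {
            String s = stem(token);
            if (!stopStems.contains(s)) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

In this sketch, calling loadStopWords once and then removeStopWords per document mirrors the initialize()/process() split described above. If the stop word list were stored unstemmed while the document tokens were stemmed, lookups such as "houses" against a stop word "Houses" would silently miss, which is the mismatch the question is about.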