From: Thilo Goetz
Date: Tue, 27 Apr 2010 16:59:33 +0200
To: user@uima.apache.org
Subject: Re: Restrictions on sofa data array

On 4/27/2010 16:17, Eddie Epstein wrote:
> Hi,
>
> On Mon, Apr 26, 2010 at 9:56 AM, Klaus Rothenhäusler wrote:
>> That's the way I'm using UIMA right now. However, as practically all
>> downstream annotators work on tokens, I would find it much more
>> intuitive if I could assign annotations as indices into an array of
>> tokens. This is especially true for annotations spanning several
>> tokens where the input document contains additional markup. In this
>> case using the begin and end offsets of the first and last token the
>> annotation spans may include unwanted markup. It is clear to me that I
>> could define a view containing only the plain text but I'd rather work
>> on a string of tokens which for downstream processors I'd consider
>> just as much unstructured data as a string of characters is for a
>> tokenizer. Having the tokens stored in the data array would have the
>> benefit of efficient random access instead of having to iterate over
>> an annotation index.
>
> Is the string of tokens essentially the same as a detagged XML document?
> Creating a view where the Sofa is detagged text is a common scenario.
> It may be useful to keep a cross reference in the detagged text view between
> tokens in this view with the same tokens in the original plain text view.
>
> Does this fit with your scenario?
> Regards,
> Eddie

My understanding is that he wants the tokens as primitives, not the
characters. Annotation offsets could then be token offsets, not
character offsets. That's perfectly reasonable for some tasks.

We usually create annotations with the start offset being the start of
some token, and the end offset the end of some token. Then it's hard
to find the tokens that are "covered" by the annotation, which is why
we have subiterators, which are not super efficient. And so on.
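
To make the difference concrete, here is a rough sketch in plain Java. It does not use the UIMA API at all: the class `TokenSpans`, the method `coveredTokens`, and the sample text are made up for illustration. It assumes tokens are sorted and non-overlapping, keeps their character offsets in two parallel arrays, and finds the tokens covered by a character span with two binary searches. Once you have that, an annotation could be stored as a (first token, last token) pair.

```java
import java.util.Arrays;

// Hypothetical helper, not part of UIMA: maps a character span to token indices.
final class TokenSpans {

    // Returns {firstToken, lastToken} for all tokens fully inside [begin, end),
    // or null if the span covers no complete token.
    // tokenBegins/tokenEnds are parallel arrays, sorted, tokens do not overlap.
    static int[] coveredTokens(int[] tokenBegins, int[] tokenEnds, int begin, int end) {
        int first = firstAtLeast(tokenBegins, begin);       // first token starting at or after begin
        int last = firstGreaterThan(tokenEnds, end) - 1;    // last token ending at or before end
        if (first >= tokenBegins.length || last < first) {
            return null;
        }
        return new int[] {first, last};
    }

    // Smallest index i with a[i] >= key, or a.length if there is none.
    static int firstAtLeast(int[] a, int key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Smallest index i with a[i] > key, or a.length if there is none.
    static int firstGreaterThan(int[] a, int key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        // Sample text with markup: "<b>New York</b> is big"
        // Tokens: New [3,6), York [7,11), is [16,18), big [19,22)
        int[] begins = {3, 7, 16, 19};
        int[] ends = {6, 11, 18, 22};

        // Character span [0,15) includes the <b>...</b> markup,
        // but in token terms it is simply tokens 0..1.
        System.out.println(Arrays.toString(coveredTokens(begins, ends, 0, 15))); // [0, 1]
    }
}
```

With token indices as the primitive, the markup between tokens never enters the picture, and getting at token i is an array access rather than a walk over an index.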
I like the idea, but I have no idea how compatible it is with UIMA's
idea of views and sofas.

--Thilo