Subject: Re: Restrictions on sofa data array
From: Eddie Epstein
To: user@uima.apache.org
Date: Tue, 27 Apr 2010 10:17:19 -0400

Hi,

On Mon, Apr 26, 2010 at 9:56 AM, Klaus Rothenhäusler wrote:
> That's the way I'm using UIMA right now. However, as practically all
> downstream annotators work on tokens, I would find it much more
> intuitive if I could assign annotations as indices into an array of
> tokens. This is especially true for annotations spanning several
> tokens where the input document contains additional markup. In this
> case using the begin and end offsets of the first and last token the
> annotation spans may include unwanted markup. It is clear to me that I
> could define a view containing only the plain text but I'd rather work
> on a string of tokens which for downstream processors I'd consider
> just as much unstructured data as a string of characters is for a
> tokenizer. Having the tokens stored in the data array would have the
> benefit of efficient random access instead of having to iterate over
> an annotation index.

Is the string of tokens essentially the same as a detagged XML document?
Creating a view where the Sofa is detagged text is a common scenario.
It may be useful to keep a cross reference in the detagged text view
between tokens in this view and the same tokens in the original plain
text view. Does this fit with your scenario?

Regards,
Eddie
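
A minimal sketch of the cross reference idea, in plain Java and without any
UIMA calls. Everything here is an assumption made for illustration: the class
and method names (DetagSketch, Detagged, Token, detag, tokenize) are
hypothetical, the detagger only skips text between '<' and '>', and the
tokenizer only splits on whitespace. In a real pipeline the detagged string
would be the Sofa of a second view and the tokens would be annotations in
that view.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DetagSketch {

    /** Detagged text plus, for each detagged character, its offset in the original text. */
    public static final class Detagged {
        public final String text;
        public final int[] originalOffset;

        public Detagged(String text, int[] originalOffset) {
            this.text = text;
            this.originalOffset = originalOffset;
        }
    }

    /** A token with offsets into the detagged text and into the original text. */
    public static final class Token {
        public final int begin, end;          // offsets in the detagged text
        public final int origBegin, origEnd;  // offsets in the original text

        public Token(int begin, int end, int origBegin, int origEnd) {
            this.begin = begin;
            this.end = end;
            this.origBegin = origBegin;
            this.origEnd = origEnd;
        }
    }

    /** Naive detagger (assumption): drops everything between '<' and '>'. */
    public static Detagged detag(String original) {
        StringBuilder sb = new StringBuilder();
        int[] map = new int[original.length()];
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                map[sb.length()] = i;  // remember where this character came from
                sb.append(c);
            }
        }
        return new Detagged(sb.toString(), Arrays.copyOf(map, sb.length()));
    }

    /** Whitespace tokenizer (assumption) that records both sets of offsets. */
    public static List<Token> tokenize(Detagged d) {
        List<Token> tokens = new ArrayList<>();
        int n = d.text.length();
        int i = 0;
        while (i < n) {
            while (i < n && Character.isWhitespace(d.text.charAt(i))) i++;
            if (i >= n) break;
            int b = i;
            while (i < n && !Character.isWhitespace(d.text.charAt(i))) i++;
            tokens.add(new Token(b, i, d.originalOffset[b], d.originalOffset[i - 1] + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String original = "The <b>quick</b> fox";
        Detagged d = detag(original);
        List<Token> tokens = tokenize(d);  // a list gives random access by token index
        for (Token t : tokens) {
            System.out.println(d.text.substring(t.begin, t.end)
                    + " -> original [" + t.origBegin + "," + t.origEnd + ")");
        }
    }
}
```

With this layout, a span over several tokens in the detagged text contains no
markup, while each token can still be traced back to its position in the
original document.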