Mailing-List: contact user-help@uima.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@uima.apache.org
Received-SPF: pass (nike.apache.org: domain of twgoetz@gmx.de designates
 213.165.64.20 as permitted sender)
Message-ID: <4BD6FBD5.1090900@gmx.de>
Date: Tue, 27 Apr 2010 16:59:33 +0200
From: Thilo Goetz <twgoetz@gmx.de>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
 rv:1.9.1.9) Gecko/20100317 Lightning/1.0b1 Thunderbird/3.0.4
MIME-Version: 1.0
To: user@uima.apache.org
Subject: Re: Restrictions on sofa data array
References: <loom.20100426T105447-208@post.gmane.org>
	 <4BD5836A.50603@schor.com> <loom.20100426T143907-932@post.gmane.org>
 <o2zcd6edfd31004270717k6e0b16ffifae3e9f21bcf22e4@mail.gmail.com>
In-Reply-To: <o2zcd6edfd31004270717k6e0b16ffifae3e9f21bcf22e4@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit

On 4/27/2010 16:17, Eddie Epstein wrote:
> Hi,
> 
> On Mon, Apr 26, 2010 at 9:56 AM, Klaus Rothenhäusler <rothenha@gmail.com> wrote:
>> That's the way I'm using UIMA right now. However, as practically all
>> downstream annotators work on tokens, I would find it much more
>> intuitive if I could assign annotations as indices into an array of
>> tokens. This is especially true for annotations spanning several
>> tokens where the input document contains additional markup. In this
>> case using the begin and end offsets of the first and last token the
>> annotation spans may include unwanted markup. It is clear to me that I
>> could define a view containing only the plain text but I'd rather work
>> on a string of tokens which for downstream processors I'd consider
>> just as much unstructured data as a string of characters is for a
>> tokenizer. Having the tokens stored in the data array would have the
>> benefit of efficient random access instead of having to iterate over
>> an annotation index.
> 
> Is the string of tokens essentially the same as a detagged XML document?
> Creating a view where the Sofa is detagged text is a common scenario.
> It may be useful to keep a cross reference between tokens in the detagged
> text view and the same tokens in the original plain text view.
> 
> Does this fit with your scenario?
> Regards,
> Eddie

My understanding is that he wants tokens, not characters, as
the primitives.  Annotation offsets could then be token
offsets rather than character offsets.  That's perfectly
reasonable for some tasks.  We usually create annotations
whose start offset is the start of some token and whose end
offset is the end of some token.  It's then hard to find the
tokens that are "covered" by the annotation, which is why we
have subiterators, and those are not super efficient.
I like the idea, but I don't know how compatible it is with
UIMA's notion of views and sofas.
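
To make the difference concrete, here is a rough sketch in plain
Java.  It does not use the UIMA API at all; TokenSpanSketch, Token
and the sample text are made up for illustration.  It contrasts
finding the covered tokens by character offsets (a scan over all
tokens) with token offsets (a direct slice of the token list), and
shows how a character span can pick up markup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these types belong to the UIMA API.
final class TokenSpanSketch {

    // Sample document with markup between the first two tokens.
    static final String TEXT = "<b>New</b> York is big";

    /** A token plus its character offsets in TEXT. */
    static final class Token {
        final String text;
        final int begin; // inclusive character offset
        final int end;   // exclusive character offset

        Token(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hand-computed tokens for TEXT (an assumption, no real tokenizer). */
    static List<Token> sampleTokens() {
        return Arrays.asList(
            new Token("New", 3, 6),
            new Token("York", 11, 15),
            new Token("is", 16, 18),
            new Token("big", 19, 22));
    }

    /** Character offsets: scan every token and keep those inside [begin, end). */
    static List<String> coveredByCharOffsets(List<Token> tokens, int begin, int end) {
        List<String> covered = new ArrayList<>();
        for (Token t : tokens) {
            if (t.begin >= begin && t.end <= end) {
                covered.add(t.text);
            }
        }
        return covered;
    }

    /** Token offsets: [first, endExclusive) indexes the token list directly. */
    static List<String> coveredByTokenOffsets(List<Token> tokens, int first, int endExclusive) {
        List<String> covered = new ArrayList<>();
        for (Token t : tokens.subList(first, endExclusive)) {
            covered.add(t.text);
        }
        return covered;
    }

    public static void main(String[] args) {
        List<Token> tokens = sampleTokens();
        // Character span from the start of "New" to the end of "York".
        System.out.println(TEXT.substring(3, 15));
        System.out.println(coveredByCharOffsets(tokens, 3, 15));
        // The same annotation expressed as token offsets 0..2.
        System.out.println(coveredByTokenOffsets(tokens, 0, 2));
    }
}
```

The character span covering "New York" here also contains the
"</b>" markup, while the token span [0, 2) does not.  Whether
something like this could be mapped onto sofas is exactly the
open question.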

--Thilo