Date: Fri, 13 Jul 2007 12:43:59 -0400
From: Shahan Khatchadourian <shahan@cs.toronto.edu>
To: java-user@lucene.apache.org
Subject: Token offset values for custom Tokenizer

Hi,
I am storing custom offset values in the Tokens produced by my own
Tokenizer, but when I retrieve them from the index the values don't match.
I've looked in the Lucene in Action (LIA) book, but it isn't current on
this point, since it says term vectors aren't stored. I'm using Lucene
Nightly 146, but the same thing has happened with older versions. Looking
at the internals, DocumentWriter seems to keep track of the end offset that
was placed into the index and modifies the token values (with +1), but I'm
not sure whether I need to account for that.
No existing analyzers are used when adding the document, so all the
offsets are generated manually.
Any suggestions on how the token offsets should be stored?

Is this valid?
Token, start, end
aaa, 0, 3
bbb, 4, 7
ccc, 8, 11
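
To make the convention behind the table explicit, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the source text is "aaa bbb ccc" with single spaces, that start is the index of a token's first character, and that end is one past the index of its last character, so that source.substring(start, end) gives back the token text. The class OffsetSketch, the holder Span, and the method tokenize are hypothetical names used only for illustration; they are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {
    /** Hypothetical holder for a term and its character offsets. */
    static final class Span {
        final String text;
        final int start; // index of the first character of the token
        final int end;   // one past the index of the last character

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and records offsets into the original string. */
    static List<Span> tokenize(String source) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            // Skip any whitespace before the next token.
            while (i < source.length() && Character.isWhitespace(source.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < source.length() && !Character.isWhitespace(source.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new Span(source.substring(start, i), start, i));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumed input; prints one "token, start, end" line per token.
        for (Span s : tokenize("aaa bbb ccc")) {
            System.out.println(s.text + ", " + s.start + ", " + s.end);
        }
    }
}
```

Under these assumptions the sketch produces the same three rows as the table above, and the gap of one between consecutive tokens (3 to 4, 7 to 8) is just the separating space.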

Thanks,
Shahan
