lucene-java-user mailing list archives

From Shahan Khatchadourian <>
Subject Re: Token offset values for custom Tokenizer
Date Mon, 16 Jul 2007 15:33:36 GMT
Thank you for the reply, Ard,

The tokens exist in the index and are returned accurately, except for 
the offsets. In this case I am not dealing with positions, so the 
term vector is specified as 'with_offsets'. I have left the term 
position increment at its default. Looking at the existing TokenStreams, 
they don't keep track of the current position; they always generate 
start offsets beginning at 0 for the current stream, and a 'proper' 
offset is then generated from the +1 that the DocumentWriter applies 
to the previous token when indexing. Nor are there any test cases for 
offsets. I found a bug that was opened a while ago dealing with this 
issue (as well as a related one). It is:

I am retrieving a text token's offset values using 
TermPositionVector.getOffsets(), which returns TermVectorOffsetInfo[]. 
The offset values that were placed into the token during indexing are 
not being returned; they have been shifted.
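
The shift can be modelled with a small, self-contained sketch. It does 
not use Lucene and is not DocumentWriter's actual code: Tok and 
storedOffsets are hypothetical names, and the running-base rule (advance 
by the previous stream's last end offset + 1) is an assumption taken 
from the description in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetShiftSketch {
    /** Hypothetical stand-in for a token: term text plus start/end offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Assumed writer behaviour: every stream's offsets are treated as
     * relative to 0, and a running base is added to them. After each
     * stream the base advances by (last token's end offset + 1).
     */
    static List<int[]> storedOffsets(List<List<Tok>> streams) {
        List<int[]> stored = new ArrayList<>();
        int base = 0;
        for (List<Tok> stream : streams) {
            Tok last = null;
            for (Tok t : stream) {
                stored.add(new int[] { base + t.start, base + t.end });
                last = t;
            }
            if (last != null) {
                base += last.end + 1;
            }
        }
        return stored;
    }

    static void print(String label, List<int[]> offsets) {
        StringBuilder sb = new StringBuilder(label);
        for (int[] o : offsets) {
            sb.append(' ').append(o[0]).append('-').append(o[1]);
        }
        System.out.println(sb);
    }

    public static void main(String[] args) {
        List<Tok> first = Arrays.asList(new Tok("aaa", 0, 3), new Tok("bbb", 4, 7));

        // Second stream restarts at 0: stored offsets continue cleanly.
        List<Tok> relative = Arrays.asList(new Tok("ccc", 0, 3));
        print("relative:", storedOffsets(Arrays.asList(first, relative)));
        // prints "relative: 0-3 4-7 8-11"

        // Second stream already carries absolute offsets: they get shifted.
        List<Tok> absolute = Arrays.asList(new Tok("ccc", 8, 11));
        print("absolute:", storedOffsets(Arrays.asList(first, absolute)));
        // prints "absolute: 0-3 4-7 16-19"
    }
}
```

Under that assumption, a custom Tokenizer that emits offsets which are 
already absolute across several streams of the same field would see 
them come back shifted by the running base, while offsets that restart 
at 0 for each stream would come back as a continuous sequence.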

Ard Schrijvers wrote:
> Hello,
>> Hi,
>> I am storing custom values in the Tokens provided by a Tokenizer but 
>> when retrieving them from the index the values don't match. 
> What do you mean by retrieving? Do you mean retrieving terms, or do you
> mean doing a search with words you know that should be in, but you do
> not find a match?
> In the latter, you must make sure that you are using the same analyzer
> for the search as you used for indexing.
>> I've looked in the LIA book but it's not current since it mentioned
>> term vectors aren't stored. I'm using Lucene Nightly 146 but the same
>> thing has happened with older versions. Looking at the internals,
>> DocumentWriter seems to keep track of the end offset that was placed
>> into the index and modifies the token values (with +1) but I'm not
>> sure whether I should be concerned with it.
>> No existing analyzers are used when adding the document so all the 
>> offsets are generated manually.
>> Any suggestions of how the token offsets should be stored?
> Look at other classes that implement TokenStream. Also take a look at
> setPositionIncrement when you are putting in your own terms
> Regards Ard
>> Is this valid?
>> Token, start, end
>> aaa, 0, 3
>> bbb, 4, 7
>> ccc, 8, 11
>> Thanks,
>> Shahan
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail: