lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shahan Khatchadourian <sha...@cs.toronto.edu>
Subject Re: Token offset values for custom Tokenizer
Date Mon, 16 Jul 2007 15:46:58 GMT
The issue continues to exist with nightly 146 from Jul 10, 2007.

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/146/


Ard Schrijvers wrote:
> Hello,
>
> The issue is about lucene 1.9. Can you test it with lucene 2.2? Perhaps the issue is
already addressed and solved...
>
> Regards Ard
>
>   
>> Thank you for the reply Ard,
>>
>> The tokens exist in the index and are returned accurately, except for 
>> the offsets. In this case I am not dealing with the positions, so the 
>> termvector is specified as using 'with_offsets'. I have left the term 
>> position incrememt as its default. Looking at the existing 
>> tokenstreams, 
>> they don't maintain knowledge of the current position, they always 
>> generate values startoffsets beginning at 0 of the current 
>> stream, and 
>> then a 'proper' offset is generated based on the +1 of the previous 
>> token the DocumentWriter applies when indexeding. Nor are 
>> there any test 
>> cases for offsets. I found a bug that was opened a while ago dealing 
>> with this issue (as well as related one). It is:
>> https://issues.apache.org/jira/browse/LUCENE-579
>>
>> I am retrieving the a text token's offset values using 
>> TermPositionVector.getOffsets() which returns TermVectorOffsetInfo[]. 
>> The same offset values that were placed into the token during 
>> indexing 
>> are not being returned, they have been shifted.
>> Thanks.
>> Shahan
>>
>> Ard Schrijvers wrote:
>>     
>>> Hello,
>>>
>>>   
>>>       
>>>> Hi,
>>>> I am storing custom values in the Tokens provided by a 
>>>>         
>> Tokenizer but 
>>     
>>>> when retrieving them from the index the values don't match. 
>>>>     
>>>>         
>>> What do you mean by retrieving? Do you mean retrieving 
>>>       
>> terms, or do you mean doing a search with words you know that 
>> should be in, but you do not find a match?
>>     
>>> In the latter, you must make sure that you are using the 
>>>       
>> same analyzer for the search as you used for indexing. 
>>     
>>>   
>>>       
>>>> I've looked 
>>>> in the LIA book but it's not current since it mentioned 
>>>>         
>> term vectors 
>>     
>>>> aren't stored. I'm using Lucene Nightly 146 but the same thing has 
>>>> happened with older versions. Looking at the internals, 
>>>> DocumentWriter 
>>>> seems to keep track of the end offset that was placed into 
>>>> the index and 
>>>> modifies the token values (with +1) but I'm not sure whether 
>>>> I should be 
>>>> concerned with it.
>>>> No existing analyzers are used when adding the document so all the 
>>>> offsets are generated manually.
>>>> Any suggestions of how the token offsets should be stored?
>>>>
>>>>     
>>>>         
>>> Look at other clases that implement TokenStream. Also take 
>>>       
>> a look at setPositionIncrement when you are putting in your own terms
>>     
>>> Regards Ard
>>>
>>>   
>>>       
>>>> Is this valid?
>>>> Token, start, end
>>>> aaa, 0, 3
>>>> bbb, 4, 7
>>>> ccc, 8, 11
>>>>
>>>> Thanks,
>>>> Shahan
>>>>
>>>>
>>>>         
>> ---------------------------------------------------------------------
>>     
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>     
>>>>         
>>>       
>> ---------------------------------------------------------------------
>>     
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>   
>>>       
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message