The issue continues to exist with nightly 146 from Jul 10, 2007.
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/146/
Ard Schrijvers wrote:
> Hello,
>
> The issue is about lucene 1.9. Can you test it with lucene 2.2? Perhaps the issue is
already addressed and solved...
>
> Regards Ard
>
>
>> Thank you for the reply Ard,
>>
>> The tokens exist in the index and are returned accurately, except for
>> the offsets. In this case I am not dealing with the positions, so the
>> termvector is specified as using 'with_offsets'. I have left the term
>> position incrememt as its default. Looking at the existing
>> tokenstreams,
>> they don't maintain knowledge of the current position, they always
>> generate values startoffsets beginning at 0 of the current
>> stream, and
>> then a 'proper' offset is generated based on the +1 of the previous
>> token the DocumentWriter applies when indexeding. Nor are
>> there any test
>> cases for offsets. I found a bug that was opened a while ago dealing
>> with this issue (as well as related one). It is:
>> https://issues.apache.org/jira/browse/LUCENE-579
>>
>> I am retrieving the a text token's offset values using
>> TermPositionVector.getOffsets() which returns TermVectorOffsetInfo[].
>> The same offset values that were placed into the token during
>> indexing
>> are not being returned, they have been shifted.
>> Thanks.
>> Shahan
>>
>> Ard Schrijvers wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>>> Hi,
>>>> I am storing custom values in the Tokens provided by a
>>>>
>> Tokenizer but
>>
>>>> when retrieving them from the index the values don't match.
>>>>
>>>>
>>> What do you mean by retrieving? Do you mean retrieving
>>>
>> terms, or do you mean doing a search with words you know that
>> should be in, but you do not find a match?
>>
>>> In the latter, you must make sure that you are using the
>>>
>> same analyzer for the search as you used for indexing.
>>
>>>
>>>
>>>> I've looked
>>>> in the LIA book but it's not current since it mentioned
>>>>
>> term vectors
>>
>>>> aren't stored. I'm using Lucene Nightly 146 but the same thing has
>>>> happened with older versions. Looking at the internals,
>>>> DocumentWriter
>>>> seems to keep track of the end offset that was placed into
>>>> the index and
>>>> modifies the token values (with +1) but I'm not sure whether
>>>> I should be
>>>> concerned with it.
>>>> No existing analyzers are used when adding the document so all the
>>>> offsets are generated manually.
>>>> Any suggestions of how the token offsets should be stored?
>>>>
>>>>
>>>>
>>> Look at other clases that implement TokenStream. Also take
>>>
>> a look at setPositionIncrement when you are putting in your own terms
>>
>>> Regards Ard
>>>
>>>
>>>
>>>> Is this valid?
>>>> Token, start, end
>>>> aaa, 0, 3
>>>> bbb, 4, 7
>>>> ccc, 8, 11
>>>>
>>>> Thanks,
>>>> Shahan
>>>>
>>>>
>>>>
>> ---------------------------------------------------------------------
>>
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>
>> ---------------------------------------------------------------------
>>
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
|