lucene-java-user mailing list archives

From Renaud Delbru <renaud.del...@deri.org>
Subject Re: Incorrect Token Offset when using multiple fieldable instance
Date Wed, 05 Mar 2008 09:52:30 GMT
Do you know whether there would be any side effects if, in 
DocumentWriter$FieldData#invertField, we replaced
offset = offsetEnd+1;
with
offset = stringValue.length();

I still do not understand why the start offset is incremented this way.

Regards.

Michael McCandless wrote:
>
> This is how Lucene has worked for quite some time (since 1.9).
>
> When there are multiple fields with the same name in one Document, 
> each field's offset starts from the last offset (offset of the last 
> token) seen in the previous field.  If tokens are skipped at the end 
> there's no way IndexWriter can know (because tokenStream doesn't 
> return them).  It's as if we need the ability to query a tokenStream 
> for its "final" offset or something.
>
> One workaround might be to insert an "end marker" token, with the true 
> end offset, which is a term you would never search on?
>
> Mike
>
> Renaud Delbru wrote:
>
>> Hi,
>>
>> I currently use multiple fieldable instances for indexing sentences 
>> of a document.
>> When there is only a single fieldable instance, the token offset 
>> generation performed in DocumentWriter is correct.
>> The problem appears when there are two or more fieldable instances. In 
>> the DocumentWriter$FieldData#invertField method, if the field is 
>> tokenized, instead of updating the offset attribute with 
>> stringValue.length() (which is what happens if the field is not 
>> tokenized, line 1458), you update the offset attribute with the end 
>> offset of the last token (line 1503: offset = offsetEnd+1;).
>> As a consequence, if a token has been filtered out (for example a 
>> stopword, a dot, a space, etc.), the offset attribute is updated with 
>> the end offset of the last token that was not filtered. In this case, 
>> you store an incorrect offset in the offset attribute (the offset is 
>> shifted back), and all the following fieldable instances will have 
>> their offsets shifted back.
>>
>> Is this a bug? Or is it the intended behavior (and if so, why)?
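
To make the drift described above concrete, here is a minimal, self-contained sketch. It does not use Lucene at all: the class OffsetDriftDemo, its toy tokenizer, the stop-word list, and the sample sentences are all hypothetical, and the update rule (offsetEnd = base offset + end offset of the last emitted token, then offset = offsetEnd+1) is only an assumption modeled on the description in this thread. It compares the base offset each field instance would receive under the two update strategies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the two offset-update strategies; NOT Lucene code.
public class OffsetDriftDemo {

    // A token with start/end character offsets relative to its own field value.
    static final class Token {
        final String text;
        final int start;
        final int end; // exclusive

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("it", "the", "a"));

    // Toy analyzer: splits on non-alphanumeric characters and drops stop words.
    static List<Token> tokenize(String value) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            if (!Character.isLetterOrDigit(value.charAt(i))) {
                i++; // punctuation and spaces never become tokens
                continue;
            }
            int start = i;
            while (i < value.length() && Character.isLetterOrDigit(value.charAt(i))) {
                i++;
            }
            String text = value.substring(start, i).toLowerCase();
            if (!STOP_WORDS.contains(text)) {
                tokens.add(new Token(text, start, i));
            }
        }
        return tokens;
    }

    // Returns the base offset assigned to each field instance.
    // useValueLength == false: offset = offsetEnd + 1 (last emitted token).
    // useValueLength == true:  offset advances by the full value length.
    static int[] baseOffsets(String[] values, boolean useValueLength) {
        int[] bases = new int[values.length];
        int offset = 0;
        for (int f = 0; f < values.length; f++) {
            bases[f] = offset;
            if (useValueLength) {
                offset += values[f].length();
            } else {
                int offsetEnd = offset - 1; // leaves offset unchanged if no token is emitted
                for (Token t : tokenize(values[f])) {
                    offsetEnd = offset + t.end;
                }
                offset = offsetEnd + 1;
            }
        }
        return bases;
    }

    public static void main(String[] args) {
        String[] sentences = {"Lucene indexes it.", "Offsets drift.", "End."};
        // Trailing "it." is filtered out of the first sentence, so later bases fall short.
        System.out.println(Arrays.toString(baseOffsets(sentences, false))); // [0, 15, 29]
        System.out.println(Arrays.toString(baseOffsets(sentences, true)));  // [0, 18, 32]
    }
}
```

In this sketch the first sentence is 18 characters long, but its last emitted token ends at 14, so the second instance starts at 15 instead of 18, and the shortfall carries over to every later instance. Note that the sketch accumulates the value length across instances; a plain assignment of stringValue.length() would not accumulate.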
-- 
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/


