lucene-java-user mailing list archives

From Renaud Delbru <renaud.del...@deri.org>
Subject Incorrect Token Offset when using multiple fieldable instance
Date Tue, 04 Mar 2008 18:04:48 GMT
Hi,

I currently use multiple fieldable instances to index the sentences of a
document.
When there is only a single fieldable instance, the token offsets
generated in DocumentWriter are correct.
The problem appears when there are two or more fieldable instances. In
the DocumentWriter$FieldData#invertField method, if the field is
tokenized, the offset attribute is not advanced by stringValue.length()
(as is done when the field is not tokenized, line 1458). Instead, it is
set from the end offset of the last token (line 1503:
offset = offsetEnd+1;).
As a consequence, if the final token of an instance has been filtered
out (for example a stopword, a dot, a space, etc.), the offset attribute
is updated with the end offset of the last token that was not filtered.
In this case the stored offset is too small (it is shifted back), and
all subsequent fieldable instances have their offsets shifted back as
well.
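
To make the drift concrete, here is a minimal standalone sketch. It does
not use any Lucene classes: the whitespace tokenizer, the stopword list,
the class name OffsetDriftSketch and the method baseOffsets are all
hypothetical stand-ins of mine, and only the two update rules
(offsetEnd+1 versus stringValue.length()) are taken from the description
above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OffsetDriftSketch {
    // Hypothetical stopword list; stands in for whatever the analyzer filters out.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    /**
     * Returns the base offset assigned to each field instance.
     * advanceByLength = true : base advances by the full string length.
     * advanceByLength = false: base becomes (end offset of last kept token) + 1.
     */
    public static int[] baseOffsets(String[] values, boolean advanceByLength) {
        int[] bases = new int[values.length];
        int base = 0;
        for (int i = 0; i < values.length; i++) {
            bases[i] = base;
            String value = values[i];
            // If no token survives filtering, base stays unchanged under the last-token rule.
            int lastKeptEnd = base - 1;
            int pos = 0;
            while (pos < value.length()) {
                // Skip separators, then scan one whitespace-delimited token.
                while (pos < value.length() && value.charAt(pos) == ' ') pos++;
                int start = pos;
                while (pos < value.length() && value.charAt(pos) != ' ') pos++;
                // Only tokens that pass the filter update the last end offset.
                if (pos > start && !STOPWORDS.contains(value.substring(start, pos))) {
                    lastKeptEnd = base + pos;
                }
            }
            base = advanceByLength ? base + value.length() : lastKeptEnd + 1;
        }
        return bases;
    }

    public static void main(String[] args) {
        String[] sentences = {"Lucene indexes the", "documents quickly", "x"};
        System.out.println(Arrays.toString(baseOffsets(sentences, false))); // [0, 15, 33]
        System.out.println(Arrays.toString(baseOffsets(sentences, true)));  // [0, 18, 35]
    }
}
```

In this sketch the first sentence ends with the filtered stopword "the",
so under the last-token rule the second instance starts at 15 instead of
18, and the third instance inherits the same shortfall.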

Is this a bug? Or is it the desired behavior (and if so, why)?

Regards.

-- 
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org