lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Incorrect Token Offset when using multiple fieldable instance
Date Wed, 05 Mar 2008 09:15:36 GMT

This is how Lucene has worked for quite some time (since 1.9).

When there are multiple fields with the same name in one Document,
each field's offsets continue from the last offset (the end offset of
the last token) seen in the previous field.  If tokens are skipped at
the end, IndexWriter has no way of knowing (because the TokenStream
doesn't return them).  It's as if we need the ability to query a
TokenStream for its "final" offset or something.
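
To make the drift concrete, here is a minimal sketch in plain Java. It
does not use any Lucene classes: the toy analyzer, the stopword list and
the nextBase helper are all hypothetical stand-ins, and the only thing
taken from this thread is the update rule quoted below
(offset = offsetEnd+1). Leaving the base unchanged when no token
survives is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OffsetDrift {
    // Hypothetical stopword list, for illustration only.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "it"));

    /** Toy analyzer: returns {start, end} pairs (relative to the value) for tokens that survive filtering. */
    static List<int[]> analyze(String value) {
        List<int[]> kept = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            // Skip separators and punctuation; they never become tokens.
            while (i < value.length() && !Character.isLetterOrDigit(value.charAt(i))) i++;
            int start = i;
            while (i < value.length() && Character.isLetterOrDigit(value.charAt(i))) i++;
            if (i > start && !STOPWORDS.contains(value.substring(start, i).toLowerCase())) {
                kept.add(new int[] {start, i});
            }
        }
        return kept;
    }

    /** Base offset for the next same-named field, using the rule from the thread: offset = offsetEnd + 1. */
    static int nextBase(int base, String value) {
        List<int[]> tokens = analyze(value);
        if (tokens.isEmpty()) return base; // assumption: no surviving token, no advance
        int lastEnd = tokens.get(tokens.size() - 1)[1];
        return base + lastEnd + 1;
    }

    public static void main(String[] args) {
        String first = "Lucene indexes it.";
        // "it" is a stopword and "." is punctuation, so the last surviving token is "indexes" (end offset 14).
        System.out.println("base used for next field:   " + nextBase(0, first));
        System.out.println("base from full text length: " + (first.length() + 1));
    }
}
```

With this input the next field starts at 15, while a base derived from
the full text would be 19, so every token of the second field is
reported 4 characters too early, and the error accumulates over
further fields.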

One workaround might be to append an "end marker" token that carries
the true end offset, using a term you would never search on?
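
A rough sketch of that idea, again in plain Java rather than against
Lucene's actual TokenStream API: the Tok class, the toy analyzer, the
END_MARKER term and analyzeWithEndMarker are all hypothetical names.
The point is only that a final zero-width token positioned at the true
end of the text makes the "last token's end offset" equal to the text
length.

```java
import java.util.ArrayList;
import java.util.List;

class EndMarkerSketch {
    // Hypothetical marker term; choose something no query would ever contain.
    static final String END_MARKER = "\u0000end";

    static final class Tok {
        final String term;
        final int start;
        final int end;
        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Toy analyzer: splits on non-alphanumerics, so trailing punctuation yields no token. */
    static List<Tok> analyze(String value) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            while (i < value.length() && !Character.isLetterOrDigit(value.charAt(i))) i++;
            int start = i;
            while (i < value.length() && Character.isLetterOrDigit(value.charAt(i))) i++;
            if (i > start) tokens.add(new Tok(value.substring(start, i), start, i));
        }
        return tokens;
    }

    /** Same tokens, plus a zero-width marker whose offsets sit at the true end of the value. */
    static List<Tok> analyzeWithEndMarker(String value) {
        List<Tok> tokens = analyze(value);
        tokens.add(new Tok(END_MARKER, value.length(), value.length()));
        return tokens;
    }

    /** Base offset for the next same-named field: end offset of the last token, plus one. */
    static int nextBase(int base, List<Tok> tokens) {
        if (tokens.isEmpty()) return base;
        return base + tokens.get(tokens.size() - 1).end + 1;
    }

    public static void main(String[] args) {
        String value = "Offsets drift...";
        System.out.println("without marker: " + nextBase(0, analyze(value)));
        System.out.println("with marker:    " + nextBase(0, analyzeWithEndMarker(value)));
    }
}
```

For "Offsets drift..." (16 characters) the base for the next field is
14 without the marker and 17 with it. The trade-off is that the marker
gets indexed like any other term, so it has to be a term that can never
match a real query.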

Mike

Renaud Delbru wrote:

> Hi,
>
> I currently use multiple Fieldable instances to index the sentences
> of a document.
> When there is only a single Fieldable instance, the token offsets
> generated in DocumentWriter are correct.
> The problem appears when there are two or more Fieldable instances.
> In the DocumentWriter$FieldData#invertField method, if the field is
> tokenized, instead of updating the offset attribute with
> stringValue.length() (which is what happens if the field is not
> tokenized, line 1458), you update the offset attribute with the end
> offset of the last token (line 1503: offset = offsetEnd+1;).
> As a consequence, if a token has been filtered out (for example a
> stopword, a dot, a space, etc.), the offset attribute is updated
> with the end offset of the last token that was not filtered. In this
> case, you store an incorrect offset in the offset attribute (the
> offset is shifted back), and all subsequent Fieldable instances will
> have their offsets shifted back.
>
> Is it a bug? Or is it desired behavior (and in that case, why)?
>
> Regards.
>
> -- 
> Renaud Delbru,
> E.C.S., Ph.D. Student,
> Semantic Information Systems and
> Language Engineering Group (SmILE),
> Digital Enterprise Research Institute,
> National University of Ireland, Galway.
> http://smile.deri.ie/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



