lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Incorrect Token Offset when using multiple fieldable instance
Date Wed, 05 Mar 2008 11:09:50 GMT

Well, first off, sometimes the thing being indexed isn't a string, so  
you have no stringValue to get its length.  It could be a Reader or a  
TokenStream.

Second off, it's conceivable that an analyzer computes its own  
"interesting" offsets that are not in fact simple indices into the  
stringValue, though I would expect that to be the exception not the  
rule.

I can't think of any other harm ... so if neither of these applies in  
your situation then I think it should be OK.
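
To make the first caveat concrete, here is a minimal sketch in plain Java (no Lucene classes). The class FieldOffsetAdvance and the helper nextBase are hypothetical names, not Lucene API. The sketch advances by the full length only when the value is a String and otherwise falls back to the last token's end offset. The "+ 1" gap in the String branch is an assumption, carried over from the existing rule so that both branches agree when no token is filtered.

```java
import java.io.Reader;
import java.io.StringReader;

class FieldOffsetAdvance {

    /**
     * Hypothetical helper: picks the start offset for the next field instance.
     * lastTokenEnd is the end offset of the last token the analyzer returned,
     * relative to the current value.
     */
    static int nextBase(int base, Object value, int lastTokenEnd) {
        if (value instanceof String) {
            // Length is known up front, so filtered trailing tokens do not matter.
            // The "+ 1" gap is an assumption (see the note above the code).
            return base + ((String) value).length() + 1;
        }
        // Reader or token stream: no length available, so fall back to the
        // rule discussed in this thread (offset = offsetEnd + 1).
        return base + lastTokenEnd + 1;
    }

    public static void main(String[] args) {
        String text = "quick fox the";   // 13 characters
        int lastKeptEnd = 9;             // end of "fox" if "the" is filtered out
        Reader reader = new StringReader(text);

        System.out.println(nextBase(0, text, lastKeptEnd));   // length-based
        System.out.println(nextBase(0, reader, lastKeptEnd)); // token-based
    }
}
```

With the same text, the String branch yields 14 and the Reader branch yields 10, which is why a length-based rule cannot simply replace the token-based one everywhere.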

I do agree this seems like a bug.  E.g., if you use Highlighter on a  
multi-valued field indexed with stored field & term vectors, and the  
first field instance ended with a stop word that was filtered out, then  
your offsets will be off and the wrong parts will be highlighted in  
all but the first instance (I think?).  I think we really need some way  
for the tokenStream to "declare" its final offset at the end.
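
The drift, and the "end marker" workaround mentioned further down the thread, can be illustrated with a small self-contained simulation. This is only a sketch in plain Java: Tok, analyze, analyzeWithEndMarker and the "_END_" term are hypothetical stand-ins, not Lucene classes, and nextBase merely mirrors the offset = offsetEnd+1 rule quoted in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class OffsetDriftSketch {

    /** Hypothetical stand-in for a token; offsets are relative to its own field value. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Toy analyzer: splits on single spaces and silently drops stop words. */
    static List<Tok> analyze(String value, Set<String> stopWords) {
        List<Tok> kept = new ArrayList<Tok>();
        int pos = 0;
        for (String word : value.split(" ")) {
            int end = pos + word.length();
            if (!word.isEmpty() && !stopWords.contains(word)) {
                kept.add(new Tok(word, pos, end));
            }
            pos = end + 1; // skip the separating space
        }
        return kept;
    }

    /** Same analyzer, plus a zero-width end marker carrying the true end offset. */
    static List<Tok> analyzeWithEndMarker(String value, Set<String> stopWords) {
        List<Tok> kept = analyze(value, stopWords);
        kept.add(new Tok("_END_", value.length(), value.length()));
        return kept;
    }

    /** Mirrors the rule quoted in this thread: offset = offsetEnd + 1. */
    static int nextBase(int base, List<Tok> kept) {
        if (kept.isEmpty()) {
            return base; // assumption: no tokens seen, so the base does not move
        }
        int offsetEnd = base + kept.get(kept.size() - 1).end;
        return offsetEnd + 1;
    }

    public static void main(String[] args) {
        Set<String> stop = Collections.singleton("the");
        Set<String> none = Collections.emptySet();
        String first = "quick fox the"; // 13 characters, ends with a stop word

        System.out.println(nextBase(0, analyze(first, none)));              // nothing filtered
        System.out.println(nextBase(0, analyze(first, stop)));              // "the" filtered
        System.out.println(nextBase(0, analyzeWithEndMarker(first, stop))); // marker added
    }
}
```

In this toy setup the second field instance would start at 14 when nothing is filtered, at 10 when the trailing "the" is dropped, and at 14 again once the end marker is appended. The marker term then ends up in the index, so it has to be something you would never search on.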

Mike

Renaud Delbru wrote:

> Do you know if there will be side-effects if we replace in  
> DocumentWriter$FieldData#invertField
> offset = offsetEnd+1;
> by
> offset = stringValue.length();
>
> I still do not understand the reason for choosing this way of  
> incrementing the start offset.
>
> Regards.
>
> Michael McCandless wrote:
>>
>> This is how Lucene has worked for quite some time (since 1.9).
>>
>> When there are multiple fields with the same name in one Document,  
>> each field's offset starts from the last offset (offset of the  
>> last token) seen in the previous field.  If tokens are skipped at  
>> the end there's no way IndexWriter can know (because tokenStream  
>> doesn't return them).  It's as if we need the ability to query a  
>> tokenStream for its "final" offset or something.
>>
>> One workaround might be to insert an "end marker" token, with the  
>> true end offset, which is a term you would never search on?
>>
>> Mike
>>
>> Renaud Delbru wrote:
>>
>>> Hi,
>>>
>>> I currently use multiple fieldable instances for indexing  
>>> sentences of a document.
>>> When there is only one single fieldable instance, the token  
>>> offset generation performed in DocumentWriter is correct.
>>> The problem appears when there are two or more fieldable  
>>> instances. In DocumentWriter$FieldData#invertField method, if the  
>>> field is tokenized, instead of updating offset attribute with  
>>> stringValue.length() (which is performed if the field is not  
>>> tokenized, line 1458), you update the offset attribute with the  
>>> end offset of the last token (line 1503: offset = offsetEnd+1;).
>>> As a consequence, if a token has been filtered (for example a  
>>> stopword, a dot, a space, etc.), the offset attribute is updated  
>>> with the end offset of the last token that was not filtered. In  
>>> this case, the offset attribute stores an incorrect offset (it is  
>>> shifted back), and all subsequent fieldable instances will have  
>>> their offsets shifted back as well.
>>>
>>> Is it a bug? Or is it the desired behavior (and if so, why)?
> -- 
> Renaud Delbru,
> E.C.S., Ph.D. Student,
> Semantic Information Systems and
> Language Engineering Group (SmILE),
> Digital Enterprise Research Institute,
> National University of Ireland, Galway.
> http://smile.deri.ie/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

