lucene-dev mailing list archives

From "Michael McCandless (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back
Date Tue, 06 Mar 2012 15:06:01 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223305#comment-13223305 ]

Michael McCandless commented on LUCENE-3854:
--------------------------------------------

OK, I see the problem. It's not a bug, but it is a long-standing trap in Lucene: you cannot
retrieve a Document (from the IndexReader.document API, "IR.document" below) and expect it to
accurately reflect what you had indexed.  Information is lost, e.g. whether each field was
tokenized and what the document boost was, and fields that were not stored are missing
entirely.  In this particular case, IR.document enables "tokenized" for each text field it
loads, which then causes the test failure.

This is a bad trap, since it tricks you into thinking you can load a stored document and reindex
it as is; instead, you have to create a new Document with the correct details on how it should
be indexed.
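
To make the mechanism concrete, here is a rough sketch in plain Java. It is a toy model,
not Lucene code: ToyField, ToyIndex and their methods are hypothetical names invented for
illustration, and the "analyzer" is just a split on non-alphanumeric characters. The only
thing it tries to show is that the tokenized flag is an index-time detail that is not
stored, so a loaded copy cannot know it and the caller has to supply it again.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the reindexing trap. None of these types are Lucene classes. */
final class ReindexTrap {

    /** A field as handed to the indexer: a value plus an index-time-only flag. */
    static final class ToyField {
        final String name;
        final String value;
        final boolean tokenized;

        ToyField(String name, String value, boolean tokenized) {
            this.name = name;
            this.value = value;
            this.tokenized = tokenized;
        }
    }

    /** Holds one document: indexed terms per field, plus the stored raw values. */
    static final class ToyIndex {
        private final Map<String, Set<String>> terms = new HashMap<>();
        private final Map<String, String> stored = new HashMap<>();

        /** Deletes the current document and indexes the given one. */
        void replace(List<ToyField> doc) {
            terms.clear();
            stored.clear();
            for (ToyField f : doc) {
                Set<String> fieldTerms = terms.computeIfAbsent(f.name, k -> new HashSet<>());
                if (f.tokenized) {
                    // Crude stand-in for an analyzer: split on non-alphanumerics.
                    for (String t : f.value.split("[^A-Za-z0-9]+")) {
                        if (!t.isEmpty()) {
                            fieldTerms.add(t);
                        }
                    }
                } else {
                    fieldTerms.add(f.value); // whole value is a single term
                }
                stored.put(f.name, f.value); // only the value is stored, not the flag
            }
        }

        /** Exact-term lookup, in the spirit of a term query. */
        boolean termQuery(String field, String term) {
            return terms.getOrDefault(field, Collections.emptySet()).contains(term);
        }

        /** Assumption of this model: every loaded field comes back marked tokenized. */
        List<ToyField> load() {
            List<ToyField> doc = new ArrayList<>();
            for (Map.Entry<String, String> e : stored.entrySet()) {
                doc.add(new ToyField(e.getKey(), e.getValue(), true));
            }
            return doc;
        }
    }

    /** Load the stored copy and feed it straight back in. */
    static boolean naiveRoundTripStillMatches() {
        ToyIndex index = new ToyIndex();
        index.replace(Collections.singletonList(new ToyField("id", "abc-123", false)));
        index.replace(index.load());
        return index.termQuery("id", "abc-123");
    }

    /** Load the stored values, but rebuild each field with the known settings. */
    static boolean rebuiltDocumentStillMatches() {
        ToyIndex index = new ToyIndex();
        index.replace(Collections.singletonList(new ToyField("id", "abc-123", false)));
        List<ToyField> rebuilt = new ArrayList<>();
        for (ToyField f : index.load()) {
            // The application, not the index, knows "id" must stay untokenized.
            rebuilt.add(new ToyField(f.name, f.value, false));
        }
        index.replace(rebuilt);
        return index.termQuery("id", "abc-123");
    }

    public static void main(String[] args) {
        System.out.println("naive reindex matches:   " + naiveRoundTripStillMatches());
        System.out.println("rebuilt reindex matches: " + rebuiltDocumentStillMatches());
    }
}
```

In this model the naive round trip indexes "abc" and "123" as separate terms, so the exact
lookup for "abc-123" no longer matches, while the rebuilt document still does. In a real
application the per-field settings would have to come from wherever the schema is kept,
since the index cannot hand them back.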

Really, IR.document should not even return a Document/Field.
                
> Non-tokenized fields become tokenized when a document is deleted and added back
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-3854
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3854
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Benson Margulies
>
> https://github.com/bimargulies/lucene-4-update-case is a JUnit test case that seems to
> show a problem with the current trunk. It creates a document with a Field typed as
> StringField.TYPE_STORED and a value with a "-" in it. A TermQuery can find the value,
> initially, since the field is not tokenized.
> Then, the case reads the Document back out through a reader. In the copy of the Document
> that gets read out, the Field now has the tokenized bit turned on.
> Next, the case deletes and adds the Document. The 'tokenized' bit is respected, so now
> the field gets tokenized, and the result is that the query on the term with the - in it no
> longer works.
> So I think that the defect here is in the code that reconstructs the Document when read
> from the index, and which turns on the tokenized bit.
> I have an ICLA on file so you can take this code from github, but if you prefer I can
> also attach it here.
