lucene-java-user mailing list archives

From "Austin, Carl" <Carl.Aus...@baesystemsdetica.com>
Subject RE: Why is the old value still in the index
Date Fri, 16 Dec 2011 17:32:13 GMT
The .docFreq() call returns the number of documents that contain the
current term in the enum, not a count across all terms in the enum.

Also be aware of this, from the Lucene wiki: "Once a document is deleted it
will not appear in TermDocs nor TermPositions enumerations, nor any
search results. Attempts to load the document will result in an
exception. The presence of this document may still be reflected in the
docFreq statistics, and thus alter search scores, though this will be
corrected eventually as segments containing deletions are merged."

You can check more accurately by iterating over the TermDocs for the term,
which skips deleted documents, if you need to.
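
To make the distinction concrete, here is a minimal sketch in plain Java of
the behaviour the wiki describes. It is a toy model, not Lucene code: the
class SegmentModel and its methods (addDocument, deleteDocument, docFreq,
liveDocCount, merge) are hypothetical names invented for illustration.
docFreq stands in for the stored statistic and liveDocCount for walking the
TermDocs. In Lucene 3.x the real counterparts would be
IndexReader.termDocs(Term) and TermDocs.next().

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy model of a single index segment. Hypothetical; NOT Lucene's API. */
public class SegmentModel {
    // term text -> ids of the documents that were indexed with that term
    private final Map<String, List<Integer>> postings = new TreeMap<String, List<Integer>>();
    // deletions are only marked here; the postings stay untouched until merge()
    private final BitSet deleted = new BitSet();
    private int nextDocId = 0;

    public int addDocument(String... terms) {
        int docId = nextDocId++;
        for (String term : terms) {
            List<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new ArrayList<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
        return docId;
    }

    public void deleteDocument(int docId) {
        deleted.set(docId);
    }

    /** Stored per-term statistic: still counts documents marked as deleted. */
    public int docFreq(String term) {
        List<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    /** Walks the postings for one term and skips deleted documents. */
    public int liveDocCount(String term) {
        List<Integer> docs = postings.get(term);
        if (docs == null) {
            return 0;
        }
        int live = 0;
        for (int docId : docs) {
            if (!deleted.get(docId)) {
                live++;
            }
        }
        return live;
    }

    public Set<String> terms() {
        return postings.keySet();
    }

    /** Rewrites the postings without deleted documents; empty terms disappear. */
    public void merge() {
        Iterator<Map.Entry<String, List<Integer>>> entries = postings.entrySet().iterator();
        while (entries.hasNext()) {
            List<Integer> docs = entries.next().getValue();
            Iterator<Integer> ids = docs.iterator();
            while (ids.hasNext()) {
                if (deleted.get(ids.next())) {
                    ids.remove();
                }
            }
            if (docs.isEmpty()) {
                entries.remove();
            }
        }
        deleted.clear();
    }

    public static void main(String[] args) {
        SegmentModel segment = new SegmentModel();
        int first = segment.addDocument("test");
        segment.addDocument("test", "test2");
        segment.deleteDocument(first);
        System.out.println("docFreq before merge: " + segment.docFreq("test"));
        System.out.println("live docs before merge: " + segment.liveDocCount("test"));
        segment.merge();
        System.out.println("docFreq after merge: " + segment.docFreq("test"));
    }
}
```

In this model the statistic and the live count disagree only between the
delete and the merge, which is the window the wiki text is warning about.
Each count is also per term: "test" and "test2" are counted separately.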

-----Original Message-----
From: Paul Taylor [mailto:paul_t100@fastmail.fm] 
Sent: 16 December 2011 17:20
To: Ian Lea
Cc: java-user@lucene.apache.org
Subject: Re: Why is the old value still in the index

On 16/12/2011 17:10, Ian Lea wrote:
> Shouldn't
>
> iw.updateDocument(new Term(FIELD1,"term1"),document);
>
> be
>
> iw.updateDocument(new Term(FIELD1,"test"),document);
>
> if you want to replace the first doc?
Hmm, you are right. If I change it, I then get

TermDocsFreq1
test
TermDocsFreq1
test2



(but that doesn't resolve the problem with my real code, which doesn't seem
to have this mistake :()

What I don't understand, then, is why in the incorrect example I don't get

TermDocsFreq2


if I've actually created another document rather than updating one?

> -- Ian.
>
> On Fri, Dec 16, 2011 at 4:54 PM, Paul Taylor <paul_t100@fastmail.fm> wrote:
>> I'm adding documents to an index. At a later date I modify a document and
>> update the index, close the writer and open a new IndexReader. My
>> IndexReader iterates over the terms for that field and docFreq() returns
>> one, as I would expect. However, the iterator returns both the old value
>> of the document and the new value. I don't expect (or want) the old value
>> to still be in the index, so why is this?
>>
>>
>> This full test program generates:
>>
>> TermDocsFreq1
>> test
>> TermDocsFreq1
>> test
>> test2
>>
>> I don't expect to see 'test' listed the second time.
>>
>>
>> package com.jthink.jaikoz;
>>
>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.document.Field;
>> import org.apache.lucene.index.*;
>> import org.apache.lucene.store.RAMDirectory;
>> import org.apache.lucene.util.Version;
>>
>>
>> public class LuceneTest
>> {
>>     public  static void main(String []args)
>>     {
>>         try
>>         {
>>             String FIELD1="field1";
>>             RAMDirectory dir = new RAMDirectory();
>>             IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_35,
>>                 new StandardAnalyzer(Version.LUCENE_35));
>>             IndexWriter       iw  = new IndexWriter(dir, iwc);
>>             Document document = new Document();
>>             document.add(new Field(FIELD1,"test", Field.Store.YES,
>> Field.Index.ANALYZED));
>>             iw.addDocument(document);
>>             iw.close();
>>
>>             IndexReader ir = IndexReader.open(dir,true);
>>             TermEnum terms = ir.terms(new Term(FIELD1));
>>             System.out.println("TermDocsFreq"+terms.docFreq());
>>             do
>>             {
>>                 if (terms.term() != null)
>>                 {
>>                     System.out.println(terms.term().text());
>>                 }
>>             }
>>             while (terms.next() && terms.term().field().equals(FIELD1));
>>
>>             IndexWriterConfig iwc2 = new IndexWriterConfig(Version.LUCENE_35,
>>                 new StandardAnalyzer(Version.LUCENE_35));
>>             iw  = new IndexWriter(dir, iwc2);
>>             document = new Document();
>>             document.add(new Field(FIELD1,"test2", Field.Store.YES,
>> Field.Index.ANALYZED));
>>             iw.updateDocument(new Term(FIELD1,"term1"),document);
>>             iw.close();
>>
>>             ir = IndexReader.open(dir,true);
>>             terms = ir.terms(new Term(FIELD1));
>>             System.out.println("TermDocsFreq"+terms.docFreq());
>>             do
>>             {
>>                 if (terms.term() != null)
>>                 {
>>                     System.out.println(terms.term().text());
>>                 }
>>             }
>>             while (terms.next() && terms.term().field().equals(FIELD1));
>>         }
>>         catch(Exception ex)
>>         {
>>             ex.printStackTrace();
>>         }
>>     }
>>
>> }
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>

