lucene-dev mailing list archives

From "Michael McCandless (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3838) IndexWriter.maybeMerge() removes deleted documents from index (Lucene 3.1.0 to 3.5.0)
Date Sun, 04 Mar 2012 10:35:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221842#comment-13221842 ]

Michael McCandless commented on LUCENE-3838:
--------------------------------------------

Lucene's maybeMerge, even in 3.1.0, will merge away deleted documents; I'm not sure why you
don't see that happening.

Really, when Lucene reclaims deletions and renumbers its documents is an internal implementation
detail.  Applications should not rely on this behavior.  Can you add your own ID field to
the index?  Or, alternatively, never delete documents and instead use a filter in the application
to skip them.  Or, in 4.0 (trunk), you could perhaps write a custom codec that "pretends"
there are no deletions when merges run...
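
To illustrate the first suggestion, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class StableIdSketch, the appId field, and every method name are hypothetical and chosen only for this example. List positions stand in for internal docIds, and merge() stands in for a merge that reclaims deletions. The point it tries to show is that a lookup keyed on an application-owned ID keeps working after the internal numbering shifts.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (not Lucene): list positions play the role of internal docIds. */
public class StableIdSketch {

    /** A stored document carrying an application-assigned key. */
    static final class Doc {
        final String appId; // hypothetical application-owned ID field
        final String text;

        Doc(String appId, String text) {
            this.appId = appId;
            this.text = text;
        }
    }

    private final List<Doc> docs = new ArrayList<>();     // position == internal docId
    private final Set<Integer> deleted = new HashSet<>(); // internal docIds marked deleted

    void add(String appId, String text) {
        docs.add(new Doc(appId, text));
    }

    /** Marks the document with the given application ID as deleted. */
    void deleteByAppId(String appId) {
        int docId = internalDocId(appId);
        if (docId >= 0) {
            deleted.add(docId);
        }
    }

    /** Simulates a merge: drops deleted docs, renumbering everything after them. */
    void merge() {
        List<Doc> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.contains(i)) {
                live.add(docs.get(i));
            }
        }
        docs.clear();
        docs.addAll(live);
        deleted.clear();
    }

    /** Current internal docId for an application ID, or -1 if absent or deleted. */
    int internalDocId(String appId) {
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.contains(i) && docs.get(i).appId.equals(appId)) {
                return i;
            }
        }
        return -1;
    }

    /** Looks a document up by application ID; null if absent or deleted. */
    String textFor(String appId) {
        int docId = internalDocId(appId);
        return docId < 0 ? null : docs.get(docId).text;
    }

    public static void main(String[] args) {
        StableIdSketch index = new StableIdSketch();
        index.add("a", "text0");
        index.add("b", "text1");
        index.add("c", "text2");

        System.out.println("docId of c before merge: " + index.internalDocId("c"));
        index.deleteByAppId("b");
        index.merge(); // internal numbering shifts here
        System.out.println("docId of c after merge:  " + index.internalDocId("c"));
        System.out.println("text of c by appId:      " + index.textFor("c"));
    }
}
```

In a real index the same idea would presumably mean storing the key as a non-analyzed field and looking documents up with a term query on it, rather than holding on to docIds across merges.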
                
> IndexWriter.maybeMerge() removes deleted documents from index (Lucene 3.1.0 to 3.5.0)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3838
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3838
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 3.1, 3.2, 3.3, 3.4, 3.5
>         Environment: Windows, Linux, OSX
>            Reporter: Ivan Stojanovic
>            Priority: Blocker
>              Labels: api-change
>         Attachments: TempTest.java
>
>
> My company uses Lucene for high-performance, heavily loaded farms of translation repositories
> with hundreds of simultaneous add/delete/update/search/retrieve threads. To support this
> complex architecture, among other techniques, I rely on docIds staying unchanged until I
> explicitly ask for them to change (using IndexWriter.optimize() / IndexWriter.forceMerge()).
> LogMergePolicy is used to get this behavior.
> This worked fine until we upgraded Lucene from 3.0.2 to 3.5.0. Before version 3.1.0, a merge
> triggered by IndexWriter.addDocument() didn't expunge deleted documents, which ensured that
> docIds stayed unchanged and made some critical jobs possible without impact on index size.
> IndexWriter.optimize() did the actual removal of deleted documents.
> From Lucene version 3.1.0 on, IndexWriter.maybeMerge() does the same thing as
> IndexWriter.forceMerge() regarding deleted documents. There is no difference. This leads to
> unpredictable changes to the internal index structure during simple document add (and possibly
> delete) operations, at an undefined point in time. I looked into the Lucene source code
> and can definitely confirm this.
> This issue makes our Lucene client code totally unusable.
> Solution steps:
> 1) Add a flag somewhere that controls whether deleted documents are removed in maybeMerge().
> Note that this is only half of what we need here.
> 2) Make forceMerge() always remove deleted documents, whether or not maybeMerge() removes
> them. Alternatively, another parameter could be added to forceMerge() that tells whether
> deleted documents should be removed from the index.
> The sample JUnit code that replicates this issue is below.
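> import static org.junit.Assert.assertEquals;
> import static org.junit.Assert.assertTrue;
> import java.io.File;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.KeywordAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.*;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.util.Version;
> import org.junit.Test;
> public class TempTest {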
>     private Analyzer _analyzer = new KeywordAnalyzer();
>     @Test
>     public void testIndex() throws Exception {
> 	File indexDir = new File("sample-index");
> 	if (indexDir.exists()) {
> 	    indexDir.delete();
> 	}
> 	FSDirectory index = FSDirectory.open(indexDir);
> 	Document doc;
> 	IndexWriter writer = createWriter(index, true);
> 	try {
> 	    doc = new Document();
> 	    doc.add(new Field("field", "text0", Field.Store.YES,
> 		    Field.Index.ANALYZED));
> 	    writer.addDocument(doc);
> 	    doc = new Document();
> 	    doc.add(new Field("field", "text1", Field.Store.YES,
> 		    Field.Index.ANALYZED));
> 	    writer.addDocument(doc);
> 	    doc = new Document();
> 	    doc.add(new Field("field", "text2", Field.Store.YES,
> 		    Field.Index.ANALYZED));
> 	    writer.addDocument(doc);
> 	    writer.commit();
> 	} finally {
> 	    writer.close();
> 	}
> 	IndexReader reader = IndexReader.open(index, false);
> 	try {
> 	    reader.deleteDocument(1);
> 	} finally {
> 	    reader.close();
> 	}
> 	writer = createWriter(index, false);
> 	try {
> 	    for (int i = 3; i < 100; i++) {
> 		doc = new Document();
> 		doc.add(new Field("field", "text" + i, Field.Store.YES,
> 			Field.Index.ANALYZED));
> 		writer.addDocument(doc);
> 		writer.commit();
> 	    }
> 	} finally {
> 	    writer.close();
> 	}
> 	boolean deleted;
> 	String text;
> 	reader = IndexReader.open(index, true);
> 	try {
> 	    deleted = reader.isDeleted(1);
> 	    text = reader.document(1).get("field");
> 	} finally {
> 	    reader.close();
> 	}
> 	assertTrue(deleted); // This line breaks
> 	assertEquals("text1", text);
>     }
>     private MergePolicy createEngineMergePolicy() {
> 	LogDocMergePolicy mergePolicy = new LogDocMergePolicy();
> 	mergePolicy.setCalibrateSizeByDeletes(false);
> 	mergePolicy.setUseCompoundFile(true);
> 	mergePolicy.setNoCFSRatio(1.0);
> 	return mergePolicy;
>     }
>     private IndexWriter createWriter(Directory index, boolean create)
> 	    throws Exception {
> 	IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_35,
> 		_analyzer);
> 	iwConfig.setOpenMode(create ? IndexWriterConfig.OpenMode.CREATE
> 		: IndexWriterConfig.OpenMode.APPEND);
> 	iwConfig.setMergePolicy(createEngineMergePolicy());
> 	iwConfig.setMergeScheduler(new ConcurrentMergeScheduler());
> 	return new IndexWriter(index, iwConfig);
>     }
> }


