lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Greg Gershman <gregge...@yahoo.com>
Subject Help with mass delete from large index
Date Mon, 13 Feb 2006 14:47:04 GMT
I'm trying to delete a large number of documents
(~15million) from a a large index (30+ million
documents).  I've started with an optimized index, and
a list of docIds (our own unique identifier for a
document, not a Lucene doc number) to pass to the
IndexReader.delete(Term t) method.  I've had a few
different problems.

The following code is inside the loop that iterates
through the document IDs:

                  try {
                    Term t = new Term("docID",
String.valueOf(docID));
                   
deletedCount+=indexReader.delete(t);
                    }
                   catch (Exception e)
                   {
                       System.out.println("Error while
deleting docID#" + docID);
                       e.printStackTrace();
                   }

In order to commit the deletions, I also close and
reopen the IndexReader periodically.

At first I was reopening the IndexReader after every
500K documents deleted.  The problem was that after
~60-75K deletions, the delete call began to throw a
NullPointerException:

Error while deleting docID#27136356
java.lang.NullPointerException
        at java.lang.String.compareTo(String.java:402)
        at
org.apache.lucene.index.Term.compareTo(Term.java:76)
        at
org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
        at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:132)
        at
org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:51)
        at
org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:364)
        at
org.apache.lucene.index.IndexReader.delete(IndexReader.java:449)
        at IndexEraser.main(IndexEraser.java:32)

After a little fiddling around, I tried reducing the
interval between reopens to 5000, and most of the
NullPointerExceptions went away.

A test search of the resulting, unoptimized index
worked fine.

I then optimized the index to reduce the size of the
index.  Now, instead of getting data back for many of
the results, I get a null value.

Any ideas?  I'm really confused, and the only other
option I can think of is to reindex the documents I
need, which would take much longer than deleting the
ones I dont.

Thanks!

Greg Gershman

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message