lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Max Lynch <ihas...@gmail.com>
Subject Continuously iterate over documents in index
Date Tue, 13 Jul 2010 21:17:50 GMT
Hi,
I would like to continuously iterate over the documents in my lucene index
as the index is updated.  Kind of like a "stream" of documents.  Is there a
way I can achieve this?

Would something like this be sufficient (untested):

 int currentDocId = 0;
 while(true) {

     for(; currentDocId < reader.maxDoc(); currentDocId++) {

          if(!reader.isDeleted(currentDocId)) {
               Document d = reader.document(currentDocId);
          }
     }

     // Maybe sleep here or something

     IndexReader newReader = reader.reopen();
     if(newReader != reader) {
          reader.close();
          reader = newReader;
     }
}

Right now, I do some NLP  on the index that would slow down my indexing if
done at the same time, so that is why I'm looking for a solution that works
in the background like this.  Another concern I have is that starting from
scratch (fresh invocation of my program) requires me to load a lot of extra
data and then iterate through hundreds of thousands of documents just to get
to the newest docs that I haven't processed yet.  I would rather just start
from the new newest doc and go forward.

I am currently checking whether or not I've processed a Document by looking
up a field in the Document in a Mongo db, but is there a way I could
reliably use the id of the document from the reader to check to see if I've
looked at this document already?  I've heard that IndexReader.document() is
slow so I would like to skip that call if I know I've processed the document
already.

Any ideas?

Thanks,
Max

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message