Hi,
I would like to continuously iterate over the documents in my Lucene index
as the index is updated, kind of like a "stream" of documents. Is there a
way I can achieve this?
Would something like this be sufficient (untested):
    int currentDocId = 0;
    while (true) {
        for (; currentDocId < reader.maxDoc(); currentDocId++) {
            if (!reader.isDeleted(currentDocId)) {
                Document d = reader.document(currentDocId);
            }
        }
        // Maybe sleep here or something
        IndexReader newReader = reader.reopen();
        if (newReader != reader) {
            reader.close();
            reader = newReader;
        }
    }
Right now, I do some NLP on the index that would slow down my indexing if
done at the same time, which is why I'm looking for a solution that works
in the background like this. Another concern is that starting from
scratch (a fresh invocation of my program) requires me to load a lot of extra
data and then iterate through hundreds of thousands of documents just to get
to the newest docs that I haven't processed yet. I would rather just start
from the newest doc and go forward.
I am currently checking whether I've processed a Document by looking
up a field from the Document in MongoDB, but is there a way I could
reliably use the id of the document from the reader to check whether I've
looked at this document already? I've heard that IndexReader.document() is
slow, so I would like to skip that call if I know I've processed the document
already.
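For concreteness, the check I do today looks roughly like the sketch below. The in-memory set only stands in for the Mongo lookup, and ProcessedTracker, markIfNew and the key values are made-up names for illustration, not anything from Lucene or my real code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers which documents were already processed,
// keyed by a field value stored in each document (not the reader's doc id).
class ProcessedTracker {
    // In-memory stand-in for the external lookup (assumption for this sketch).
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a key is offered, false on every repeat.
    boolean markIfNew(String key) {
        return seen.add(key);
    }

    // Number of distinct keys recorded so far.
    int size() {
        return seen.size();
    }
}
```

The loop would call markIfNew with the field value and only run the NLP step when it returns true. The catch is that getting the field value still means loading the Document first, which is the call I am hoping to avoid.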
Any ideas?
Thanks,
Max