lucene-java-user mailing list archives

From Shashi Kant <sk...@sloan.mit.edu>
Subject Re: Continuously iterate over documents in index
Date Tue, 13 Jul 2010 23:12:25 GMT
On Tue, Jul 13, 2010 at 5:17 PM, Max Lynch <ihasmax@gmail.com> wrote:
> Hi,
> I would like to continuously iterate over the documents in my lucene index
> as the index is updated.  Kind of like a "stream" of documents.  Is there a
> way I can achieve this?
>
> Would something like this be sufficient (untested):
>
>  int currentDocId = 0;
>  while(true) {
>
>     for(; currentDocId < reader.maxDoc(); currentDocId++) {
>
>          if(!reader.isDeleted(currentDocId)) {
>               Document d = reader.document(currentDocId);
>          }
>     }
>
>     // Maybe sleep here or something
>
>     IndexReader newReader = reader.reopen();
>     if(newReader != reader) {
>          reader.close();
>          reader = newReader;
>     }
> }


That looks OK.

>
> Right now, I do some NLP  on the index that would slow down my indexing if
> done at the same time, so that is why I'm looking for a solution that works
> in the background like this.  Another concern I have is that starting from
> scratch (fresh invocation of my program) requires me to load a lot of extra
> data and then iterate through hundreds of thousands of documents just to get
> to the newest docs that I haven't processed yet.  I would rather just start
> from the newest doc and go forward.
>
> I am currently checking whether or not I've processed a Document by looking
> up a field in the Document in a Mongo db, but is there a way I could
> reliably use the id of the document from the reader to check to see if I've
> looked at this document already?  I've heard that IndexReader.document() is
> slow so I would like to skip that call if I know I've processed the document
> already.


You could add a field to each doc, say "Processed", and store a
Yes/No value in it. Then run a search query on that field, which
should give you the collection of unprocessed docs.
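
To make the flag idea concrete, here is a minimal sketch in plain Java. It
is not Lucene code: Store, Doc, and runPass are hypothetical names, and the
map is only an in-memory stand-in for the index. In Lucene itself I would
expect the lookup to be a TermQuery on the flag field, and the flag change
to go through IndexWriter.updateDocument(Term, Document), which replaces
the whole document, so each doc would need a unique key field to update by.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for an index whose documents carry a "processed" flag. */
class ProcessedFlagSketch {

    /** Hypothetical document: a unique key, a body, and the flag value. */
    static final class Doc {
        final String key;
        final String body;
        final String processed; // "yes" or "no"

        Doc(String key, String body, String processed) {
            this.key = key;
            this.body = body;
            this.processed = processed;
        }
    }

    /** Hypothetical store keyed by a unique field (assumption: each doc has one). */
    static final class Store {
        private final Map<String, Doc> docs = new LinkedHashMap<>();

        /** New documents always start out unprocessed. */
        void add(String key, String body) {
            docs.put(key, new Doc(key, body, "no"));
        }

        /** Plays the role of a search on processed:no. Returns a snapshot list. */
        List<Doc> unprocessed() {
            List<Doc> hits = new ArrayList<>();
            for (Doc d : docs.values()) {
                if ("no".equals(d.processed)) {
                    hits.add(d);
                }
            }
            return hits;
        }

        /** Replaces the whole document with a copy whose flag is "yes". */
        void markProcessed(String key) {
            Doc old = docs.get(key);
            if (old != null) {
                docs.put(key, new Doc(old.key, old.body, "yes"));
            }
        }
    }

    /** One background pass: handle each unprocessed doc, then flip its flag. */
    static int runPass(Store store) {
        int handled = 0;
        for (Doc d : store.unprocessed()) {
            // The slow NLP step would run here, away from the indexing path.
            store.markProcessed(d.key);
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.add("doc-1", "first");
        store.add("doc-2", "second");
        System.out.println(runPass(store)); // first pass handles both docs
        store.add("doc-3", "third");
        System.out.println(runPass(store)); // second pass handles only the new doc
    }
}
```

A fresh invocation of the program then only ever sees unprocessed docs, so
there is no need to walk the whole index or call document() on docs that
are already done.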
