lucene-java-user mailing list archives

From "Wojtek H" <>
Subject Re: The best way to iterate over document
Date Wed, 26 Mar 2008 15:29:06 GMT
Thank you for the reply. What I did not mention before is that for this
iteration we don't care about scoring, so that's not an issue at all.
Creating a Filter with a BitSet seems like a much better idea than keeping
a HitIterator in memory. Am I right that in such a case, with
MatchAllDocsQuery, memory usage would be around
( NUM_OF_DOCS_IN_INDEX / 8 ) bytes?
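That estimate can be sanity-checked with plain `java.util.BitSet` (no Lucene needed): a BitSet stores one bit per document in a `long[]`, so the cost is roughly one byte per eight documents, rounded up to whole 64-bit words. A minimal sketch (the class and method names here are illustrative, not from Lucene):

```java
import java.util.BitSet;

public class BitSetMemory {
    // Estimated bytes for one bit per document, rounded up to whole
    // 64-bit words, matching how java.util.BitSet stores its bits.
    static long estimateBytes(long numDocs) {
        long words = (numDocs + 63) / 64;
        return words * 8;
    }

    public static void main(String[] args) {
        // 10 million docs -> roughly 1.25 MB for the filter
        System.out.println(estimateBytes(10_000_000L));

        BitSet allDocs = new BitSet(10_000_000);
        allDocs.set(0, 10_000_000); // MatchAllDocsQuery: every doc's bit set
        System.out.println(allDocs.cardinality());
    }
}
```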
We haven't measured it yet, but do you think the time spent accessing
documents (reader.doc(i)) is large enough to make the iteration in the
HitCollector (without accessing any objects) over documents already
returned almost unnoticeable by comparison?
Another question: if I don't care about scoring, is there a way to make
Lucene skip calculating the score (I don't know whether that time
matters)? A HitCollector receives each doc and its score (as far as I
remember, the difference here is that the score is not normalized to a
value between 0 and 1). Is there a way (and does it make sense) to make
scoring faster in such a case?
And to make things clear: am I right that if I keep using the same
searcher across requests for chunks of docs, I will see neither additions
nor deletions that happen in the meantime? So if I wanted to iterate over
a point-in-time snapshot, keeping the same searcher open would do.
Thanks and regards,

2008/3/26, Erick Erickson <>:
> Why not keep a Filter in memory? It consists of a single bit per document,
>  and the ordinal position of that bit is the Lucene doc ID. You could create
>  this reasonably quickly for the *first* query that came in, via a HitCollector.
>  Then each time you wanted another chunk, use the filter to know which
>  docs to return. You could either, say, extend the Filter class and add
>  some bookkeeping, or just zero out each bit that you returned to the user.
>  NOTE: you don't get relevance this way, but for the case of returning all
>  docs, do you really want it?
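The "zero out each bit that you returned" bookkeeping can be sketched with plain `java.util.BitSet`. In real code the BitSet would be filled by a HitCollector during the first search; here we just set every bit, as MatchAllDocsQuery would. The class and method names are illustrative, not Lucene API:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class ChunkedFilter {
    // Returns the next `chunkSize` doc IDs still marked in `remaining`,
    // clearing each bit as it is handed out, so the next call resumes
    // where this one stopped.
    static List<Integer> nextChunk(BitSet remaining, int chunkSize) {
        List<Integer> chunk = new ArrayList<>();
        int doc = remaining.nextSetBit(0);
        while (doc >= 0 && chunk.size() < chunkSize) {
            chunk.add(doc);
            remaining.clear(doc); // bookkeeping: this doc has been returned
            doc = remaining.nextSetBit(doc + 1);
        }
        return chunk;
    }

    public static void main(String[] args) {
        BitSet remaining = new BitSet(10);
        remaining.set(0, 10); // pretend the index has 10 docs, all matched
        System.out.println(nextChunk(remaining, 4)); // [0, 1, 2, 3]
        System.out.println(nextChunk(remaining, 4)); // [4, 5, 6, 7]
        System.out.println(nextChunk(remaining, 4)); // [8, 9]
    }
}
```

Each request is then a cheap `nextSetBit` walk instead of a re-executed query; deleted docs just need their bits checked against the reader before being returned.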
>  About updating the index. Remember that there is no "update in place".
>  So you'll only have to check whether any document in the filter has been
>  deleted when you are returning. Then you'd have to do something about
>  looking for any new additions as you returned the last document in the
>  set...
>  But remember that until you close/reopen the searcher, you won't see changes
>  anyway.....
>  But you may not need to do any of this. If, each time you return a chunk,
>  you're using a Hits object, then this is the first thing I'd change. A Hits
>  object re-executes the query every 100th element you look at. So, assume
>  you have something like
>  (rough pseudocode)
>  int idx;
>  for (idx = 0; idx < firstDocInChunk; ++idx)
>  {
>      // walk past the docs already returned in earlier chunks
>  }
>  for (; idx < firstDocInChunk + chunkSize; ++idx)
>  {
>      // assemble doc for return
>  }
>  and if the first doc you want to return is number 1,000, you'll actually
>  be re-executing the query about 10 times for that one request, which
>  probably accounts for your quadratic time.
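To make the quadratic growth concrete, here is a back-of-the-envelope count assuming the every-100-elements re-execution model described above (the model, class, and method names are illustrative, not measured Lucene behavior):

```java
public class QuadraticCost {
    // Re-executions for one request that must walk up to doc `firstDoc`,
    // assuming the query is re-run once per 100 elements looked at.
    static long reexecutionsForRequest(long firstDoc) {
        return firstDoc / 100;
    }

    // Total re-executions to page through `numDocs` docs in fixed-size
    // chunks: each chunk pays for walking past everything before it.
    static long totalReexecutions(long numDocs, long chunkSize) {
        long total = 0;
        for (long start = 0; start < numDocs; start += chunkSize) {
            total += reexecutionsForRequest(start);
        }
        return total;
    }

    public static void main(String[] args) {
        // 100,000 docs in chunks of 1,000: 100 requests whose costs
        // form an arithmetic series 0 + 10 + 20 + ... + 990 = 49,500
        System.out.println(totalReexecutions(100_000, 1_000));
    }
}
```

Doubling the doc count roughly quadruples the total, which is exactly the quadratic behavior described in the original question.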
>  So I'd try just using a new HitCollector each time and see if that solves
>  your problem before getting fancy. There really shouldn't be any
>  noticeable difference between the first and last request unless you're
>  doing something like accessing the documents before you get to
>  the first one you expect to return. And a TopDocs would even
>  preserve scoring.
>  Best
> Erick
>  On Wed, Mar 26, 2008 at 5:48 AM, Wojtek H <> wrote:
>  > Hi all,
>  >
>  > our problem is to choose the best (fastest) way to iterate over a huge
>  > set of documents (the basic and most important case is iterating over
>  > all documents in the index). A slow process accesses the documents, and
>  > currently this is done by repeating a query (for instance
>  > MatchAllDocsQuery): it processes the first N docs, then repeats the
>  > query and processes the next N docs, and so on. Repeating the query
>  > means quadratic time! So we are thinking about changing the way docs
>  > are accessed.
>  > For a generic query, the only way we see to speed it up is to keep the
>  > HitCollector in memory between requests for docs. Isn't this approach
>  > too memory-consuming?
>  > For iterating over all documents, I was wondering if there is a way to
>  > determine the set of index ids over which we could iterate (and of
>  > course control index changes - if the index changes between requests we
>  > should probably invalidate the 'iterating session').
>  > What is the best solution for this problem?
>  > Thanks and regards,
>  >
>  > wojtek
>  >
