lucene-java-user mailing list archives

From Alexander Devine <alex.dev...@gmail.com>
Subject Possible to do an indexorder sort over a MultiSearcher?
Date Tue, 25 Oct 2011 19:24:42 GMT
Hi all,

I'm trying to provide an efficient way for a client to page over all of the
documents in multiple Lucene indexes that I'm querying with a MultiSearcher
(~1-2 million docs). Unfortunately, I can't use the standard paging algorithm
of fetching a TopDocs large enough to reach the last record needed and then
skipping all of the preceding pages, because the queries get extremely slow
and memory usage becomes prohibitive as the client requests higher and
higher page numbers.

Thus, my workaround was to run a search using an index-order sort (that is,
sort by document ID), so that the client can page over the results by running
a query that says "get me all the documents whose doc ID is greater than the
last doc ID of the previous page". This way the client only ever asks for a
TopDocs the size of a single page, but can still step forward to eventually
get all the documents in the index.
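In plain Java terms, the cursor-style paging described above can be sketched like this (an int array stands in for an index in doc-ID order, and page() stands in for the "doc ID greater than" query plus a page-sized TopDocs; the names and page size are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class CursorPaging {
    // Stand-in for one index: doc IDs in index order.
    static final int[] DOCS = {0, 1, 2, 3, 4, 5, 6};
    static final int PAGE_SIZE = 3;

    // Simulates "all documents with doc ID > afterDocId", capped at one page.
    static List<Integer> page(int afterDocId) {
        List<Integer> out = new ArrayList<>();
        for (int doc : DOCS) {
            if (doc > afterDocId) {
                out.add(doc);
                if (out.size() == PAGE_SIZE) break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int cursor = -1; // before the first document
        List<Integer> all = new ArrayList<>();
        List<Integer> p;
        while (!(p = page(cursor)).isEmpty()) {
            all.addAll(p);
            cursor = p.get(p.size() - 1); // last doc ID of the previous page
        }
        System.out.println(all); // every document, one page at a time
    }
}
```

The point is that each request does work proportional to one page, not to all the pages that precede it.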

While this works when searching over a single IndexReader, it fails when
using a MultiSearcher, for two reasons:
1. Sorting by docId doesn't really work in a MultiSearcher because of the
way the searcher munges the IDs. For example, if there are 2 indexes each
with 3 docs #1, #2 and #3, the MultiSearcher will return results that look
like "1,4,2,5,3,6".
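For reference, the "munged" IDs a MultiSearcher hands back are just the local IDs shifted by per-searcher offsets; a minimal sketch of that bookkeeping (mirroring what its subSearcher()/subDoc() methods undo, with a made-up STARTS array for two sub-indexes of three 0-based docs each):

```java
public class DocIdMapping {
    // Offsets at which each sub-index's doc IDs start in the merged ID space;
    // the last entry is the total doc count (assumed sizes for this example).
    static final int[] STARTS = {0, 3, 6};

    // Global ID for doc `localDoc` of sub-index `i` (what MultiSearcher does).
    static int toGlobal(int i, int localDoc) {
        return STARTS[i] + localDoc;
    }

    // Which sub-index a global ID belongs to (mirrors subSearcher()).
    static int subSearcher(int globalDoc) {
        int i = 0;
        while (i + 1 < STARTS.length - 1 && globalDoc >= STARTS[i + 1]) i++;
        return i;
    }

    // Local ID within that sub-index (mirrors subDoc()).
    static int subDoc(int globalDoc) {
        return globalDoc - STARTS[subSearcher(globalDoc)];
    }
}
```

So a comparison of global IDs across sub-indexes says nothing about the local order, which is why the merged results interleave as described above.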
2. The "MinimumDocIdQuery" I wrote only works when you pass it the ORIGINAL
doc ID that is local to the index reader, not the one that was munged by
the MultiSearcher.

Does anyone have advice on working around this? I was thinking that if I
could somehow get the "local" document ID back from the MultiSearcher, that
would work: I could return it with my search results, and sorted by that ID
things would look right, e.g. "1, 1, 2, 2, 3, 3". Any advice on a better way
to solve my original problem (running over all of the documents in a
potentially very large index with time- and memory-efficient paging) would
also be appreciated.
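One wrinkle with paging on local IDs alone is that they repeat across sub-indexes ("1, 1, 2, 2, ..."), so "greater than the last ID" would skip ties. A plain-Java sketch of paging on the pair (local doc ID, searcher index) instead, which gives a total order the cursor can resume from (the int arrays stand in for the sub-indexes; nothing here is Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class MultiIndexCursor {
    // Stand-ins for two sub-indexes, each holding local doc IDs in order.
    static final int[][] INDEXES = {{1, 2, 3}, {1, 2, 3}};
    static final int PAGE_SIZE = 4;

    // One page of {searcherIndex, localDoc} pairs, ordered by local doc ID
    // then searcher index, starting strictly after the given cursor.
    static List<int[]> page(int afterLocal, int afterSearcher) {
        List<int[]> all = new ArrayList<>();
        for (int s = 0; s < INDEXES.length; s++)
            for (int d : INDEXES[s]) all.add(new int[]{s, d});
        all.sort((a, b) -> a[1] != b[1] ? a[1] - b[1] : a[0] - b[0]);
        List<int[]> out = new ArrayList<>();
        for (int[] hit : all) {
            boolean after = hit[1] > afterLocal
                    || (hit[1] == afterLocal && hit[0] > afterSearcher);
            if (after) {
                out.add(hit);
                if (out.size() == PAGE_SIZE) break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int curLocal = -1, curSearcher = 0; // before the first document
        List<Integer> locals = new ArrayList<>();
        List<int[]> p;
        while (!(p = page(curLocal, curSearcher)).isEmpty()) {
            for (int[] hit : p) locals.add(hit[1]);
            int[] last = p.get(p.size() - 1);   // resume after the last hit
            curSearcher = last[0];
            curLocal = last[1];
        }
        System.out.println(locals); // [1, 1, 2, 2, 3, 3]
    }
}
```

In a real implementation the "greater than" query would have to encode the same tie-break, e.g. by filtering on (localDoc > cursorDoc) OR (localDoc == cursorDoc AND searcher > cursorSearcher).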

Thanks,
Alex
