lucene-java-user mailing list archives

From Nigel <nigelspl...@gmail.com>
Subject Re: Efficient optimization of large indexes?
Date Thu, 06 Aug 2009 21:30:07 GMT
On Wed, Aug 5, 2009 at 3:50 PM, Michael McCandless <lucene@mikemccandless.com> wrote:

> On Wed, Aug 5, 2009 at 12:08 PM, Nigel <nigelspleen@gmail.com> wrote:
> > We periodically optimize large indexes (100 - 200gb) by calling
> > IndexWriter.optimize().  It takes a heck of a long time, and I'm
> > wondering if a more efficient solution might be the following:
> >
> > - Create a new empty index on a different filesystem
> > - Set a merge policy for the new index so it puts everything into one
> >   giant segment (not sure how to do this off-hand, but I assume it's
> >   possible)
> > - Enumerate all documents in the unoptimized index and add them to the
> >   new index
>
> Actually IndexWriter must periodically flush, which will always
> create new segments, which will then always require merging.  I.e.,
> there's no way to just add everything to only one segment in one
> shot.
>

Hmm, that makes sense now that you mention it.  And if you have to merge in
the end anyway, there's no point to my idea of adding docs to a new index.

But addIndexes(IndexReader[]) as you suggest would solve that problem.
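
For my own notes, here's roughly what I think the addIndexes approach
would look like (a sketch against the Lucene 2.4-era API; the paths and
analyzer choice are just placeholders):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CopyViaAddIndexes {
  public static void main(String[] args) throws Exception {
    Directory source = FSDirectory.getDirectory(new File("/path/to/old-index"));
    Directory dest = FSDirectory.getDirectory(new File("/path/to/new-index"));

    // Open a writer on a fresh, empty index (create=true).
    IndexWriter writer = new IndexWriter(dest, new StandardAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    IndexReader reader = IndexReader.open(source, true); // read-only
    try {
      // Merges the source segments straight into the destination,
      // without re-analyzing any documents.
      writer.addIndexes(new IndexReader[] { reader });
      writer.optimize(); // collapse down to a single segment
    } finally {
      reader.close();
      writer.close();
    }
  }
}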


> Merge performance does seem rather slow... I recently profiled it and
> was surprised to find that the merging of the terms dict & postings was
> CPU-bound, even on a modern CPU (Core i7 920) and with 3 merges running
> concurrently.  I think most of the CPU cost comes from the pqueue
> that's used to do the merge sort, plus read/writeVInt.  When Lucene
> [eventually] switches to PForDelta, that should be more CPU friendly.


That's interesting.  I recently did one of our big merges on a different
server that has the same disks as the one I was using before, but a faster
processor.  It seemed like the merge was quite a bit faster than usual
(though it's possible I was fooled by other factors).
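
Out of curiosity, I sketched the merge-sort pattern you're describing,
just to see where the cycles go (toy code, not Lucene's actual
internals):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TermMergeSketch {
  private static class Cursor {
    final List<String> terms;
    int pos;
    Cursor(List<String> terms) { this.terms = terms; }
    String current() { return terms.get(pos); }
  }

  // K-way merge of already-sorted per-segment term lists.
  public static List<String> merge(List<List<String>> segments) {
    PriorityQueue<Cursor> pq = new PriorityQueue<Cursor>(
        Math.max(1, segments.size()),
        new Comparator<Cursor>() {
          public int compare(Cursor a, Cursor b) {
            return a.current().compareTo(b.current());
          }
        });
    for (List<String> seg : segments) {
      if (!seg.isEmpty()) {
        pq.add(new Cursor(seg));
      }
    }
    List<String> merged = new ArrayList<String>();
    while (!pq.isEmpty()) {
      Cursor top = pq.poll();     // smallest current term across segments
      merged.add(top.current());  // every term flows through the pqueue,
      top.pos++;                  // which is where the CPU goes at scale
      if (top.pos < top.terms.size()) {
        pq.add(top);
      }
    }
    return merged;
  }
}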

> Also, it's tons of IO because for each merge it must read every single
> byte and write nearly every single byte, so that's ~2X bytes moved.
> Then, if you have more segments in your index than your mergeFactor,
> multiple such merges are needed and you're looking at, at least, 4X
> your index size in net bytes moved.  If you have CFS enabled, it's 8X
> the index size.
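
Running those multipliers against one of our ~150gb indexes: a single
merge pass is ~300gb moved (read everything, write it back), multiple
passes push that past ~600gb, and with CFS it's on the order of 1.2tb of
total IO.  No wonder it takes a while.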


Not to get too sidetracked, but this reminds me of another question I meant
to ask.  We use the compound format right now.  Our merge factor is
relatively low, so switching to the non-compound format would certainly be
possible without running into problems with too many open files.  Is there
any significant speed difference between the compound and non-compound formats for
indexing, searching, or merging?  (Searching for us would be the most
important by far.)
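
If we do try it, I'm assuming it's just a matter of turning off compound
files on the writer, something like this (2.4-era API, path hypothetical):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class NonCompoundWriter {
  public static IndexWriter open(String path) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory(path),
        new StandardAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    // Newly flushed/merged segments get separate per-extension files
    // instead of being packed into a single .cfs file.
    writer.setUseCompoundFile(false);
    return writer;
  }
}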


>   * If possible, make sure you always add the same fields to your
>    docs, in the same order (this results in consistent numbering of
>    field name -> number).  This is very much an unexpected
>    gotcha... the merging of stored fields and term vectors is much,
>    much faster if the field numbers are identical.  LUCENE-1737 is
>    open to fix Lucene so it consistently numbers automatically, but
>    it's somewhat tricky because many places in Lucene assume the
>    field names are densely packed.
>

We generally do this already, but some of our fields are nullable and so for
some documents the number-to-name mapping will be different.  Is there any
value in adding dummy values like "NULL" in these cases?  That presumably
adds overhead of its own, though.
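
Concretely, if we did pad the nullable fields, I imagine it would look
something like this (sketch against the 2.4-era Field API; the field
names are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ConsistentFields {
  // Always add the same fields in the same order, substituting an empty
  // value when a field would otherwise be null, so the field-name ->
  // field-number mapping is identical for every document.
  public static Document build(String title, String body, String author) {
    Document doc = new Document();
    doc.add(new Field("title", title,
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("body", body,
        Field.Store.NO, Field.Index.ANALYZED));
    // Added even when absent, so the numbering doesn't shift.
    doc.add(new Field("author", author == null ? "" : author,
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    return doc;
  }
}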

Thanks,
Chris
