lucene-dev mailing list archives

From "Steven Parkes" <>
Subject RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Fri, 23 Mar 2007 02:28:37 GMT
> But when these values have
> changed (or, segments were flushed by RAM not by maxBufferedDocs) then
> the way it computes level no longer results in the logarithmic policy
> that it's trying to implement, I think.

That's right. Parts of the implementation assume that the segments are
logarithmically related, and if they aren't, the result is effectively
undefined, in the sense that it does whatever it does. It doesn't try to
restore the logarithmic invariant.
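To make that concrete, here's a rough sketch (Python, for illustration only; the names and the mergeFactor of 10 are mine, not the actual IndexWriter code) of the kind of level computation I mean. Equal-size flushes all land on the same level, but RAM-based flushes of varying size scatter across levels:

```python
def level(size, merge_factor=10):
    # How many times merge_factor divides into the segment size; each
    # merge of merge_factor same-level segments bumps the result by one.
    lvl = 0
    while size >= merge_factor:
        size //= merge_factor
        lvl += 1
    return lvl

# Flushing by maxBufferedDocs gives uniform sizes, so uniform levels:
print([level(s) for s in [1000, 1000, 1000]])  # [3, 3, 3]

# Flushing by RAM gives whatever sizes happen to fall out, and the
# computed levels scatter -- the logarithmic invariant is gone:
print([level(s) for s in [800, 1500, 300]])    # [2, 3, 2]
```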

In some cases, where we know we're combining multiple indexes, we try to
restore the invariant, but that invariant is itself only weakly defined,
so the results aren't canonically defined either.

> Though now I have to go ponder
> KS's Fibonacci series approach!

Yeah, me too. I roughed out a version using the factored merge policy
... not that that means I completely understand the implications. I
still have to teach IndexWriter how to deal with non-contiguous merge
specs, which I didn't originally anticipate.

But a very interesting exercise.
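For what it's worth, my mental model of the Fibonacci idea is something like the following (a Python sketch under my own assumptions, not KS's actual code): keep segments in decreasing size order, and merge the trailing run as soon as its combined size catches up with the segment in front of it, so sizes grow Fibonacci-style rather than by a fixed mergeFactor:

```python
def fib_merge_candidates(sizes):
    """sizes: segment sizes, ordered largest-first (oldest to newest).
    Return the index i such that sizes[i:] should be merged, or None.
    Hypothetical rule: merge the trailing run whenever its combined size
    reaches the size of the segment just before it, so each merged
    segment is a bit bigger than its predecessor (Fibonacci-style)."""
    for i in range(len(sizes) - 1):
        if sum(sizes[i + 1:]) >= sizes[i]:
            return i + 1  # merge everything after segment i
    return None

print(fib_merge_candidates([100, 60, 50]))  # 1: merge [60, 50] -> 110
print(fib_merge_candidates([100, 50, 30]))  # None: nothing has caught up
```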

> I think for high turnover content the way old segments will be mostly
> deleted but at some point the merge will fully cascade (basically
> an optimize) and they will be cleaned up?

Well, as it stands, the only way to get rid of deleted docs at level n
is either to merge them into a level n+1 segment or to optimize. As n
gets big, the chances of ever reaching a level n+1 segment go down. So
those level n segments live a long time, a manual optimize
notwithstanding. Hence my question: do we see big segments that are
mostly sparse? And a merge there is one of those kinda bad things: if
you have two level n segments, one pretty sparse and the next fairly
full, you have to merge left to right, copying all of the full segment
just to recover the deleted space in the sparse one.
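To put numbers on the "kinda bad" part (hypothetical sizes, just illustrating the left-to-right constraint):

```python
# Two adjacent level-n segments: A is 90% deleted, B has no deletions.
live_a, deleted_a = 100, 900
live_b, deleted_b = 1000, 0

# A left-to-right merge copies every live doc from both segments...
copied = live_a + live_b            # 1100 docs copied
# ...just to reclaim the deleted space sitting in the sparse one:
reclaimed = deleted_a + deleted_b   # 900 docs reclaimed
print(copied, reclaimed)  # 1100 900
```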

> Hmmm.  What cases lead to this?

Well, it's kinda artificial, but combining indexes will do this. They're
combined in series, so the little segments of one index are followed by
the big segments of the next. So if you need to do some merging, then
given the current ordering constraint, you can end up merging a bunch of
little segments with one big segment, creating only a slightly bigger
segment. Similarly with lots of deleted documents ...

But I don't know if these cases are observed in the wild ...
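Concretely, something like this is what I have in mind (made-up sizes):

```python
# Two indexes concatenated in series: each runs big-to-small, so index
# A's small tail segments sit right next to index B's big head segment.
combined = [100_000, 10, 10, 90_000, 10, 10]

# The contiguity constraint means a merge window over A's tail drags in
# B's 90k-doc segment, producing a result only slightly bigger than it:
window = combined[1:4]   # [10, 10, 90_000]
print(sum(window))       # 90020
```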

> The only thing I wonder
> is whether reading from say 20 segments at once is slower (more
> than 2X slower) than reading from 10?  Or it could actually be
> faster (since OS/disk heads can better schedule IO access to a
> larger number of spots)?  I don't know.

Me neither. I can imagine that order-of-magnitude differences (binary or
otherwise) would show changes but that small changes like +/- one
wouldn't, so we might want to give ourselves that freedom.

Should be pretty easy to experiment with this stuff with the factored
merge, which I'll post Real Soon Now.

I haven't tried any performance testing. I can't remember: I need to
look and see whether the performance framework measures indexing as well
as searching.
