lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: More About NOT Optimizing
Date Wed, 07 Mar 2012 19:18:16 GMT
Maybe try TieredMergePolicy to see if it'd do any merges here...?

More responses below:

On Tue, Mar 6, 2012 at 8:00 PM, Paul Hill <> wrote:

> I have an index with 421163 documents (including body text)
> after running a test index for a couple of months with 3.4 code with the default LogByteSizeMergePolicy
> (with everything at defaults: Merge Factor 10, MinMergeMB = 1.6, MaxMergeMB = 2048).
> If I don't try the soon-to-be-deprecated (in 3.5) expungeDeletes() (which does
> reduce segments by two)
> and only try maybeMerge(), which seems to leave everything in place,
> I end up with 33 segments with the following sizes in MB (grouped in 1s, 10s, 100s and 1000s):
> 0.02,0.03,0.06,1.91,
> 11.19,12.76,15.89,15.98,21.35,24.67,25.61,25.63,30.11,30.90,31.55,31.66,32.52,33.22,36.11,37.14,43.37,
> 161.72,162.25,166.43,224.10,321.33,
> 2445.39,2679.24,2908.34,3727.49,3938.23,4044.89,5100.09
> (Note I got these values out of CheckIndex (no fix), so I have the documents and deleted
> documents for all segments if we need to talk about those values).
> At first glance that looks like a sensible distribution, but if the Merge Factor is 10,
> why do I have 17 files in the 10-99 range?  Should I not just have 10?

Maybe turn on IW's infoStream?  That should give you some details
about how the segments were assigned to levels...
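
As a rough illustration of the grouping in the question, the quoted sizes can be bucketed by order of magnitude. This is only a sketch of the "1s, 10s, 100s" grouping used above; it is not LogByteSizeMergePolicy's actual level assignment, which is what the infoStream output would show. The class and method names are made up for the example and are not part of Lucene.

```java
import java.util.Map;
import java.util.TreeMap;

public class SegmentBuckets {
    // Segment sizes in MB, copied from the CheckIndex figures quoted above.
    public static final double[] SIZES_MB = {
        0.02, 0.03, 0.06, 1.91,
        11.19, 12.76, 15.89, 15.98, 21.35, 24.67, 25.61, 25.63, 30.11,
        30.90, 31.55, 31.66, 32.52, 33.22, 36.11, 37.14, 43.37,
        161.72, 162.25, 166.43, 224.10, 321.33,
        2445.39, 2679.24, 2908.34, 3727.49, 3938.23, 4044.89, 5100.09
    };

    // Order of magnitude of a size in MB: 1.91 -> 0, 43.37 -> 1, 5100.09 -> 3.
    public static int magnitude(double sizeMB) {
        return (int) Math.floor(Math.log10(sizeMB));
    }

    // Counts how many segments fall into each power-of-ten bucket.
    public static Map<Integer, Integer> countByMagnitude(double[] sizesMB) {
        Map<Integer, Integer> counts = new TreeMap<Integer, Integer>();
        for (double size : sizesMB) {
            counts.merge(magnitude(size), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, Integer> e : countByMagnitude(SIZES_MB).entrySet()) {
            System.out.println("10^" + e.getKey() + " MB bucket: " + e.getValue() + " segments");
        }
    }
}
```

The 10-99 MB bucket comes out at 17 segments, matching the count in the question. Whether those 17 count as one level to the merge policy is a separate matter that this sketch does not answer.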

> The other problem is that I have two segments with lots of deleted documents,
> but I don't see what I would do to tidy them up.
> Docs      MB        Deleted Docs
> 8158      321.33    3075
> 210989    5100.09   158456
> The 8158 doc segment is not really that interesting.
> I'm assuming the biggest one is my original (over optimized) segment from months ago
> when running 3.0.1 code (even though it has been upgraded to 3.4).
> It has lots of deleted documents; I assume this is taking up some space.

Log*MergePolicy by default will discount this segment's size according
to the pctg deleted, ie its size will look like 5100.09 *
(210989-158456) / 210989, which should in general cause the segment to
be merged sooner.
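
To put numbers on that formula, here is a minimal sketch of the pro-rating arithmetic applied to the two segments from the table above. It only reproduces the expression in this message; the class and method names are hypothetical and it does not call into Lucene.

```java
public class DiscountedSize {
    // Segment size pro-rated by the fraction of documents that are still live,
    // i.e. sizeMB * (docCount - deletedDocs) / docCount.
    public static double discountedSizeMB(double sizeMB, int docCount, int deletedDocs) {
        double liveFraction = (double) (docCount - deletedDocs) / docCount;
        return sizeMB * liveFraction;
    }

    public static void main(String[] args) {
        // Figures taken from the table quoted above.
        System.out.printf("large segment: %.2f MB%n", discountedSizeMB(5100.09, 210989, 158456));
        System.out.printf("small segment: %.2f MB%n", discountedSizeMB(321.33, 8158, 3075));
    }
}
```

The large segment works out to roughly 1270 MB after the discount, so for sizing purposes it would look like a ~1000 MB segment rather than a ~5000 MB one.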

TieredMergePolicy is more aggressive about targeting deletes, I think.

> If I understand the algorithm correctly, the 1st time there is an opportunity to clean
> up all that old stuff (even if it doesn't affect speed too much) is when
> 1.       There are so many new documents that this largest segment would be cleaned
> up and combined into a larger 10,000 MB segment.  I'm not anticipating the end users generating
> 10-20x more files for a long time!
> 2.       There are so many deletes in this large segment that it could become part
> of merging the 100MB segments into a newly merged 1000MB segment.   I don't anticipate the
> end users replacing 90% of their original documents.

That's right (but look @ the infoStream output to see how many
segments are at the ~1000MB level)...

> Am I missing some feature of this algorithm or segments in general in which it takes
> a shrinking large segment (many many deletes, as in this case) and combines it with the next
> smaller size segments?
> What I'm looking at here is 1/3 of my index is deleted documents.  Need I not worry
> about that at all?  Is there no way to take the opportunity at some point to clean up the
> large segment of the oldest documents?

Well, you can expunge, or make your own MP.  But hopefully TMP is more
aggressive (you can tune how aggressive it is)...

> Speaking of taking the opportunity to clean up. What happens if I change something in
> my index, maybe a field storage or a norm calculation and I need to re-crawl everything?
> THEN MY INDEX WILL HAVE 100% REPLACES, so 50% of the index will be deleted documents.
> Is there something that would be nice to do to clean things up at that point?
> I think I'm willing to take the hit after I re-crawl, but it is not clear what that step
> might be given the new API?  ExpungeDeletes seemed like a reasonable candidate, but it goes
> away in 3.5.

Hmmm, expungeDeletes isn't going away; it's just being renamed to
forceMergeDeletes... but really you shouldn't have to call it.

> Am I missing some simple APIs or settings that I can use given one big old over-optimized
> segment and alternatively something to do once I've done a major recrawl?

I think a good question is whether you are really seeing performance
issues due to the 1/3 deleted-but-not-yet-reclaimed documents...

Mike McCandless
