lucene-java-user mailing list archives

From: Paul Hill <p...@metajure.com>
Subject: More About NOT Optimizing
Date: Wed, 07 Mar 2012 01:00:45 GMT
I'm running 3.4 code and have studied up on the API related to the optimize() replacements, and I understand I needn't worry about deleted documents. But I still want to ask a few things about keeping the index in good shape, and about merge policy.

I have an index with 421163 documents (including body text) after running a test index for a couple of months with 3.4 code and the default LogByteSizeMergePolicy (everything at defaults: merge factor 10, minMergeMB = 1.6, maxMergeMB = 2048). If I don't try the soon-to-be-deprecated (in 3.5) expungeDeletes() (which does reduce the segment count by two) and only try maybeMerge() (which seems to leave everything in place), I end up with 33 segments with the following sizes in MB (grouped into 1s, 10s, 100s, and 1000s):
0.02,0.03,0.06,1.91,
11.19,12.76,15.89,15.98,21.35,24.67,25.61,25.63,30.11,30.90,31.55,31.66,32.52,33.22,36.11,37.14,43.37,
161.72,162.25,166.43,224.10,321.33,
2445.39,2679.24,2908.34,3727.49,3938.23,4044.89,5100.09
(Note: I got these values out of CheckIndex (without -fix), so I also have the document and deleted-document counts for all segments if we need to talk about those values.)
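
For concreteness, here is roughly what the writer setup and the calls I tried look like. This is just a sketch; the directory path and analyzer below are placeholders, not my real ones.

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeExperiment {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index")); // placeholder

        // Everything at the LogByteSizeMergePolicy defaults I mentioned:
        // merge factor 10, minMergeMB 1.6, maxMergeMB 2048.
        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
        mp.setMergeFactor(10);
        mp.setMinMergeMB(1.6);
        mp.setMaxMergeMB(2048);

        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34,
                new StandardAnalyzer(Version.LUCENE_34));
        iwc.setMergePolicy(mp);
        IndexWriter writer = new IndexWriter(dir, iwc);

        // expungeDeletes() does drop the segment count by two,
        // but it is the call that gets deprecated in 3.5:
        // writer.expungeDeletes();

        // maybeMerge() seems to leave everything in place:
        writer.maybeMerge();

        writer.close();
    }
}

The segment sizes above came from running CheckIndex against the same directory, something like java -cp lucene-core-3.4.0.jar org.apache.lucene.index.CheckIndex /path/to/index (without -fix), with the jar name and path adjusted for my setup.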

At first glance that looks like a sensible distribution, but if the merge factor is 10, why do I have 17 segments in the 10-99 MB range? Shouldn't I have at most 10?
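
Just to quantify that expectation, here is the quick count I did of segments per size decade (this is only my naive reading of what a merge factor of 10 should mean, not necessarily how LogByteSizeMergePolicy actually buckets segments):

public class SegmentBuckets {
    public static void main(String[] args) {
        // The 33 segment sizes in MB reported by CheckIndex.
        double[] sizesMB = {
            0.02, 0.03, 0.06, 1.91,
            11.19, 12.76, 15.89, 15.98, 21.35, 24.67, 25.61, 25.63, 30.11,
            30.90, 31.55, 31.66, 32.52, 33.22, 36.11, 37.14, 43.37,
            161.72, 162.25, 166.43, 224.10, 321.33,
            2445.39, 2679.24, 2908.34, 3727.49, 3938.23, 4044.89, 5100.09
        };
        // Buckets: <1 MB, 1-9, 10-99, 100-999, >=1000 MB.
        int[] perDecade = new int[5];
        for (double mb : sizesMB) {
            int bucket = mb < 1.0 ? 0 : (int) Math.floor(Math.log10(mb)) + 1;
            perDecade[Math.min(bucket, 4)]++;
        }
        // Prints [3, 1, 17, 5, 7]: the 10-99 MB decade holds 17 segments,
        // well past the 10 I would naively expect before a merge kicks in.
        System.out.println(java.util.Arrays.toString(perDecade));
    }
}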

The other problem is that I have two segments with lots of deleted documents, and I don't see what I would do to tidy them up:

Docs     Size (MB)   Deleted docs
8158     321.33      3075
210989   5100.09     158456

The 8158-doc segment is not really that interesting.

I'm assuming the biggest one is my original (over-optimized) segment from months ago, when I was running 3.0.1 code (even though it has since been upgraded to 3.4). It has lots of deleted documents, which I assume are taking up space.

If I understand the algorithm correctly, the first time there is an opportunity to clean up all that old stuff (even if it doesn't affect speed too much) is when:

1. There are so many new documents that this largest segment gets cleaned up and combined into a larger ~10,000 MB segment. I'm not anticipating the end users generating 10-20x more files for a long time!

2. There are so many deletes in this large segment that it could become part of merging the 100 MB segments into a newly merged 1000 MB segment. I don't anticipate the end users replacing 90% of their original documents. (Some rough arithmetic on this case is below.)
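
To put rough numbers on case 2, here is my back-of-the-envelope arithmetic on the big segment, under the assumption that the policy effectively sizes a segment by its live documents (I believe LogMergePolicy has a calibrateSizeByDeletes setting for this, but I have not verified how my writer is configured):

public class BigSegmentMath {
    public static void main(String[] args) {
        double sizeMB = 5100.09;  // segment size reported by CheckIndex
        int docCount  = 210989;   // docs in the segment
        int deleted   = 158456;   // deleted docs in the segment

        double liveFraction = (docCount - deleted) / (double) docCount;
        double effectiveMB  = sizeMB * liveFraction;

        // About 0.25 of the docs are live, so the segment still "weighs"
        // roughly 1270 MB -- nowhere near the ~100 MB tier it would have to
        // shrink to before it gets swept into a merge with the smaller segments.
        System.out.printf("live fraction = %.2f, effective size = %.0f MB%n",
                liveFraction, effectiveMB);
    }
}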


Am I missing some feature of this algorithm, or of segments in general, by which it takes a shrinking large segment (many, many deletes, as in this case) and combines it with the next smaller segments? What I'm looking at here is that roughly a third of my index is deleted documents. Need I not worry about that at all? Is there no way to take the opportunity at some point to clean up the large segment of the oldest documents?

Speaking of taking the opportunity to clean up: what happens if I change something in my index, maybe a field's storage or a norm calculation, and I need to re-crawl everything? Then my index will have 100% replaces, so 50% of the index will be deleted documents. Is there something that would be nice to do to clean things up at that point? I think I'm willing to take the hit after I re-crawl, but it is not clear what that step should be given the new API. expungeDeletes() seemed like a reasonable candidate, but it is deprecated in 3.5.
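
For what it's worth, the step I have in mind after such a re-crawl is roughly the following, but I don't know whether this is the intended long-term answer (expungeDeletes() appears to be renamed forceMergeDeletes() in 3.5):

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;

public class PostRecrawlCleanup {
    // Force merges of the segments that contain deletions after a
    // 100%-replace re-crawl. The no-arg expungeDeletes() waits for the
    // resulting merges to finish, as I understand it.
    static void cleanUpAfterRecrawl(IndexWriter writer) throws IOException {
        writer.expungeDeletes();
        writer.commit();
    }
}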
Am I missing some simple APIs or settings that I can use, given one big old over-optimized segment, and alternatively something to do once I've done a major re-crawl?

-Paul