lucene-java-user mailing list archives

From v.se...@lombardodier.com
Subject Re: index bigger than it should be?
Date Sun, 30 Oct 2011 09:01:05 GMT
Hi,

I did the following on the existing index:
 - expunge deletes
 - optimize(5)
 - check index

then from the existing index I exported all docs into a new one, then on 
the new one I did:
 - optimize(5)
 - check index

the entire log is in http://dl.dropbox.com/u/47469698/lucene/index.txt

during the export, I also monitored the size on disk at each chunk of 
100000 docs added to the new index:
http://dl.dropbox.com/u/47469698/lucene/index.xls

what I found was that the index took around 2400 Mb/million docs 
almost all the time, occasionally rising a little (<3500) for a short 
period. this held until around 28 million docs, where the size on disk 
increased a lot (4500 Mb/million docs = 135 Gb on disk) and stayed high 
until the end of the export (my index contains 32 million docs). at the 
end, the optimize brought the space on disk down from 134 Gb to 91 Gb. 
but even at 91 Gb for 32 million docs, that is still close to 3000 
Mb/million docs, far more than the 2400 I was seeing most of the time.
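
The per-chunk measurement above can be sketched with plain java.io, independent of Lucene. This is only an illustrative sketch, not the code used for the export: the class IndexSizeProbe and its method names are hypothetical, and it assumes 1 Mb = 1024 * 1024 bytes and that all index files sit directly in one directory.

```java
import java.io.File;

/** Hypothetical helper: on-disk index size expressed per million documents. */
class IndexSizeProbe {

    /** Sums the sizes of the regular files directly inside dir (no recursion). */
    static long directorySizeBytes(File dir) {
        File[] files = dir.listFiles();
        if (files == null) {
            return 0L; // not a directory, or not readable
        }
        long total = 0L;
        for (File f : files) {
            if (f.isFile()) {
                total += f.length(); // 0 if a merge removed the file meanwhile
            }
        }
        return total;
    }

    /** Converts bytes and a doc count into Mb per million docs. */
    static double mbPerMillionDocs(long sizeBytes, long docCount) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        double mb = sizeBytes / (1024.0 * 1024.0);
        return mb / (docCount / 1000000.0);
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: IndexSizeProbe <indexDir> <docCount>");
            return;
        }
        long size = directorySizeBytes(new File(args[0]));
        long docCount = Long.parseLong(args[1]);
        System.out.printf("%d docs, %d bytes, %.0f Mb/million docs%n",
                docCount, size, mbPerMillionDocs(size, docCount));
    }
}
```

With 1 Gb = 1024 Mb, 91 Gb for 32 million docs works out to 2912 Mb/million docs, which is consistent with the roughly 3000 figure quoted above.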

I understand that merges happen. what surprised me was that the growth 
between 28 and 32 million docs was much larger in scale than for the 
earlier merges, and that even an optimize did not fully undo it.
did I reach a limit? should I keep the index at 25 million docs to avoid 
this behavior?

I am using lucene 3.4 with the tiered merge policy and all the fields are 
stored.

thanks,


Vincent Sevel

From: Ian Lea <ian.lea@gmail.com>
Date: 27.10.2011 15:28
To: java-user@lucene.apache.org
Subject: Re: index bigger than it should be?

There's org.apache.lucene.index.CheckIndex which will report assorted
stats about the index, as well as checking it for correctness.  It can
fix it too but you don't need that.  I hope. Will take quite a while
to run on a large index.

What version of lucene?  Does a before/after (or large/small)
directory listing give any clues?
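
One way to produce a comparable before/after listing is to total the file sizes per extension. The following is a minimal sketch using only the JDK; the class IndexListing and its methods are hypothetical names, not part of Lucene. In a Lucene 3.x index without compound files, .fdt holds stored field data and .frq/.prx hold postings, so a jump in one extension hints at where the space went; with compound files most bytes will show up under .cfs.

```java
import java.io.File;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: totals index file sizes per file extension. */
class IndexListing {

    /** Returns the text after the last dot, or "(none)" for names like segments_2. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "(none)" : fileName.substring(dot + 1);
    }

    /** Maps each extension to the total bytes of the files carrying it, sorted by extension. */
    static Map<String, Long> bytesByExtension(File dir) {
        Map<String, Long> totals = new TreeMap<String, Long>();
        File[] files = dir.listFiles();
        if (files == null) {
            return totals; // not a directory, or not readable
        }
        for (File f : files) {
            if (!f.isFile()) {
                continue;
            }
            String ext = extensionOf(f.getName());
            Long previous = totals.get(ext);
            totals.put(ext, (previous == null ? 0L : previous) + f.length());
        }
        return totals;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: IndexListing <indexDir>");
            return;
        }
        for (Map.Entry<String, Long> e : bytesByExtension(new File(args[0])).entrySet()) {
            System.out.printf("%-8s %d bytes%n", e.getKey(), e.getValue());
        }
    }
}
```

Running it against the large and the small index and comparing the two outputs side by side shows which file types account for the difference.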


--
Ian.


On Thu, Oct 27, 2011 at 12:44 PM,  <v.sevel@lombardodier.com> wrote:
> Hi,
>
> I have an application that has an index with 30 million docs in it. every
> day, I add around 1 million docs, and I remove the oldest 1 million, to
> keep it stable at 30 million.
> for the most part doc fields are indexed and stored. each doc weighs
> from a few Kb to 1 Mb (a few Mb in some cases).
> I used to be able to maintain the index at around 60 Gb on disk. but
> recently the index has tended to keep growing (90 Gb). I can see
> that the expunge is doing what it should do, because after it executes,
> the size on disk does go down, but never as low as the previous day. from
> the outside, it looks like a leak, but since I do not remove the docs I
> added during the day, it might be that the new docs are just bigger than
> the old ones. still I am surprised by the increase.
>
> are there any tools to dig into the index structure and help justify the
> space taken on disk?
> I was thinking about something that would help identify terms that take up
> the most space, or some sort of dump that I could compare from one day to
> the other.
>
> any help appreciated,
>
> thanks,
>
> vince





