lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: index bigger than it should be?
Date Mon, 31 Oct 2011 09:48:20 GMT
Do the individual docs get bigger after 28 million?  Can you try
loading the last few million docs, from when the size jumps, and see
what happens?  Or load them in reverse order or something, again to
see what happens?

I don't have indexes with that many docs, but I believe that plenty of
people do.


--
Ian.
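
Following up on the directory-listing question in the quoted message below: one way to compare a "large" and a "small" index is to total the file sizes per extension in each index directory. The sketch below is only an illustration and uses nothing from Lucene itself; the class and method names are made up. In a Lucene 3.x index, .fdt/.fdx hold stored fields, .tis/.tii the term dictionary, and .frq/.prx the postings, so with every field stored one would expect .fdt to dominate, but that is an assumption to verify against the real listing.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: summarises an index directory by file extension.
public class IndexDirSizes {

    /** Sums the sizes of the regular files directly under dir, grouped by extension. */
    static Map<String, Long> bytesByExtension(Path dir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip sub-directories and special files
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Files without a dot (e.g. segments_N) are grouped under "(none)".
                String ext = dot >= 0 ? name.substring(dot + 1) : "(none)";
                totals.merge(ext, Files.size(file), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // The index path is taken from the command line; "." is just a placeholder default.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Long> e : bytesByExtension(dir).entrySet()) {
            System.out.printf("%-8s %,15d bytes%n", e.getKey(), e.getValue());
        }
    }
}
```

Running it once against each index and comparing the two outputs should show which kind of file accounts for the extra space.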


On Sun, Oct 30, 2011 at 9:01 AM,  <v.sevel@lombardodier.com> wrote:
> Hi,
>
> I did the following on the existing index:
>  - expunge deletes
>  - optimize(5)
>  - check index
>
> then from the existing index I exported all docs into a new one, then on
> the new one I did:
>  - optimize(5)
>  - check index
>
> the entire log is in http://dl.dropbox.com/u/47469698/lucene/index.txt
>
> during the export, I also monitored the size on disk at each chunk of
> 100000 docs added to the new index:
> http://dl.dropbox.com/u/47469698/lucene/index.xls
>
> what I found was that the index was taking around 2400 Mb/million docs
> almost all the time, and from time to time it would take a little bit more
> (<3500) during a short period of time. this stays true until around 28
> million docs where the size on disk increases a lot (4500 Mb/million docs
> = 135 Gb on disk) until the end of the export (my index contains 32
> million docs). at the end the space on disk went from 134 Gb to 91 Gb
> thanks to the optimize. but even at 91 Gb for 32 million docs, it is
> still around 3000 Mb/million docs, far more than the 2400 I was seeing
> most of the time.
>
> I understand that merges happen, what I was surprised about was that the
> growth between 28 and 32 million docs was a lot bigger in scale than with
> the other merges before, and even an optimize would not solve this entirely.
> did I reach a limit? should I maintain the index at 25 million docs to
> avoid this behavior?
>
> I am using lucene 3.4 with the tiered merge policy and all the fields are
> stored.
>
> thanks,
>
>
> Vincent Sevel
>
> Ian Lea <ian.lea@gmail.com>
> Sent by: java-user-return-51136-v.sevel=lombardodier.com@lucene.apache.org
> 27.10.2011 15:28
> Please respond to: java-user@lucene.apache.org
> To: java-user@lucene.apache.org
> cc:
> Subject: Re: index bigger than it should be?
>
> There's org.apache.lucene.index.CheckIndex which will report assorted
> stats about the index, as well as checking it for correctness.  It can
> fix it too but you don't need that.  I hope. Will take quite a while
> to run on a large index.
>
> What version of lucene?  Does a before/after (or large/small)
> directory listing give any clues?
>
>
> --
> Ian.
>
>
> On Thu, Oct 27, 2011 at 12:44 PM,  <v.sevel@lombardodier.com> wrote:
>> Hi,
>>
>> I have an application that has an index with 30 million docs in it. every
>> day, I add around 1 million docs, and I remove the oldest 1 million, to
>> keep it stable at 30 million.
>> for the most part doc fields are indexed and stored. each doc weighs
>> from a few Kb to 1 Mb (a few Mb in some cases).
>> I used to be able to maintain the index at around 60 Gb on disk. but
>> recently the index has had a tendency to keep growing (90 Gb). I can see
>> that the expunge is doing what it should do, because after it executes,
>> the size on disk does go down, but never as low as the previous day. from
>> the outside, it looks like a leak, but since I do not remove the docs I
>> added during the day, it might be that the new docs are just bigger than
>> the old ones. still I am surprised by the increase.
>>
>> are there any tools to dig into the index structure and help justify the
>> space taken on disk?
>> I was thinking about something that would help identify terms that take up
>> the most space, or some sort of dump that I could compare from one day to
>> the other.
>>
>> any help appreciated,
>>
>> thanks,
>>
>> vince
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
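
The size figures quoted in the thread can be cross-checked with a little arithmetic. This is a minimal sketch, assuming 1 Gb = 1000 Mb (which is consistent with the quoted "4500 Mb/million docs = 135 Gb" at roughly 30 million docs); the class and method names are hypothetical.

```java
// Hypothetical helper for checking the numbers reported in the thread.
public class IndexSizeMath {

    /** Average size in Mb per million docs, assuming 1 Gb = 1000 Mb. */
    static double mbPerMillionDocs(double sizeGb, double millionDocs) {
        return sizeGb * 1000.0 / millionDocs;
    }

    /** Index size in Gb that a given Mb-per-million-docs rate would produce. */
    static double expectedGb(double mbPerMillion, double millionDocs) {
        return mbPerMillion * millionDocs / 1000.0;
    }

    public static void main(String[] args) {
        // Figures taken from the quoted message: 134 Gb before and 91 Gb after
        // the optimize, for 32 million docs, against a typical 2400 Mb/million.
        System.out.println("before optimize: " + mbPerMillionDocs(134, 32) + " Mb/million docs");
        System.out.println("after optimize:  " + mbPerMillionDocs(91, 32) + " Mb/million docs");
        System.out.println("at 2400 Mb/million, 32 million docs: " + expectedGb(2400, 32) + " Gb");
    }
}
```

Under that assumption, 91 Gb over 32 million docs is about 2844 Mb per million docs, whereas the usual rate of 2400 Mb per million docs would give about 76.8 Gb, so roughly 14 Gb of the optimized index is unaccounted for by the earlier trend.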

