lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: optimize with num segments > 1 index keeps growing
Date Mon, 12 Sep 2011 16:18:54 GMT
Hmm... are you using IndexReader.numDeletedDocs to check?

Did you commit from the writer and then reopen the IndexReader before
calling .numDeletedDocs?  Else the reader won't see the change.
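
A note on why the reopen matters: a reader is a point-in-time view of the
index, so a reader opened before the expunge and commit keeps reporting the
old deletion count. The sketch below is a toy model in plain Java, not Lucene
code. ToyWriter, ToyReader and the count of 1000 deleted docs are made up for
illustration, and the toy expunge simply drops every deletion, which a real
merge policy may not do.

```java
// Toy model only: ToyWriter and ToyReader are hypothetical, NOT Lucene classes.
class StaleReaderDemo {

    static final class ToyWriter {
        private int workingDeleted;    // state visible to the writer only
        private int committedDeleted;  // state visible to newly opened readers

        ToyWriter(int deletedDocs) {
            this.workingDeleted = deletedDocs;
            this.committedDeleted = deletedDocs;
        }

        // Assumption for the toy: expunging removes every deleted doc.
        void expungeDeletes() {
            workingDeleted = 0;
        }

        // Publish the writer's state so that new readers can see it.
        void commit() {
            committedDeleted = workingDeleted;
        }

        // A reader captures the committed state at the moment it is opened.
        ToyReader openReader() {
            return new ToyReader(committedDeleted);
        }
    }

    static final class ToyReader {
        private final int deletedAtOpen;

        ToyReader(int deletedAtOpen) {
            this.deletedAtOpen = deletedAtOpen;
        }

        int numDeletedDocs() {
            return deletedAtOpen;
        }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter(1000);   // arbitrary starting count
        ToyReader stale = writer.openReader();    // opened before the expunge

        writer.expungeDeletes();
        writer.commit();
        ToyReader fresh = writer.openReader();    // opened after the commit

        System.out.println("stale reader: " + stale.numDeletedDocs()); // stale reader: 1000
        System.out.println("fresh reader: " + fresh.numDeletedDocs()); // fresh reader: 0
    }
}
```

In Lucene 3.x terms the matching sequence would be IndexWriter.commit(),
then IndexReader.reopen() (or opening a new reader), then numDeletedDocs()
on the reader that comes back.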

Mike McCandless

http://blog.mikemccandless.com

On Sat, Sep 10, 2011 at 11:58 PM,  <v.sevel@lombardodier.com> wrote:
> Hi, even with setExpungeDeletesPctAllowed(0.0), I could not get the
> deleted docs removed from disk.
> after the expunge+commit I print numDeletedDocs again, and it stays
> the same.
> regards,
> vincent
>
> Michael McCandless <lucene@mikemccandless.com> wrote on 09.09.2011 20:53:
>
> TieredMergePolicy by default will only merge a segment if it has > 10%
> deletions.
>
> Can you try calling .setExpungeDeletesPctAllowed(0.0) and then expunge
> again?
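
The threshold rule above can be sketched as a small filter. This is only a
model of the rule as described in this message, not TieredMergePolicy's
actual code. The Seg class, the segment names and the doc counts are invented
for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Model only: Seg is a hypothetical summary of a segment, not a Lucene class.
class ExpungeThresholdSketch {

    static final class Seg {
        final String name;
        final int maxDoc;    // docs in the segment, including deleted ones
        final int delCount;  // docs marked as deleted

        Seg(String name, int maxDoc, int delCount) {
            this.name = name;
            this.maxDoc = maxDoc;
            this.delCount = delCount;
        }

        double pctDeleted() {
            return 100.0 * delCount / maxDoc;
        }
    }

    // Names of the segments whose deletion percentage is strictly above pctAllowed.
    static List<String> eligible(List<Seg> segs, double pctAllowed) {
        List<String> names = new ArrayList<>();
        for (Seg s : segs) {
            if (s.pctDeleted() > pctAllowed) {
                names.add(s.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up numbers: one big segment with few deletes, one small with many.
        List<Seg> segs = Arrays.asList(
            new Seg("large", 10_000_000, 300_000),  // 3% deleted
            new Seg("small", 100_000, 20_000),      // 20% deleted
            new Seg("clean", 50_000, 0));           // nothing deleted

        System.out.println(eligible(segs, 10.0)); // [small]
        System.out.println(eligible(segs, 0.0));  // [large, small]
    }
}
```

With a 10% threshold the large segment in this made-up example is skipped
even though it holds most of the deleted docs. With 0.0 every segment that
has at least one deletion qualifies.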
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Sep 9, 2011 at 1:41 PM,  <v.sevel@lombardodier.com> wrote:
>> Hi,
>>
>> this post is quite old, but I would like to share some recent
>> developments.
>>
>> I applied the recommendation. my process became: expunge deletes and
>> optimize to 2 segments.
>>
>> at the time I was on lucene 3.1 and that solved my issue. recently I
>> moved to lucene 3.3, and I tried playing with the new tiered merge
>> policy. what I found was that after an expunge, the number of deleted
>> docs would stay the same, and space would not be reclaimed on the disk.
>> I switched back to the default merge policy (LogByteSizeMergePolicy:
>> minMergeSize=1677721, mergeFactor=10, maxMergeSize=2147483648,
>> maxMergeSizeForOptimize=9223372036854775807, calibrateSizeByDeletes=true,
>> maxMergeDocs=2147483647, useCompoundFile=true, noCFSRatio=0.1) and this
>> time got the right behavior: size was reclaimed on disk. I even tried
>> with the BalancedSegmentMergePolicy and again got the right behavior.
>>
>> so this issue seems to affect only the tiered merge policy.
>>
>> to illustrate this, I took an index with many deleted docs then
>> expunged/optimized while using the tiered policy, then did the same
>> thing with a default merge policy. here is for each step the content of
>> the directory:
>>
>> before:
>>
>> 09.09.2011  17:38                20 segments.gen
>> 09.09.2011  17:38             5'335 segments_4bf1u
>> 06.09.2011  15:27                 0 write.lock
>> 06.09.2011  00:49    31'681'157'794 _jhwld.fdt
>> 06.09.2011  00:49       115'562'268 _jhwld.fdx
>> 06.09.2011  00:37             5'347 _jhwld.fnm
>> 06.09.2011  01:13     7'147'947'472 _jhwld.frq
>> 06.09.2011  01:13     3'927'649'164 _jhwld.prx
>> 06.09.2011  01:13        41'992'760 _jhwld.tii
>> 06.09.2011  01:13     3'745'729'056 _jhwld.tis
>> 09.09.2011  00:27         1'805'669 _jhwld_3.del
>> 09.09.2011  00:31    11'397'619'448 _jtrwg.fdt
>> 09.09.2011  00:31        98'393'316 _jtrwg.fdx
>> 09.09.2011  00:27             5'347 _jtrwg.fnm
>> 09.09.2011  00:47     5'146'273'732 _jtrwg.frq
>> 09.09.2011  00:47     1'661'436'146 _jtrwg.prx
>> 09.09.2011  00:47        23'950'194 _jtrwg.tii
>> 09.09.2011  00:47     2'139'903'139 _jtrwg.tis
>> 09.09.2011  07:39        94'471'867 _jugaa.cfs
>> 09.09.2011  10:14       252'716'611 _juok2.cfs
>> 09.09.2011  15:45         7'986'102 _jwuaq.cfs
>> 09.09.2011  16:00         5'780'703 _jx45g.cfs
>> 09.09.2011  16:00       333'981'384 _jx46a.cfs
>> 09.09.2011  16:23        20'955'761 _jxge0.cfs
>> 09.09.2011  16:46        19'258'025 _jxmas.cfs
>> 09.09.2011  16:55        16'622'800 _jxpv4.cfs
>> 09.09.2011  17:10        14'605'028 _jxvd6.cfs
>> 09.09.2011  17:34        12'456'476 _jy28o.cfs
>> 09.09.2011  17:38         2'584'950 _jy91y.cfs
>> 09.09.2011  17:38         2'595'049 _jy92i.cfs
>> 09.09.2011  17:38         2'600'991 _jy932.cfs
>> 09.09.2011  17:38         2'610'278 _jy93m.cfs
>> 09.09.2011  17:38            46'664 _jy93x.cfs
>> 09.09.2011  17:38             9'765 _jy93y.cfs
>> 09.09.2011  17:38            10'691 _jy93z.cfs
>> 09.09.2011  17:38             9'533 _jy940.cfs
>> 09.09.2011  17:38            11'684 _jy941.cfs
>> 09.09.2011  17:38             8'996 _jy942.cfs
>>              38 File(s) 67'918'759'565 bytes
>>
>>
>> after expunge/optimize (tiered merge policy):
>>
>> 09.09.2011  18:02                20 segments.gen
>> 09.09.2011  18:02             3'171 segments_4bf3g
>> 06.09.2011  15:27                 0 write.lock
>> 06.09.2011  00:49    31'681'157'794 _jhwld.fdt
>> 06.09.2011  00:49       115'562'268 _jhwld.fdx
>> 06.09.2011  00:37             5'347 _jhwld.fnm
>> 06.09.2011  01:13     7'147'947'472 _jhwld.frq
>> 06.09.2011  01:13     3'927'649'164 _jhwld.prx
>> 06.09.2011  01:13        41'992'760 _jhwld.tii
>> 06.09.2011  01:13     3'745'729'056 _jhwld.tis
>> 09.09.2011  17:39         1'805'669 _jhwld_4.del
>> 09.09.2011  17:45    11'814'367'373 _jy9iy.fdt
>> 09.09.2011  17:45       101'565'036 _jy9iy.fdx
>> 09.09.2011  17:39             5'347 _jy9iy.fnm
>> 09.09.2011  18:01     5'328'530'169 _jy9iy.frq
>> 09.09.2011  18:01     1'733'490'572 _jy9iy.prx
>> 09.09.2011  18:01        25'072'713 _jy9iy.tii
>> 09.09.2011  18:01     2'239'702'399 _jy9iy.tis
>> 09.09.2011  18:02           185'962 _jy9mv.cfs
>> 09.09.2011  18:02             9'955 _jy9mw.cfs
>> 09.09.2011  18:02            10'380 _jy9mx.cfs
>> 09.09.2011  18:02             9'341 _jy9my.cfs
>> 09.09.2011  18:02             9'228 _jy9mz.cfs
>> 09.09.2011  18:02            10'382 _jy9n0.cfs
>> 09.09.2011  18:02             9'345 _jy9n1.cfs
>> 09.09.2011  18:02             9'231 _jy9n2.cfs
>> 09.09.2011  18:02             8'961 _jy9n3.cfs
>> 09.09.2011  18:02            10'381 _jy9n4.cfs
>> 09.09.2011  18:02           199'651 _jy9n5.cfs
>> 09.09.2011  18:02             9'345 _jy9n6.cfs
>> 09.09.2011  18:02             9'230 _jy9n7.cfs
>>              31 File(s) 67'905'077'722 bytes
>>
>> after expungeDeletes/optimize with default merge policy :
>>
>> 09.09.2011  19:31                20 segments.gen
>> 09.09.2011  19:31             2'081 segments_4bfpe
>> 09.09.2011  18:13                 0 write.lock
>> 09.09.2011  18:42    30'133'772'814 _jyb4c.fdt
>> 09.09.2011  18:42       103'164'812 _jyb4c.fdx
>> 09.09.2011  18:27             5'347 _jyb4c.fnm
>> 09.09.2011  19:03     6'474'023'590 _jyb4c.frq
>> 09.09.2011  19:03     3'699'406'141 _jyb4c.prx
>> 09.09.2011  19:03        37'900'657 _jyb4c.tii
>> 09.09.2011  19:03     3'380'266'875 _jyb4c.tis
>> 09.09.2011  19:15    11'820'477'088 _jyb4e.fdt
>> 09.09.2011  19:15       101'659'700 _jyb4e.fdx
>> 09.09.2011  19:03             5'347 _jyb4e.fnm
>> 09.09.2011  19:29     5'333'219'797 _jyb4e.frq
>> 09.09.2011  19:29     1'734'633'179 _jyb4e.prx
>> 09.09.2011  19:29        25'105'023 _jyb4e.tii
>> 09.09.2011  19:29     2'242'558'333 _jyb4e.tis
>> 09.09.2011  19:31           223'600 _jyb5t.cfs
>> 09.09.2011  19:31             9'545 _jyb5u.cfs
>> 09.09.2011  19:31             8'963 _jyb5v.cfs
>> 09.09.2011  19:31             9'250 _jyb5w.cfs
>> 09.09.2011  19:31             9'047 _jyb5x.cfs
>> 09.09.2011  19:31            11'253 _jyb5y.cfs
>> 09.09.2011  19:31            11'239 _jyb5z.cfs
>>              24 File(s) 65'086'483'701 bytes
>>
>> any clue as to what is happening?
>>
>> thanks,
>>
>>
>> Vincent
>>
>> "Uwe Schindler" <uwe@thetaphi.de> wrote on 21.07.2011 22:46:
>>
>> There is also expungeDeletes()...
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>>> -----Original Message-----
>>> From: v.sevel@lombardodier.com [mailto:v.sevel@lombardodier.com]
>>> Sent: Thursday, July 21, 2011 8:39 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: optimize with num segments > 1 index keeps growing
>>>
>>> Hi, thanks for this explanation.
>>> so what is the best solution: merge the large segment (how can I do
>>> that?) or work with many segments (10?) so that I avoid having this
>>> "large segment" issue?
>>> thanks,
>>> vince
>>>
>>>
>>> Vincent Sevel
>>> Lombard Odier Darier Hentsch & Cie
>>> 11, rue de la Corraterie - 1204 Genève - Suisse
>>> T +41 22 709 3376 - F +41 22 709 3782
>>> www.lombardodier.com
>>>
>>> Simon Willnauer <simon.willnauer@googlemail.com> wrote on 21.07.2011 20:06:
>>>
>>> so the problem here is that you have one really big segment _52aho.*
>>> and several smaller ones _7e0wz.*, _7e0xu.*, _7e1x5.* ....
>>> if you optimize to 2 segments all the smaller segments are merged into
>>> one but the large segment remains untouched. This means that all
>>> deleted documents in the large segment are not removed / freed, while
>>> if you optimize to one segment they are removed. In the single-segment
>>> index there is no *.del file present, meaning no deletes. Unless you
>>> merge the large segment, all your deleted documents are only marked as
>>> deleted but not yet removed.
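
The effect described above can be modelled in a few lines. This is a
deliberately simplified sketch, not Lucene's merge logic: it assumes a
partial optimize leaves the largest segments untouched and merges the rest
into one, and that a merged segment carries no deletions. The Seg class and
all doc counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model only: not how Lucene actually selects merges.
class PartialOptimizeSketch {

    static final class Seg {
        final long liveDocs;
        final long deletedDocs;  // marked as deleted, still on disk

        Seg(long liveDocs, long deletedDocs) {
            this.liveDocs = liveDocs;
            this.deletedDocs = deletedDocs;
        }

        long maxDoc() {
            return liveDocs + deletedDocs;
        }
    }

    // Keep the (maxSegments - 1) largest segments as they are and merge all
    // remaining segments into one new segment without deletions.
    static List<Seg> optimize(List<Seg> segs, int maxSegments) {
        if (maxSegments < 1) {
            throw new IllegalArgumentException("maxSegments must be >= 1");
        }
        List<Seg> sorted = new ArrayList<>(segs);
        sorted.sort((a, b) -> Long.compare(b.maxDoc(), a.maxDoc())); // largest first
        if (sorted.isEmpty() || (maxSegments > 1 && sorted.size() <= maxSegments)) {
            return sorted; // nothing to merge in this model
        }
        int keep = maxSegments - 1;
        List<Seg> result = new ArrayList<>(sorted.subList(0, keep));
        long live = 0;
        for (Seg s : sorted.subList(keep, sorted.size())) {
            live += s.liveDocs; // deleted docs are dropped by the merge
        }
        result.add(new Seg(live, 0));
        return result;
    }

    static long totalDeleted(List<Seg> segs) {
        long total = 0;
        for (Seg s : segs) {
            total += s.deletedDocs;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up counts: one large segment plus a few small ones.
        List<Seg> segs = Arrays.asList(
            new Seg(14_000_000, 300_000),
            new Seg(400_000, 10_000),
            new Seg(90_000, 4_000),
            new Seg(5_000, 700));

        System.out.println(totalDeleted(optimize(segs, 2))); // 300000
        System.out.println(totalDeleted(optimize(segs, 1))); // 0
    }
}
```

In this model the 2-segment optimize only clears the deletions held by the
small segments. The deletions in the large segment stay on disk until that
segment itself takes part in a merge.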
>>>
>>> simon
>>>
>>> On Thu, Jul 21, 2011 at 5:50 PM,  <v.sevel@lombardodier.com> wrote:
>>> > hi,
>>> > closing after the 2 segments optimize does not change it.
>>> > also I am running with lucene 3.1.0.
>>> > cheers,
>>> > vince
>>> >
>>> > Ian Lea <ian.lea@gmail.com> wrote on 21.07.2011 17:30:
>>> >
>>> > A write.lock file with timestamp of 13:58 is in all the listings. The
>>> > first thing I'd try is to add some IndexWriter.close() calls.
>>> >
>>> >
>>> > --
>>> > Ian.
>>> >
>>> >
>>> >
>>> > On Thu, Jul 21, 2011 at 4:05 PM,  <v.sevel@lombardodier.com> wrote:
>>> >> Hi,
>>> >>
>>> >> here is a concrete example.
>>> >>
>>> >> I am starting with an index that has 19017236 docs, which takes 58989
>>> >> Mb on disk:
>>> >>
>>> >> 21.07.2011 15:21                20 segments.gen
>>> >> 21.07.2011 15:21             2'974 segments_2acy4
>>> >> 21.07.2011 13:58                 0 write.lock
>>> >> 16.07.2011  02:21    33'445'798'886 _52aho.fdt
>>> >> 16.07.2011  02:21       178'723'932 _52aho.fdx
>>> >> 16.07.2011  01:58             5'002 _52aho.fnm
>>> >> 16.07.2011  03:10     9'857'410'889 _52aho.frq
>>> >> 16.07.2011  03:10     4'538'234'846 _52aho.prx
>>> >> 16.07.2011  03:10        61'581'767 _52aho.tii
>>> >> 16.07.2011  03:10     5'505'039'790 _52aho.tis
>>> >> 21.07.2011 01:01         1'899'536 _52aho_5.del
>>> >> 21.07.2011 01:05     4'222'206'034 _6t61z.fdt
>>> >> 21.07.2011 01:05        21'424'556 _6t61z.fdx
>>> >> 21.07.2011 01:01             5'002 _6t61z.fnm
>>> >> 21.07.2011 01:12     1'170'370'187 _6t61z.frq
>>> >> 21.07.2011  01:12       598'373'388 _6t61z.prx
>>> >> 21.07.2011  01:12         7'574'912 _6t61z.tii
>>> >> 21.07.2011  01:12       678'766'206 _6t61z.tis
>>> >> 21.07.2011  13:46     1'458'592'058 _7d6me.cfs
>>> >> 21.07.2011  13:48        15'702'654 _7dhgz.cfs
>>> >> 21.07.2011  13:52        16'800'942 _7dphm.cfs
>>> >> 21.07.2011  13:55        16'714'431 _7dxht.cfs
>>> >> 21.07.2011  14:24        17'505'435 _7e0wz.cfs
>>> >> 21.07.2011  14:24         5'875'852 _7e0xu.cfs
>>> >> 21.07.2011  14:48        18'340'470 _7e1x5.cfs
>>> >> 21.07.2011  15:19        16'978'564 _7e3ck.cfs
>>> >> 21.07.2011  15:21         1'208'656 _7e3hv.cfs
>>> >> 21.07.2011  15:21            19'361 _7e3hw.cfs
>>> >>              28 File(s) 61'855'156'350 bytes
>>> >>
>>> >> I am doing a delete of some of the older documents. after the delete,
>>> >> I commit then I optimize down to 2 segments. at the end of the
>>> >> optimize the index contains 18702510 docs (314727 were deleted) and it
>>> >> now takes 58975 Mb on disk:
>>> >>
>>> >> 21.07.2011  15:37                20 segments.gen
>>> >> 21.07.2011  15:37               524 segments_2acy6
>>> >> 21.07.2011  13:58                 0 write.lock
>>> >> 16.07.2011  02:21    33'445'798'886 _52aho.fdt
>>> >> 16.07.2011  02:21       178'723'932 _52aho.fdx
>>> >> 16.07.2011  01:58             5'002 _52aho.fnm
>>> >> 16.07.2011  03:10     9'857'410'889 _52aho.frq
>>> >> 16.07.2011  03:10     4'538'234'846 _52aho.prx
>>> >> 16.07.2011  03:10        61'581'767 _52aho.tii
>>> >> 16.07.2011  03:10     5'505'039'790 _52aho.tis
>>> >> 21.07.2011  15:23         1'999'945 _52aho_6.del
>>> >> 21.07.2011  15:31     5'194'848'138 _7e3hy.fdt
>>> >> 21.07.2011  15:31        28'613'668 _7e3hy.fdx
>>> >> 21.07.2011  15:25             5'002 _7e3hy.fnm
>>> >> 21.07.2011  15:37     1'529'771'296 _7e3hy.frq
>>> >> 21.07.2011  15:37       726'582'244 _7e3hy.prx
>>> >> 21.07.2011  15:37         8'518'198 _7e3hy.tii
>>> >> 21.07.2011  15:37       763'213'144 _7e3hy.tis
>>> >>              18 File(s) 61'840'347'291 bytes
>>> >>
>>> >> as you can see, size on disk did not really change. at this point I
>>> >> optimize down to 1 segment and at the end the index takes 48273 Mb on
>>> >> disk:
>>> >>
>>> >> 21.07.2011  16:46                20 segments.gen
>>> >> 21.07.2011  16:46               278 segments_2acy8
>>> >> 21.07.2011  13:58                 0 write.lock
>>> >> 21.07.2011  16:06    32'901'423'750 _7e3hz.fdt
>>> >> 21.07.2011  16:06       149'582'052 _7e3hz.fdx
>>> >> 21.07.2011  15:42             5'002 _7e3hz.fnm
>>> >> 21.07.2011  16:46     8'608'541'177 _7e3hz.frq
>>> >> 21.07.2011  16:46     4'392'616'115 _7e3hz.prx
>>> >> 21.07.2011  16:46        50'571'856 _7e3hz.tii
>>> >> 21.07.2011  16:46     4'515'914'658 _7e3hz.tis
>>> >>              10 File(s) 50'618'654'908 bytes
>>> >>
>>> >>
>>> >> this means that with the 1-segment optimize I was able to reclaim 10
>>> >> Gb on disk that the 2-segment optimize could not.
>>> >>
>>> >> how can this be explained? is that normal behavior?
>>> >>
>>> >> thanks,
>>> >>
>>> >> vince
>>> >>
>>> >> Simon Willnauer <simon.willnauer@googlemail.com> wrote on 20.07.2011 23:11:
>>> >>
>>> >> On Wed, Jul 20, 2011 at 2:00 PM,  <v.sevel@lombardodier.com> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> I index several million small documents per day. each day, I remove
>>> >>> some of the older documents to keep the index at a stable number of
>>> >>> documents. after each purge, I commit then I optimize the index. what
>>> >>> I found is that if I keep optimizing with max num segments = 2, then
>>> >>> the index keeps growing on the disk. but as soon as I optimize with
>>> >>> just 1 segment, the space gets reclaimed on the disk. so, I have
>>> >>> currently adopted the following strategy: every night I optimize with
>>> >>> 2 segments, except once per week when I optimize with just 1 segment.
>>> >>
>>> >> what do you mean by "keeps growing"? you have n segments and you
>>> >> optimize down to 2 and the index is bigger than the one with n
>>> >> segments?
>>> >>
>>> >> simon
>>> >>>
>>> >>> is that an expected behavior?
>>> >>> I guess I am doing something special because I was not able to
>>> >>> reproduce this behavior in a unit test. what could it be?
>>> >>>
>>> >>> it would be nice to get some explanatory services within the product
>>> >>> to help get some understanding of its behavior. something that tells
>>> >>> you some information about your index, for instance (number of docs in
>>> >>> the different states, how the space is being used, ...). lucene is a
>>> >>> wonderful product, but to me this is almost like black magic, and when
>>> >>> there is a specific behavior, I have few clues to figure out something
>>> >>> by myself. some user-oriented logging would be nice as well (the index
>>> >>> writer info stream is really verbose and very low level).
>>> >>>
>>> >>> thanks for your help,
>>> >>>
>>> >>>
>>> >>> Vince
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>>> >
>>> >
>>> >
>>> >
>>> > ************************ DISCLAIMER ************************
>>> > This message is intended only for use by the person to whom it is
>>> > addressed. It may contain information that is privileged and
>>> > confidential. Its content does not constitute a formal commitment by
>>> > Lombard Odier Darier Hentsch & Cie or any of its branches or
>>> > affiliates.
>>> > If you are not the intended recipient of this message, kindly notify
>>> > the sender immediately and destroy this message. Thank You.
>>> > *****************************************************************
>>> >
>>
>>
>
>
