lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
Date Fri, 08 Jan 2010 13:38:24 GMT
Lucene stores 1 byte (disk and RAM, when searching that field) per
document for any field that has norms enabled, even for documents that
do not contain that field.

In your case, that's ~20 MB per field (once optimize is done), times
559 fields = ~11TB of storage.

You should index these fields with Field.Index.ANALYZED_NO_NORMS to
turn off norms.  But, this means field/doc boosting, and the normal
length boosting Lucene normally does (shorter documents get a better
score), will be silently disabled.  Also: you must fully re-index from
scratch, otherwise the norms will turn themselves back on when
segments merge together.

Mike

On Fri, Jan 8, 2010 at 7:55 AM, Yuliya Palchaninava <yp@solute.de> wrote:
> Thanks Michael.
>
> You are probably wright.
>
> Not optimized size is 4.1G, optimized index is about 15G.
>
> Yes, our documents do have many different indexed fields and norms are enabled.
> Nr of fields: 559
> Nr of documents: 20845906
> Nr of terms: 25615389
>
> Could you please give me a more detailled explanation, how the storage of norms effects
the size of an index.
> What do you mean exactly with "norms are not stored sparsely"?
>
> Thanks,
> Yuliya
>
>> -----Ursprüngliche Nachricht-----
>> Von: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Gesendet: Donnerstag, 7. Januar 2010 18:00
>> An: java-user@lucene.apache.org
>> Betreff: Re: Lucene 2.9 and 3.0: Optimized index is thrice as
>> large as the not optimized index
>>
>> Do your documents have many different indexed fields?  If you
>> do, and norms are enabled, that could be the cause (norms are
>> not stored sparsely).
>>
>> But: what actual sizes are we talking about?
>>
>> Mike
>>
>> On Thu, Jan 7, 2010 at 11:50 AM, Yuliya Palchaninava
>> <yp@solute.de> wrote:
>> > Otis,
>> >
>> > thanks for the answer.
>> >
>> > Unfortunatelly the index *directory* remains larger *after"
>> the optimization.
>> > In our case the otimization was/is completed successfully
>> and, as you
>> > say, there is only one segment in the directory.
>> >
>> > Some other ideas?
>> >
>> > Thanks,
>> > Yuliya
>> >
>> >> -----Ursprüngliche Nachricht-----
>> >> Von: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
>> >> Gesendet: Donnerstag, 7. Januar 2010 17:35
>> >> An: java-user@lucene.apache.org
>> >> Betreff: Re: Lucene 2.9 and 3.0: Optimized index is thrice
>> as large
>> >> as the not optimized index
>> >>
>> >> Yuliya,
>> >>
>> >> The index *directory* will be larger *while* you are optimizing.
>> >> After the optimization is completed successfully, the
>> index directory
>> >> will be smaller.  It is possible that your index directory is
>> >> large(r) because you have some left-over segments (e.g. from some
>> >> earlier failed/interrupted optimizations) that are not
>> really a part
>> >> of the index.  After optimizing, you should have only 1
>> segment, so
>> >> if you see more than 1 segment, look at the ones with older
>> >> timestamps.  Those can be (re)moved.
>> >>
>> >>  Otis
>> >> --
>> >> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> > From: Yuliya Palchaninava <yp@solute.de>
>> >> > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
>> >> > Sent: Thu, January 7, 2010 11:23:08 AM
>> >> > Subject: Lucene 2.9 and 3.0: Optimized index is thrice as
>> >> large as the
>> >> > not optimized index
>> >> >
>> >> > Hi,
>> >> >
>> >> > According to the api documentation: "In general, once
>> the optimize
>> >> > completes, the total size of the index will be less than
>> >> the size of
>> >> > the starting index. It could be quite a bit smaller (if
>> there were
>> >> > many pending deletes) or just slightly smaller". In our
>> >> case the index
>> >> > becomes not smaller but larger, namely thrice as large.
>> >> >
>> >> > The not optimized index doesn't contain compressed fields,
>> >> what could
>> >> > have caused the growth of the index due to the
>> otimization. So we
>> >> > cannot explain what happens.
>> >> >
>> >> > Does someone have an explanation for the index growth due
>> >> to the optimization?
>> >> >
>> >> > Thanks,
>> >> > Yuliya
>> >> >
>> >> >
>> >> >
>> >>
>> ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >>
>> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>> ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message