lucene-java-user mailing list archives

From "Michael McCandless" <luc...@mikemccandless.com>
Subject Re: Urgent : How much actually the disk space needed to optimize the index?
Date Tue, 13 Mar 2007 13:44:55 GMT
"maureen tanuwidjaja" <autumn_musique@yahoo.com> wrote:

>   "One thing that stands out in your listing is: your norms file
>   (_1ke1.nrm) is enormous compared to all other files.  Are you indexing
>   many tiny docs where each doc has highly variable fields or
>   something?"
>   
>   Yes, I am also confused about why this .nrm file is so tremendous in size.
>   I am indexing a total of 657739 XML documents.
>   The total number of fields is 37552 (I am using XML tags as the
>   fields).

OK, this is going to be a problem for Lucene.

This case will definitely go over 2X disk usage during optimize.  I
will update the javadocs to call out this caveat.

The .nrm file (norms) requires 1 byte per document per unique field in
the segment, regardless of whether that document has that field (i.e.,
it's not a "sparse" representation).
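
To put a rough number on that, here is a minimal sketch of the
arithmetic in Python.  The doc and field counts come from the quoted
message; the function name is made up for illustration, and treating
every one of the 37552 fields as carrying norms is an assumption, so
read the result as an upper-bound estimate for a single optimized
segment rather than a measured size.

```python
# Back-of-the-envelope norms size: 1 byte per document per unique field,
# whether or not the document actually contains that field.
BYTES_PER_NORM = 1


def norms_bytes(num_docs: int, num_fields: int) -> int:
    """Estimated .nrm size for one segment, assuming every field has norms."""
    return num_docs * num_fields * BYTES_PER_NORM


# Figures from the quoted message. Assumption: all 37552 fields carry norms.
total = norms_bytes(657739, 37552)
print(f"{total} bytes, about {total / 2**30:.1f} GiB")
```

That works out to roughly 23 GiB for the norms alone once everything
sits in one segment.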

When you have many small docs, and each doc has (somewhat) different
fields from the others, this results in tremendously large storage
for the norms.

The thing is, within one segment it may be OK since that segment has a
subset of all docs and fields.  But then when segments are merged
(as optimize does), the merged segment needs a norm for every doc
against the union of all the fields, so the product of #docs and
#fields grows "multiplicatively" and results in far, far more storage
required than the sum of the individual segments.
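
A small sketch of that effect, again in Python.  The segment sizes and
field names below are entirely hypothetical (four segments of 1000 docs
each, with 100 unique fields apiece and no overlap), chosen only to
make the growth easy to see; real segments would share some fields, so
the blow-up would be smaller than in this worst case.

```python
def norms_before_merge(segments):
    """Sum of per-segment norms sizes: docs * unique fields in that segment."""
    return sum(num_docs * len(fields) for num_docs, fields in segments)


def norms_after_merge(segments):
    """Norms size of the single merged segment: all docs * union of all fields."""
    total_docs = sum(num_docs for num_docs, _ in segments)
    all_fields = set().union(*(fields for _, fields in segments))
    return total_docs * len(all_fields)


# Hypothetical data: 4 segments, 1000 docs each, 100 fields each,
# with no field shared between segments.
segments = [
    (1000, {f"seg{i}_field{j}" for j in range(100)})
    for i in range(4)
]

before = norms_before_merge(segments)
after = norms_after_merge(segments)
print(before, after, after / before)
```

With fully disjoint fields, merging N equal segments multiplies the
norms storage by N; with fully identical fields it would not grow at
all.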

The only simple workaround I can think of is to set maxMergeDocs to
keep all segments "small".  But then you may end up with too many
segments over time.  Either that or find a way to reduce the number of
unique fields that you actually need to store.

Mike


