lucene-java-user mailing list archives

From maureen tanuwidjaja <autumn_musi...@yahoo.com>
Subject Re: Urgent: How much disk space is actually needed to optimize the index?
Date Tue, 13 Mar 2007 14:08:31 GMT

Oops, sorry, I mistyped...

  I get search results in 30 SECONDS to 3 minutes, which is actually
quite unacceptable for the "search engine" I am building... Is there any
recommendation on how searching could be made faster?
  

maureen tanuwidjaja <autumn_musique@yahoo.com> wrote:  Hi Mike
  
  
"The only simple workaround I can think of is to set maxMergeDocs to
keep all segments "small".  But then you may have too many segments
with time.  Either that or find a way to reduce the number of unique
fields that you actually need to store."
  It is not possible for me to reduce the number of fields I need to store...
  
  Could you recommend a maxMergeDocs value that is small enough to keep all
segments small?
  
  I would also like to ask whether, if the optimize succeeds, searching will then be
significantly faster than on the unoptimized index?
  
  I get search results in 30 to 3 minutes, which is actually quite unacceptable for
the "search engine" I am building... Is there any recommendation on how searching
could be made faster?
  
  Thanks,
  Maureen
  
  

Michael McCandless  wrote:  "maureen tanuwidjaja"  wrote:

>   "One thing that stands out in your listing is: your norms file
>   (_1ke1.nrm) is enormous compared to all other files.  Are you indexing
>   many tiny docs where each docs has highly variable fields or
>   something?"
>   
>   Ya, I am also confused about why this nrm file is tremendous in size.
>   I am indexing a total of 657739 XML documents.
>   The total number of fields is 37552 (I am using XML tags as the
>   fields).

OK, this is going to be a problem for Lucene.

This case will definitely go over 2X disk usage during optimize.  I
will update the javadocs to call out this caveat.

The .nrm file (norms) requires 1 byte per document per unique field in
the segment, regardless of whether that document has that field (ie,
it's not a "sparse" representation).
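
To make that concrete with your numbers (a back-of-the-envelope estimate,
assuming all 657739 docs and all 37552 unique fields end up in one
fully-merged segment):

    657739 docs x 37552 fields x 1 byte/entry = ~24.7 GB of norms

so a fully optimized index would carry an .nrm file on the order of 24 GB.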

When you have many small docs, and each doc has (somewhat) different
fields from the others, this results in a tremendously large storage
for the norms.

The thing is, within one segment it may be OK since that segment has a
subset of all docs and fields.  But then when segments are merged
(like optimize does) the product of #docs and #fields grows
"multiplicatively" and results in far far more storage required than
the sum of the individual segments.
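
As an illustration (made-up numbers, not your index): two segments of
100000 docs each, where each segment has 5000 unique fields that barely
overlap, need about 100000 x 5000 = 0.5 GB of norms apiece, so 1 GB in
total.  Merge them and the result holds 200000 docs x ~10000 unique
fields = ~2 GB of norms, double the sum of the parts, and the gap only
widens as more disjoint segments are merged.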

The only simple workaround I can think of is to set maxMergeDocs to
keep all segments "small".  But then you may have too many segments
with time.  Either that or find a way to reduce the number of unique
fields that you actually need to store.
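
A minimal sketch of that workaround (Lucene 2.x API as of this thread;
the index path and the 100000-doc cap are illustrative assumptions, not
tuned recommendations):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class SmallSegments {
        public static void main(String[] args) throws Exception {
            // Open an existing index (create=false); the path is a placeholder.
            IndexWriter writer = new IndexWriter("/path/to/index",
                                                 new StandardAnalyzer(),
                                                 false);
            // Cap merged segments at 100000 docs so no single segment
            // ever holds the full #docs x #fields norms matrix
            // (illustrative value; tune to your own doc/field counts).
            writer.setMaxMergeDocs(100000);
            // ... addDocument() calls go here ...
            writer.close();
        }
    }

The tradeoff described above still applies: the cap bounds per-segment
norms size, but the segment count (and with it some search-time cost)
keeps growing as the index does.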

Mike




 