lucene-solr-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Huge increase in index size adding just 2 fields
Date Thu, 06 Nov 2008 16:42:34 GMT
I'll make a very wild guess and say that it's possible for this to happen if your dates are
very granular (down to milliseconds).  All of a sudden you probably got 500,000 new terms
there.  Wild guess.
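
To illustrate the guess above, here is a minimal Python sketch. It assumes the date field is indexed as a term (the original message only says it is stored) and uses synthetic, randomly generated timestamps rather than real data. The helper `distinct_terms` and the 50,000-document sample are hypothetical, chosen only to show how term counts scale with date granularity.

```python
import random
from datetime import datetime, timedelta

def distinct_terms(timestamps, fmt):
    """Count the unique terms produced when each timestamp is rendered with fmt."""
    return len({ts.strftime(fmt) for ts in timestamps})

random.seed(0)  # fixed seed so the synthetic sample is reproducible
start = datetime(2008, 1, 1)
span_ms = 300 * 24 * 3600 * 1000  # timestamps spread over 300 days

# Synthetic sample: one millisecond-precision timestamp per document.
docs = [start + timedelta(milliseconds=random.randrange(span_ms))
        for _ in range(50_000)]

ms_terms = distinct_terms(docs, "%Y-%m-%dT%H:%M:%S.%f")  # full precision
day_terms = distinct_terms(docs, "%Y-%m-%d")             # rounded to the day

print("full-precision terms:", ms_terms)
print("day-rounded terms:   ", day_terms)
```

At full precision nearly every document contributes its own term, while rounding to the day caps the count at the number of days covered.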


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Phillip Farber <pfarber@umich.edu>
> To: solr-user@lucene.apache.org
> Sent: Thursday, November 6, 2008 11:08:18 AM
> Subject: Re: Huge increase in index size adding just 2 fields
> 
> May I ask again whether an index size increase from 120GB to 166GB is expected 
> simply by adding a stored date and a stored repeating string field of length 
> perhaps 20 and roughly 2 values per doc on average, across 500,000 docs?  The doc is a 
> large body of OCR and the position index dominates due to the large number of 
> terms.
> 
> Thanks,
> 
> Phil
> 
> 
> Phillip Farber wrote:
> > 
> > Hi,
> > 
> > We're indexing a lot of dirty OCR, so the index is really huge due to the size 
> > of the position file.  We still get OK response time, though, with a median of 
> > 100ms.  Phrase queries are a different matter, obviously.  But as we add a couple 
> > of fields, we're seeing some really large increases in index size that do not 
> > make sense.
> > 
> > Our 500,000 document index is 120G. Its simple schema is:
> > 
> > 
> > 
> > 
> > 
> > 
> required="true"/>
> > 
> > We added the following 2 fields to the above schema:
> > 
> > 
> > 
> multiValued="true"/>
> > 
> > where the "hlb" field consists of not more than 3-4 strings such as "Social 
> > Science".
> > 
> > Our 500,000 document index size increased to 166G!  This seems completely 
> > wrong.  Looking at the directory listings for each case it appears every one of 
> > the files grew in size.
> > 
> > How can this be?
> > 
> > Phil
> > 
> > ===
> > 
> > 120G index:
> > 
> > -rw-r--r--  1 tomcat admin     81023261 Sep 24 06:00 _fj.fdt
> > -rw-r--r--  1 tomcat admin      4000072 Sep 24 06:00 _fj.fdx
> > -rw-r--r--  1 tomcat admin           33 Sep 24 06:00 _fj.fnm
> > -rw-r--r--  1 tomcat admin  14069125169 Sep 24 06:16 _fj.frq
> > -rw-r--r--  1 tomcat admin      1500031 Sep 24 06:16 _fj.nrm
> > -rw-r--r--  1 tomcat admin 109247382360 Sep 24 08:25 _fj.prx
> > -rw-r--r--  1 tomcat admin     58677668 Sep 24 08:25 _fj.tii
> > -rw-r--r--  1 tomcat admin   4319853217 Sep 24 08:32 _fj.tis
> > -rw-r--r--  1 tomcat admin           42 Sep 24 08:32 segments_fo
> > -rw-r--r--  1 tomcat admin           20 Sep 24 08:32 segments.gen
> > 
> > 166G index (+ 2 fields)
> > 
> > -rw-r--r-- 1 tomcat admin    113530692 Oct 21 10:42 _fh.fdt
> > -rw-r--r-- 1 tomcat admin      3960256 Oct 21 10:42 _fh.fdx
> > -rw-r--r-- 1 tomcat admin           44 Oct 21 10:42 _fh.fnm
> > -rw-r--r-- 1 tomcat admin  15242830112 Oct 21 12:58 _fh.frq
> > -rw-r--r-- 1 tomcat admin      1485100 Oct 21 12:58 _fh.nrm
> > -rw-r--r-- 1 tomcat admin 117927610810 Oct 21 12:58 _fh.prx
> > -rw-r--r-- 1 tomcat admin     72760439 Oct 21 12:58 _fh.tii
> > -rw-r--r-- 1 tomcat admin   5337669551 Oct 21 12:58 _fh.tis
> > -rw-r--r-- 1 tomcat admin           42 Oct 21 12:58 segments_fk
> > -rw-r--r-- 1 tomcat admin           20 Oct 21 12:58 segments.gen
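
To see where the growth is concentrated, the two listings above can be compared file by file. The following Python sketch is only a tabulation of the byte sizes copied from those listings (the `segments*` files are omitted as negligible). The variable names are illustrative, and nothing here explains why any file changed.

```python
# File sizes in bytes, copied from the two directory listings in the quoted message.
before = {  # "120G index" (_fj.*)
    "fdt": 81023261, "fdx": 4000072, "fnm": 33, "frq": 14069125169,
    "nrm": 1500031, "prx": 109247382360, "tii": 58677668, "tis": 4319853217,
}
after = {  # "166G index (+ 2 fields)" (_fh.*)
    "fdt": 113530692, "fdx": 3960256, "fnm": 44, "frq": 15242830112,
    "nrm": 1485100, "prx": 117927610810, "tii": 72760439, "tis": 5337669551,
}

# Per-extension change in bytes, and the net change over all listed files.
deltas = {ext: after[ext] - before[ext] for ext in before}
total_delta = sum(deltas.values())

# Print the files from largest to smallest change.
for ext, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    pct = 100.0 * d / before[ext]      # change relative to the old file size
    share = 100.0 * d / total_delta    # share of the net growth
    print(f".{ext:<4}{d:>16,d} bytes  {pct:+7.2f}% of old size  {share:6.2f}% of growth")

print(f"net change over listed files: {total_delta:,d} bytes")
```

On these numbers the `.prx` file accounts for most of the added bytes, followed by `.frq` and `.tis`, while `.fdx` and `.nrm` are slightly smaller in the second listing.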

