lucene-solr-user mailing list archives

From Phillip Farber <pfar...@umich.edu>
Subject Re: Huge increase in index size adding just 2 fields
Date Thu, 06 Nov 2008 16:08:18 GMT
May I ask again whether an index size increase from 120GB to 166GB is 
expected simply from adding a stored date field and a stored repeating 
string field of length perhaps 20, with roughly 2 values per doc on 
average, across 500,000 docs?  Each doc is a large body of OCR, and the 
position index dominates due to the large number of terms.
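
For scale, here is a rough back-of-envelope sketch of how many raw stored bytes the two new fields might add. It assumes the date is stored as a 20-character text value and ignores any per-field overhead in the stored-fields file. Both are assumptions for illustration, not measurements.

```python
# Back-of-envelope estimate of the raw stored bytes added by the two new fields.
DOCS = 500_000            # document count from the message
DATE_BYTES = 20           # ASSUMPTION: date stored as text like "2008-11-06T16:08:18Z"
HLB_VALUES_PER_DOC = 2    # average number of "hlb" values per doc, from the message
HLB_BYTES_PER_VALUE = 20  # approximate value length, from the message

per_doc = DATE_BYTES + HLB_VALUES_PER_DOC * HLB_BYTES_PER_VALUE
total = DOCS * per_doc
print(f"{per_doc} bytes/doc, about {total / 1e6:.0f} MB in total")
```

Under those assumptions the stored data is on the order of tens of megabytes, which is why a jump of tens of gigabytes looks surprising.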

Thanks,

Phil


Phillip Farber wrote:
> 
> Hi,
> 
> We're indexing a lot of dirty OCR, so the index is really huge due to 
> the size of the position file.  We still get OK response times, though, 
> with a median of 100ms.  Phrase queries are a different matter, 
> obviously.  But as we add a couple of fields, we're seeing some really 
> large increases in index size that do not make sense.
> 
> Our 500,000 document index is 120G. Its simple schema is:
> 
> <field name="id" type="string" indexed="true" stored="true" 
> required="true"/>
> <field name="ocr" type="Ocr" indexed="true" stored="false" 
> required="true"/>
> <field name="title" type="Ocr" indexed="true" stored="true" 
> required="true"/>
> <field name="author" type="Ocr" indexed="true" stored="true" 
> required="true"/>
> <field name="rights" type="sint" indexed="true" stored="true" 
> required="true"/>
> 
> We added the following 2 fields to the above schema:
> 
> <field name="date" type="date" indexed="true" stored="true" 
> required="true"/>
> <field name="hlb" type="string" indexed="true" stored="true" 
> multiValued="true"/>
> 
> where the "hlb" field consists of not more than 3-4 strings such as 
> "Social Science".
> 
> Our 500,000 document index size increased to 166G!  This seems 
> completely wrong.  Looking at the directory listings for each case it 
> appears every one of the files grew in size.
> 
> How can this be?
> 
> Phil
> 
> ===
> 
> 120G index:
> 
> -rw-r--r--  1 tomcat admin     81023261 Sep 24 06:00 _fj.fdt
> -rw-r--r--  1 tomcat admin      4000072 Sep 24 06:00 _fj.fdx
> -rw-r--r--  1 tomcat admin           33 Sep 24 06:00 _fj.fnm
> -rw-r--r--  1 tomcat admin  14069125169 Sep 24 06:16 _fj.frq
> -rw-r--r--  1 tomcat admin      1500031 Sep 24 06:16 _fj.nrm
> -rw-r--r--  1 tomcat admin 109247382360 Sep 24 08:25 _fj.prx
> -rw-r--r--  1 tomcat admin     58677668 Sep 24 08:25 _fj.tii
> -rw-r--r--  1 tomcat admin   4319853217 Sep 24 08:32 _fj.tis
> -rw-r--r--  1 tomcat admin           42 Sep 24 08:32 segments_fo
> -rw-r--r--  1 tomcat admin           20 Sep 24 08:32 segments.gen
> 
> 166G index (+ 2 fields)
> 
> -rw-r--r-- 1 tomcat admin    113530692 Oct 21 10:42 _fh.fdt
> -rw-r--r-- 1 tomcat admin      3960256 Oct 21 10:42 _fh.fdx
> -rw-r--r-- 1 tomcat admin           44 Oct 21 10:42 _fh.fnm
> -rw-r--r-- 1 tomcat admin  15242830112 Oct 21 12:58 _fh.frq
> -rw-r--r-- 1 tomcat admin      1485100 Oct 21 12:58 _fh.nrm
> -rw-r--r-- 1 tomcat admin 117927610810 Oct 21 12:58 _fh.prx
> -rw-r--r-- 1 tomcat admin     72760439 Oct 21 12:58 _fh.tii
> -rw-r--r-- 1 tomcat admin   5337669551 Oct 21 12:58 _fh.tis
> -rw-r--r-- 1 tomcat admin           42 Oct 21 12:58 segments_fk
> -rw-r--r-- 1 tomcat admin           20 Oct 21 12:58 segments.gen
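
To compare the two listings file by file, a minimal sketch like the following tabulates the change for each extension. The byte counts are copied from the listings above. The dictionary names are only illustrative, and the small segments_* and segments.gen files are left out.

```python
# Per-extension byte counts copied from the two directory listings.
before = {  # "120G index" (_fj.*)
    "fdt": 81023261, "fdx": 4000072, "fnm": 33, "frq": 14069125169,
    "nrm": 1500031, "prx": 109247382360, "tii": 58677668, "tis": 4319853217,
}
after = {   # "166G index (+ 2 fields)" (_fh.*)
    "fdt": 113530692, "fdx": 3960256, "fnm": 44, "frq": 15242830112,
    "nrm": 1485100, "prx": 117927610810, "tii": 72760439, "tis": 5337669551,
}

# Signed change in bytes for each file type.
delta = {ext: after[ext] - before[ext] for ext in before}

# Print the largest changes first, with the relative change alongside.
for ext, diff in sorted(delta.items(), key=lambda kv: -abs(kv[1])):
    pct = 100.0 * diff / before[ext]
    print(f".{ext}: {diff:+,} bytes ({pct:+.1f}%)")

total_before = sum(before.values())
total_after = sum(after.values())
print(f"total: {total_before:,} -> {total_after:,} bytes "
      f"({total_after - total_before:+,})")
```

Sorting by absolute change shows at a glance which files account for most of the difference between the two listings.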
