From: Otis Gospodnetic
To: solr-user@lucene.apache.org
Subject: Re: Huge increase in index size adding just 2 fields
Date: Thu, 6 Nov 2008 08:42:34 -0800 (PST)
Message-ID: <897649.4898.qm@web50306.mail.re2.yahoo.com>
References: <490F356B.1080806@umich.edu> <49131672.1080301@umich.edu>

I'll make a very wild guess and say that it's possible for this to happen if your dates are very granular (down to milliseconds). All of a sudden you probably got 500,000 new terms there. Wild guess.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Phillip Farber
> To: solr-user@lucene.apache.org
> Sent: Thursday, November 6, 2008 11:08:18 AM
> Subject: Re: Huge increase in index size adding just 2 fields
>
> May I ask again whether an index size increase from 120GB to 166GB is expected
> simply from adding a stored date field and a stored repeating string field of
> length perhaps 20, with roughly 2 values per doc on average, across 500,000 docs?
> Each doc is a large body of OCR, and the position index dominates due to the
> large number of terms.
>
> Thanks,
>
> Phil
>
>
> Phillip Farber wrote:
> >
> > Hi,
> >
> > We're indexing a lot of dirty OCR, so the index is really huge due to the size
> > of the position file. We still get OK response time, though, with a median of
> > 100ms. Phrase queries are a different matter, obviously. But as we add a couple
> > of fields, we're seeing some really large increases in index size that do not
> > make sense.
> >
> > Our 500,000 document index is 120G. Its simple schema is:
> >
> >     required="true"/>
> >
> > We added the following 2 fields to the above schema:
> >
> >     multiValued="true"/>
> >
> > where the "hlb" field consists of not more than 3-4 strings such as "Social
> > Science".
> >
> > Our 500,000 document index size increased to 166G!
> > This seems completely wrong. Looking at the directory listings for each case,
> > it appears every one of the files grew in size.
> >
> > How can this be?
> >
> > Phil
> >
> > ===
> >
> > 120G index:
> >
> > -rw-r--r-- 1 tomcat admin     81023261 Sep 24 06:00 _fj.fdt
> > -rw-r--r-- 1 tomcat admin      4000072 Sep 24 06:00 _fj.fdx
> > -rw-r--r-- 1 tomcat admin           33 Sep 24 06:00 _fj.fnm
> > -rw-r--r-- 1 tomcat admin  14069125169 Sep 24 06:16 _fj.frq
> > -rw-r--r-- 1 tomcat admin      1500031 Sep 24 06:16 _fj.nrm
> > -rw-r--r-- 1 tomcat admin 109247382360 Sep 24 08:25 _fj.prx
> > -rw-r--r-- 1 tomcat admin     58677668 Sep 24 08:25 _fj.tii
> > -rw-r--r-- 1 tomcat admin   4319853217 Sep 24 08:32 _fj.tis
> > -rw-r--r-- 1 tomcat admin           42 Sep 24 08:32 segments_fo
> > -rw-r--r-- 1 tomcat admin           20 Sep 24 08:32 segments.gen
> >
> > 166G index (+ 2 fields):
> >
> > -rw-r--r-- 1 tomcat admin    113530692 Oct 21 10:42 _fh.fdt
> > -rw-r--r-- 1 tomcat admin      3960256 Oct 21 10:42 _fh.fdx
> > -rw-r--r-- 1 tomcat admin           44 Oct 21 10:42 _fh.fnm
> > -rw-r--r-- 1 tomcat admin  15242830112 Oct 21 12:58 _fh.frq
> > -rw-r--r-- 1 tomcat admin      1485100 Oct 21 12:58 _fh.nrm
> > -rw-r--r-- 1 tomcat admin 117927610810 Oct 21 12:58 _fh.prx
> > -rw-r--r-- 1 tomcat admin     72760439 Oct 21 12:58 _fh.tii
> > -rw-r--r-- 1 tomcat admin   5337669551 Oct 21 12:58 _fh.tis
> > -rw-r--r-- 1 tomcat admin           42 Oct 21 12:58 segments_fk
> > -rw-r--r-- 1 tomcat admin           20 Oct 21 12:58 segments.gen
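
To see where the growth in the two directory listings actually sits, the per-file byte counts can be compared directly. The following is a minimal sketch, not part of the original thread: the sizes are copied from the listings above, files are keyed by extension (the `_fj` and `_fh` segment prefixes are dropped), and the two small `segments*` files are left out. The variable names are illustrative only.

```python
# Byte sizes copied from the two `ls -l` listings in the quoted message.
# Keys are Lucene file extensions; the comments give each file's usual role.
before = {
    "fdt": 81023261,       # stored field data
    "fdx": 4000072,        # stored field index
    "fnm": 33,             # field names
    "frq": 14069125169,    # term frequencies
    "nrm": 1500031,        # norms
    "prx": 109247382360,   # term positions
    "tii": 58677668,       # term dictionary index
    "tis": 4319853217,     # term dictionary
}
after = {
    "fdt": 113530692,
    "fdx": 3960256,
    "fnm": 44,
    "frq": 15242830112,
    "nrm": 1485100,
    "prx": 117927610810,
    "tii": 72760439,
    "tis": 5337669551,
}

# Growth per file, in bytes (negative means the file shrank).
delta = {ext: after[ext] - before[ext] for ext in before}

# Print the files from largest growth to smallest.
for ext, change in sorted(delta.items(), key=lambda kv: kv[1], reverse=True):
    percent = 100.0 * change / before[ext]
    print(f".{ext}: {change:+,} bytes ({percent:+.1f}%)")

total_before = sum(before.values())
total_after = sum(after.values())
print(f"listed total: {total_before:,} -> {total_after:,} bytes")
```

Sorting by absolute growth rather than by percentage keeps the small files from hiding the ones that account for most of the added bytes.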
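
Otis's guess about date granularity can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a measurement of the index discussed in this thread: it assumes one timestamp per document, drawn uniformly at random from a single year (hypothetical data), and it assumes each distinct value of the date field becomes one term in the index. The function name `distinct_terms` is made up for the example.

```python
import random

MS_PER_DAY = 24 * 60 * 60 * 1000


def distinct_terms(timestamps_ms, resolution_ms):
    """Count distinct values after truncating each timestamp to the given resolution."""
    return len({ts // resolution_ms for ts in timestamps_ms})


# Hypothetical data: one millisecond timestamp per document, spread over one year.
rng = random.Random(0)
num_docs = 500_000
timestamps_ms = [rng.randrange(365 * MS_PER_DAY) for _ in range(num_docs)]

ms_terms = distinct_terms(timestamps_ms, 1)            # full millisecond precision
day_terms = distinct_terms(timestamps_ms, MS_PER_DAY)  # truncated to the day

print(f"distinct terms at millisecond resolution: {ms_terms:,}")
print(f"distinct terms at day resolution:         {day_terms:,}")
```

With millisecond precision nearly every document contributes its own term, while truncating to the day caps the count at the number of days covered. Whether that alone could explain growth of this size is a separate question that the simulation does not answer.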