From: Yuliya Palchaninava
To: "java-user@lucene.apache.org"
Date: Fri, 8 Jan 2010 15:00:16 +0100
Subject: RE: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
Message-ID: <0523297188849D479C16D5657F665769BB5CED4874@solute-exc2k7.solute.ka>
In-Reply-To: <9ac0c6aa1001080538k11bd56f9n1a6b7eaa86cba1be@mail.gmail.com>

Mike,

thanks a lot! That's exactly what we'll do.

Actually, we have a lot of dynamic fields that are not analyzed and not involved in field/document boosting, so we can disable norms on these fields without problems.

Thanks again.
Yuliya

> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Friday, 8 January 2010 14:38
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
>
> Lucene stores 1 byte (on disk, and in RAM when searching that field) per document for any field that has norms enabled, even for documents that do not contain that field.
>
> In your case, that's ~20 MB per field (once optimize is done), times 559 fields = ~11 GB of storage.
>
> You should index these fields with Field.Index.ANALYZED_NO_NORMS to turn off norms. But this means field/doc boosting, and the length boosting Lucene normally does (shorter documents get a better score), will be silently disabled. Also: you must fully re-index from scratch, otherwise the norms will turn themselves back on when segments merge together.
>
> Mike
>
> On Fri, Jan 8, 2010 at 7:55 AM, Yuliya Palchaninava wrote:
> > Thanks Michael.
> >
> > You are probably right.
> >
> > The unoptimized index is 4.1G; the optimized index is about 15G.
> >
> > Yes, our documents do have many different indexed fields, and norms are enabled.
> > Nr of fields: 559
> > Nr of documents: 20845906
> > Nr of terms: 25615389
> >
> > Could you please give me a more detailed explanation of how the storage of norms affects the size of an index?
> > What exactly do you mean by "norms are not stored sparsely"?
> >
> > Thanks,
> > Yuliya
> >
> >> -----Original Message-----
> >> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> >> Sent: Thursday, 7 January 2010 18:00
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
> >>
> >> Do your documents have many different indexed fields? If you do, and norms are enabled, that could be the cause (norms are not stored sparsely).
> >>
> >> But: what actual sizes are we talking about?
> >>
> >> Mike
> >>
> >> On Thu, Jan 7, 2010 at 11:50 AM, Yuliya Palchaninava wrote:
> >> > Otis,
> >> >
> >> > thanks for the answer.
> >> >
> >> > Unfortunately, the index *directory* remains larger *after* the optimization.
> >> > In our case the optimization completed successfully and, as you say, there is only one segment in the directory.
> >> >
> >> > Any other ideas?
> >> >
> >> > Thanks,
> >> > Yuliya
> >> >
> >> >> -----Original Message-----
> >> >> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> >> >> Sent: Thursday, 7 January 2010 17:35
> >> >> To: java-user@lucene.apache.org
> >> >> Subject: Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
> >> >>
> >> >> Yuliya,
> >> >>
> >> >> The index *directory* will be larger *while* you are optimizing.
> >> >> After the optimization is completed successfully, the index directory will be smaller. It is possible that your index directory is large(r) because you have some left-over segments (e.g. from some earlier failed/interrupted optimizations) that are not really a part of the index. After optimizing, you should have only 1 segment, so if you see more than 1 segment, look at the ones with older timestamps. Those can be (re)moved.
> >> >>
> >> >> Otis
> >> >> --
> >> >> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >> >>
> >> >> ----- Original Message ----
> >> >> > From: Yuliya Palchaninava
> >> >> > To: "java-user@lucene.apache.org"
> >> >> > Sent: Thu, January 7, 2010 11:23:08 AM
> >> >> > Subject: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > According to the API documentation: "In general, once the optimize completes, the total size of the index will be less than the size of the starting index. It could be quite a bit smaller (if there were many pending deletes) or just slightly smaller". In our case the index becomes not smaller but larger, namely thrice as large.
> >> >> >
> >> >> > The unoptimized index doesn't contain compressed fields, which could otherwise have caused the index to grow during optimization. So we cannot explain what happens.
> >> >> >
> >> >> > Does someone have an explanation for the index growth due to the optimization?
> >> >> >
> >> >> > Thanks,
> >> >> > Yuliya
> >> >> >
> >> >> > ---------------------------------------------------------------------
> >> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
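
The norms arithmetic from Mike's reply (1 byte per document per field with norms enabled, regardless of whether the document contains the field) can be checked with a small sketch. This is plain Java with no Lucene dependency; the class and method names are hypothetical, and it assumes all 559 fields reported in the thread have norms enabled.

```java
// Rough estimate of norms storage in an optimized (single-segment) index.
// Assumption: every counted field has norms enabled, so each costs
// exactly one byte per document, as described in the thread.
final class NormsEstimate {

    /** One byte per document for each field that has norms enabled. */
    static long normsBytes(long numDocs, int numFieldsWithNorms) {
        return numDocs * numFieldsWithNorms;
    }

    public static void main(String[] args) {
        long numDocs = 20845906L; // "Nr of documents" from the thread
        int numFields = 559;      // "Nr of fields" from the thread

        long perField = normsBytes(numDocs, 1);
        long total = normsBytes(numDocs, numFields);

        System.out.println("norms per field: " + perField + " bytes");
        System.out.println("norms total:     " + total + " bytes");
    }
}
```

With the numbers from the thread this gives 20,845,906 bytes per field and 11,652,861,454 bytes in total, i.e. roughly 11 GB, which is in line with the reported growth from 4.1G to about 15G.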