From: Michael McCandless
To: general@lucene.apache.org
Subject: Re: problems with large Lucene index
Date: Fri, 6 Mar 2009 04:47:11 -0500
In-Reply-To: <20090305091642.tn2pa2480gokgwg8@webmail.digiatlas.org>

Lucene is trying to allocate the contiguous norms array for your
index, which should be ~273 MB (= 286,000,000 / 1024 / 1024, at one
byte per document), when it hits the OOM.

Is your search sorting by field value? (That would also consume
memory.) Or is it just the default (by relevance) sort?

The only other biggish consumer of memory should be the deleted docs,
but that's a BitVector so it should need ~34 MB RAM.

Can you run a memory profiler to see what else is consuming RAM?

Mike

lucene@digiatlas.org wrote:

> Hello,
>
> I am using Lucene via Hibernate Search but the following problem is
> also seen using Luke. I'd appreciate any suggestions for solving
> this problem.
>
> I have a Lucene index (27 GB in size) that indexes a database table
> of 286 million rows. While Lucene was able to perform this indexing
> just fine (albeit very slowly), using the index has proved to be
> impossible.
> Any searches conducted on it, either from my Hibernate Search query
> or by placing the query into Luke, give:
>
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.lucene.index.MultiReader.norms(MultiReader.java:271)
>     at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
>     at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:230)
>     at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>     ...
>
> The queries are simple, of the form:
>
> (+value:church +marcField:245 +subField:a)
>
> which in this example should only return a few thousand results.
>
> The interpreter is already running with the maximum heap space
> allowed for the Java executable on Windows XP
> (java -Xms1200m -Xmx1200m).
>
> The Lucene index was created using the following Hibernate Search
> annotations:
>
> @Column
> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>        store=Store.NO)
> private Integer marcField;
>
> @Column(length = 2)
> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>        store=Store.NO)
> private String subField;
>
> @Column(length = 2)
> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>        store=Store.NO)
> private String indicator1;
>
> @Column(length = 2)
> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>        store=Store.NO)
> private String indicator2;
>
> @Column(length = 10000)
> @Field(index=org.hibernate.search.annotations.Index.TOKENIZED,
>        store=Store.NO)
> private String value;
>
> @Column
> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>        store=Store.NO)
> private Integer recordId;
>
> So all of the fields have NO NORMS except for "value", which
> contains description text that needs to be tokenised.
>
> Is there any way around this? Does Lucene really have such a low
> limit for how much data it can search (and I consider 286 million
> documents to be pretty small beer - we were hoping to index a table
> of over a billion rows)? Or is there something I'm missing?
>
> Thanks.
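
For reference, the two memory estimates in the reply (about 273 MB for the norms array and about 34 MB for the deleted-docs bit vector) can be reproduced with a little arithmetic. The sketch below is not Lucene code: the class and method names are hypothetical, and it assumes one document per database row (286,000,000 documents), one norm byte per document per field indexed with norms, and one bit per document for deletions.

```java
public class NormsMemoryEstimate {

    /** Norms cost one byte per document for each field indexed with norms. */
    public static long normsBytes(long maxDoc, int fieldsWithNorms) {
        return maxDoc * fieldsWithNorms;
    }

    /** A deleted-docs bit vector needs one bit per document, rounded up to whole bytes. */
    public static long deletedDocsBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    /** Converts bytes to megabytes using 1024 * 1024 bytes per MB, as in the reply. */
    public static double toMB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long maxDoc = 286_000_000L; // assumption: one document per row
        int fieldsWithNorms = 1;    // only "value" is indexed with norms

        System.out.printf("norms array:  %.1f MB%n",
                toMB(normsBytes(maxDoc, fieldsWithNorms)));
        System.out.printf("deleted docs: %.1f MB%n",
                toMB(deletedDocsBytes(maxDoc)));
    }
}
```

Under these assumptions both figures fit inside a 1200 MB heap, which is presumably why the reply asks what else is consuming RAM.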