From: Michael McCandless
To: general@lucene.apache.org
Subject: Re: problems with large Lucene index (reason found)
Date: Fri, 13 Mar 2009 11:16:16 -0400

lucene@digiatlas.org wrote:

> Yes, I overrode the read() method in
> FSDirectory.FSIndexInput.Descriptor and forced it to read in 50Mb
> chunks and do an arraycopy() into the array created by Lucene. It
> now works with any heap size and doesn't get OOM.

You shouldn't need to do the extra arraycopy? RandomAccessFile can
read into a particular offset/len inside the array. Does that not work?

> There may be other areas this could happen in the Lucene code
> (although at present it seems to be working fine for me on our
> largest, 17Gb, index but I haven't tried accessing data yet - only
> getting the result size - so perhaps there are other calls to read()
> with large buffer sizes).
>
> As this bug does not look like it will be fixed in the near future,
> it might be an idea to put in place a fix in the Lucene code.
> I think it would be safe to read in chunks of up to 100Mb without a
> problem and I don't think it will affect performance to any great
> degree.

I agree. Can you open a Jira issue and post a patch?

> It's pleasing to see that Lucene can easily handle such huge
> indexes, although this bug is obviously quite an impediment to doing
> so.

Yes indeed. This is one crazy bug.

Mike
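
For reference, a minimal sketch of the chunked read discussed above, without the intermediate arraycopy. It relies only on RandomAccessFile.read(byte[], int, int), which writes directly into the destination array at a given offset. The class name ChunkedRead, the method readInChunks, and the 50 MB constant are illustrative assumptions; this is not Lucene code and not the patch requested in this thread.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;

class ChunkedRead {
    // Hypothetical cap per read() call; the thread suggests 50-100 MB works.
    static final int DEFAULT_CHUNK_SIZE = 50 * 1024 * 1024;

    /**
     * Fills dest[offset .. offset+len) from the file's current position,
     * asking the underlying read() for at most chunkSize bytes at a time.
     * Each call writes straight into dest, so no temporary buffer or
     * System.arraycopy() is needed.
     */
    static void readInChunks(RandomAccessFile file, byte[] dest,
                             int offset, int len, int chunkSize)
            throws IOException {
        int total = 0;
        while (total < len) {
            int toRead = Math.min(chunkSize, len - total);
            // read() may return fewer bytes than requested, or -1 at EOF.
            int n = file.read(dest, offset + total, toRead);
            if (n == -1) {
                throw new EOFException("read past EOF after " + total + " bytes");
            }
            total += n;
        }
    }
}
```

A caller would pass the array Lucene allocated as dest, with the offset and length it was asked for; only the size of each individual read() request changes.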