Subject: Re: is this the right way to go?
From: Toke Eskildsen
Reply-To: te@statsbiblioteket.dk
To: "java-user@lucene.apache.org"
Organization: State and University Library, Denmark
Date: Tue, 15 Jun 2010 09:56:01 +0200

On Thu, 2010-06-10 at 04:03 +0200, fujian wrote:
> Another thing is about unique. I thought it was unique "field value". If it
> means unique term, for English even loading all around 300,000 terms it
> won't take much memory, right? (Suppose the average length of term is 10,
> the total memory usage is 10*300,000=3MB)

It is only the unique field values, but remember that there is also an
array of length #docs with pointers to the strings, which takes up 4 or 8
bytes per pointer, depending on whether the JVM is 32-bit or 64-bit.

Furthermore, the current Lucene uses Strings, which take up a lot more
than just #chars bytes: 300,000 Strings with an average length of 10 chars
take up about 18MB.
http://www.javamex.com/tutorials/memory/string_memory_usage.shtml

I'm quietly hacking on a solution for this, but the current code is
still at the proof-of-concept stage and way too flaky to use in
production:
https://issues.apache.org/jira/browse/LUCENE-2369
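
For anyone who wants to redo the arithmetic above, here is a rough sketch. It assumes the per-String formula from the javamex page (which, as far as I know, describes a 32-bit HotSpot JVM of that era), and the class name, method names and the 1,000,000 document count are made up for illustration. Treat the result as a ballpark, not a measurement.

```java
// Back-of-the-envelope estimate only. The per-String formula is assumed
// from the javamex page linked above; other JVMs and versions will differ.
final class StringFieldMemoryEstimate {

    /** Approximate heap bytes for one String with the given number of chars. */
    static long stringBytes(int chars) {
        // 2 bytes per char plus roughly 45 bytes of object overhead,
        // rounded down to a multiple of 8 (integer division).
        return 8L * ((chars * 2L + 45) / 8);
    }

    /** Strings for the unique values plus one pointer per document. */
    static long totalBytes(long uniqueValues, int avgChars, long docs, int pointerBytes) {
        return uniqueValues * stringBytes(avgChars) + docs * pointerBytes;
    }

    public static void main(String[] args) {
        // 300,000 unique values of 10 chars each, as in the question.
        long strings = 300_000L * stringBytes(10);
        // Hypothetical index of 1,000,000 documents on a 32-bit JVM (4-byte pointers).
        long total = totalBytes(300_000L, 10, 1_000_000L, 4);
        System.out.println("Strings only: " + strings + " bytes");
        System.out.println("Strings + pointer array: " + total + " bytes");
    }
}
```

With these assumptions each 10-char String costs 64 bytes, so the Strings alone come to 19,200,000 bytes (a little over 18 MiB), and the pointer array adds another 4 or 8 bytes per document on top of that.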