Subject: Re: is this the right way to go?
From: Toke Eskildsen <te@statsbiblioteket.dk>
Reply-To: te@statsbiblioteket.dk
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Organization: State and University Library, Denmark
Date: Tue, 15 Jun 2010 09:56:01 +0200

On Thu, 2010-06-10 at 04:03 +0200, fujian wrote:
> Another thing is about unique. I thought it was unique "field value". If it
> means unique term, for English even loading all around 300,000 terms it
> won't take much memory, right? (Suppose the average length of term is 10,
> the total memory usage is 10*300,000=3MB)

It is only the unique field values, but remember that there is also an
array of length #docs with pointers to the strings, which takes up 4 or 8
bytes/pointer depending on whether the JVM is 32-bit or 64-bit. Furthermore,
the current Lucene uses Strings, which take up a lot more than just #chars
bytes: 300,000 Strings of average length 10 chars take up about 18MB.
http://www.javamex.com/tutorials/memory/string_memory_usage.shtml
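
For anyone who wants to redo the arithmetic, here is a rough sketch. It
assumes the per-String rule of thumb that, as far as I recall, the page
above gives for 32-bit JVMs: bytes = 8 * (int)((chars * 2 + 45) / 8).
The class name, the method names and the 1,000,000-document figure are
made up for illustration. Real numbers depend on the JVM and its settings,
so treat the result as an estimate only.

```java
import java.util.Locale;

public class FieldCacheEstimate {

    // Assumed rule of thumb for the minimum heap use of one String on a
    // 32-bit JVM: object headers and fields plus 2 bytes per char,
    // in steps of 8 bytes. This is an approximation, not a measurement.
    static long stringBytes(int chars) {
        return 8L * ((chars * 2L + 45) / 8);
    }

    // One pointer per document: 4 bytes on a 32-bit JVM, 8 on a 64-bit JVM.
    static long pointerArrayBytes(long numDocs, boolean is64Bit) {
        return numDocs * (is64Bit ? 8L : 4L);
    }

    // Unique values stored as Strings plus the per-document pointer array.
    static long totalBytes(long uniqueValues, int avgChars,
                           long numDocs, boolean is64Bit) {
        return uniqueValues * stringBytes(avgChars)
                + pointerArrayBytes(numDocs, is64Bit);
    }

    public static void main(String[] args) {
        double mib = 1024.0 * 1024.0;

        // 300,000 unique values of average length 10 chars.
        long strings = 300_000L * stringBytes(10);
        System.out.printf(Locale.ROOT, "strings: %.1f MiB%n", strings / mib);

        // Hypothetical index size of 1,000,000 documents, 32-bit JVM.
        long total = totalBytes(300_000L, 10, 1_000_000L, false);
        System.out.printf(Locale.ROOT, "total:   %.1f MiB%n", total / mib);
    }
}
```

With these assumptions each 10-char String costs 64 bytes, so the Strings
alone come to roughly 18MB, and the pointer array comes on top of that.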


I'm quietly hacking on a solution for this, but the current code is
still at the proof-of-concept stage and way too flaky to use in
production: https://issues.apache.org/jira/browse/LUCENE-2369

