From: Karl Wettin
Subject: Re: OutOfMemoryError tokenizing a boring text file
Date: Sat, 1 Sep 2007 22:00:25 +0200
To: java-user@lucene.apache.org
Message-Id: <2F38A9BE-413A-4E1A-93D7-99D59B237380@gmail.com>

I believe the problem is that the text value is not the only data
associated with a token; there is, for instance, the position offset.
Depending on your JVM, each instance reference consumes 64 bits or so,
so even if the text value is flyweighted by String.intern() there is
still a cost for every token.

I doubt that a document is flushed to the segment before a field's
token stream has been exhausted.

-- 
karl

On 1 Sep 2007, at 21:50, Askar Zaidi wrote:

> I have indexed around 100 M of data with 512M to the JVM heap. So
> that gives you an idea. If every token is the same word in one file,
> shouldn't the tokenizer recognize that?
>
> Try using Luke. That helps solve lots of issues.
>
> -
> AZ
>
> On 9/1/07, Erick Erickson wrote:
>>
>> I can't answer the question of why the same token
>> takes up memory, but I've indexed far more than
>> 20M of data in a single document field. As in on the
>> order of 150M. Of course I allocated 1G or so to the
>> JVM, so you might try that....
>>
>> Best
>> Erick
>>
>> On 8/31/07, Per Lindberg wrote:
>>>
>>> I'm creating a tokenized "content" Field from a plain text file
>>> using an InputStreamReader and new Field("content", in);
>>>
>>> The text file is large, 20 MB, and contains zillions of lines,
>>> each with the same 100-character token.
>>>
>>> That causes an OutOfMemoryError.
>>>
>>> Given that all tokens are the *same*,
>>> why should this cause an OutOfMemoryError?
>>> Shouldn't StandardAnalyzer just chug along
>>> and just note "ho hum, this token is the same"?
>>> That shouldn't take too much memory.
>>>
>>> Or have I missed something?
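
To illustrate the point about per-token cost, here is a minimal sketch. It assumes a hypothetical Tok class that holds a text reference plus start and end offsets; it is not Lucene's Token class, and the names TokenCostSketch, tokenize and distinctTextInstances are made up for the example. It only shows that sharing the text through String.intern() still leaves one object per occurrence; it does not claim to reproduce what the indexer actually buffers.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class TokenCostSketch {

    /** Hypothetical stand-in for a token; NOT Lucene's Token class. */
    static final class Tok {
        final String text;     // reference to the (shared) term text
        final int startOffset; // per-occurrence data, differs for every token
        final int endOffset;

        Tok(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Builds one Tok per line; every Tok points at the same interned String. */
    static List<Tok> tokenize(String term, int lines) {
        String shared = term.intern();
        List<Tok> toks = new ArrayList<>(lines);
        int offset = 0;
        for (int i = 0; i < lines; i++) {
            toks.add(new Tok(shared, offset, offset + shared.length()));
            offset += shared.length() + 1; // +1 for the line break
        }
        return toks;
    }

    /** Counts distinct String instances (by identity) behind the tokens. */
    static int distinctTextInstances(List<Tok> toks) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (Tok t : toks) {
            seen.put(t.text, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // A 100-character term repeated on many lines, as in the original report.
        String term = new String(new char[100]).replace('\0', 'x');
        List<Tok> toks = tokenize(term, 200_000);

        // The text is stored once, but there is still one Tok per occurrence.
        // The size of each Tok (header, reference, two ints) depends on the JVM.
        System.out.println("token objects:         " + toks.size());
        System.out.println("distinct text strings: " + distinctTextInstances(toks));
    }
}
```

The token count grows with the number of lines while the number of distinct strings does not, so anything that keeps all of a field's tokens in memory grows with the input even when every token is identical.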

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org