From: "Erick Erickson"
To: java-user@lucene.apache.org
Date: Sat, 1 Sep 2007 15:06:55 -0400
Subject: Re: OutOfMemoryError tokenizing a boring text file
Message-ID: <359a92830709011206o34a73094w4598d435c99b352b@mail.gmail.com>
In-Reply-To: <20070831141729.0BCBE72494C@athena.apache.org>

I can't answer the question of why the same token takes up memory, but
I've indexed far more than 20 MB of data in a single document field:
on the order of 150 MB. Of course, I allocated 1 GB or so to the JVM,
so you might try that.

Best
Erick

On 8/31/07, Per Lindberg wrote:
>
> I'm creating a tokenized "content" Field from a plain text file
> using an InputStreamReader and new Field("content", in);
>
> The text file is large, 20 MB, and contains zillions of lines,
> each with the same 100-character token.
>
> That causes an OutOfMemoryError.
>
> Given that all tokens are the *same*,
> why should this cause an OutOfMemoryError?
> Shouldn't StandardAnalyzer just chug along
> and just note "ho hum, this token is the same"?
> That shouldn't take too much memory.
>
> Or have I missed something?
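Before raising the heap as suggested above, it can help to confirm how much memory the JVM is actually allowed to use. The following is a minimal sketch using only the standard library; the class name HeapCheck and its helper methods are hypothetical names chosen for illustration, and nothing here is part of Lucene.

```java
public class HeapCheck {
    // Bytes per megabyte, used to report the heap ceiling in MB.
    public static final long MB = 1024L * 1024L;

    /** Returns the maximum heap the JVM will attempt to use, in MB. */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Returns true if the heap ceiling is at least requiredMb megabytes. */
    public static boolean heapAtLeast(long requiredMb) {
        return maxHeapMb() >= requiredMb;
    }

    public static void main(String[] args) {
        // Optional first argument: the heap size (in MB) you expect to need.
        // The reported ceiling can be slightly below the -Xmx value,
        // so leave some slack when choosing a threshold.
        System.out.println("Max heap: " + maxHeapMb() + " MB");
        if (args.length > 0) {
            long required = Long.parseLong(args[0]);
            if (!heapAtLeast(required)) {
                System.out.println("Heap ceiling is below " + required
                        + " MB; consider a larger -Xmx setting.");
            }
        }
    }
}
```

Running it as `java -Xmx1g HeapCheck 900` should report a ceiling close to 1 GB. If the number printed is much smaller, the -Xmx setting is probably not reaching the JVM that runs the indexer.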