lucene-java-user mailing list archives

From "Askar Zaidi" <askar.za...@gmail.com>
Subject Re: OutOfMemoryError tokenizing a boring text file
Date Sat, 01 Sep 2007 19:50:01 GMT
I have indexed around 100 MB of data with 512 MB of JVM heap, so that gives
you an idea. If every token is the same word in one file, shouldn't the
tokenizer recognize that?

Try using Luke. It helps with a lot of issues like this.

-
AZ

On 9/1/07, Erick Erickson <erickerickson@gmail.com> wrote:
>
> I can't answer the question of why the same token
> takes up memory, but I've indexed far more than
> 20 MB of data in a single document field, on the
> order of 150 MB. Of course, I allocated 1 GB or so to the
> JVM, so you might try that....
>
> Best
> Erick
>
> On 8/31/07, Per Lindberg <per@implior.com> wrote:
> >
> > I'm creating a tokenized "content" Field from a plain text file
> > using an InputStreamReader and new Field("content", in);
> >
> > The text file is large, 20 MB, and contains zillions of lines,
> > each with the same 100-character token.
> >
> > That causes an OutOfMemoryError.
> >
> > Given that all tokens are the *same*,
> > why should this cause an OutOfMemoryError?
> > Shouldn't StandardAnalyzer just chug along
> > and note "ho hum, this token is the same"?
> > That shouldn't take too much memory.
> >
> > Or have I missed something?
> >
>
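
For reference, a minimal sketch of the kind of setup Per describes, written
against the Lucene 2.x API of the time. The file paths, the UTF-8 charset,
and the commented-out setMaxFieldLength call are illustrative assumptions on
my part, not details from his mail:

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class LargeTextFileIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical locations; substitute your own.
            String indexDir = "/tmp/test-index";
            String textFile = "/tmp/large-file.txt";

            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);

            // IndexWriter indexes only the first 10,000 tokens of a field by
            // default; raise the limit only if you really want the whole
            // 20 MB file tokenized (which is also where memory use grows).
            // writer.setMaxFieldLength(Integer.MAX_VALUE);

            Reader in = new InputStreamReader(new FileInputStream(textFile), "UTF-8");

            Document doc = new Document();
            // Reader-valued field: tokenized and indexed, but not stored
            // (the same constructor Per mentions).
            doc.add(new Field("content", in));
            writer.addDocument(doc);

            writer.optimize();
            writer.close();
            in.close();
        }
    }

One general point worth noting: identical tokens are not folded into a single
in-memory entry while indexing; each occurrence still gets its own position in
the postings, so a field with millions of copies of the same token carries a
per-occurrence cost. Running the JVM with a bigger heap (e.g. -Xmx1g, as Erick
suggests) is the quickest way to test whether that is the limit being hit.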
