lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: OutOfMemoryError tokenizing a boring text file
Date Sat, 01 Sep 2007 19:06:55 GMT
I can't answer why the same token repeated over and
over still takes up memory, but I've indexed far more
than 20M of data in a single document field, as in on
the order of 150M. Of course I allocated 1G or so to
the JVM, so you might try that....

Best
Erick
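
A minimal sketch of the setup described below, assuming the
Lucene 2.x-era API this thread is using; the file name, index
path, encoding, and class name are made up for illustration:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexBigTextFile {
    public static void main(String[] args) throws Exception {
        // Tokenized, unstored field backed by a Reader; the file is
        // streamed and tokenized when the document is added.
        Reader in = new InputStreamReader(new FileInputStream("big.txt"), "UTF-8");

        Document doc = new Document();
        doc.add(new Field("content", in));

        IndexWriter writer = new IndexWriter("/tmp/big-index",
                new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();
        in.close();
    }
}

Run it with a larger heap along the lines Erick suggests,
e.g. java -Xmx1g IndexBigTextFile.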

On 8/31/07, Per Lindberg <per@implior.com> wrote:
>
> I'm creating a tokenized "content" Field from a plain text file
> using an InputStreamReader and new Field("content", in);
>
> The text file is large, 20 MB, and contains zillions of lines,
> each with the same 100-character token.
>
> That causes an OutOfMemoryError.
>
> Given that all tokens are the *same*,
> why should this cause an OutOfMemoryError?
> Shouldn't StandardAnalyzer just chug along
> and note "ho hum, this token is the same"?
> That shouldn't take too much memory.
>
> Or have I missed something?
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
