lucene-java-user mailing list archives

From "Per Lindberg"
Subject Re: Re: OutOfMemoryError tokenizing a boring text file
Date Tue, 11 Sep 2007 13:28:12 GMT

> From: Chris Hostetter

> : Setting writer.setMaxFieldLength(5000) (default is 10000)
> : seems to eliminate the risk for an OutOfMemoryError,
> that's because it now gives up after parsing 5000 tokens.
> : To me, it appears that simply calling
> :    new Field("content", new InputStreamReader(in, "ISO-8859-1"))
> : on a plain text file causes Lucene to buffer it *all*.
> Looking at this purely from an outside-in perspective: how could that
> be true?  If it were, then why would calling setMaxFieldLength(5000)
> solve your problem? Limiting the number of tokens wouldn't matter if the
> problem occurred because Lucene was buffering the entire reader.
> It definitely seems like there is some room for improvement here ... it
> sounds almost like maybe there is a [HAND WAVEY AIR QUOTES] memory/object
> leakish [/HAND WAVEY AIR QUOTES] situation where even after a Token is
> read off the TokenStream the Token isn't being GCed.
> Per: perhaps you could open a Jira issue and attach a unit test
> demonstrating the problem?  Maybe something with an artificial Reader that
> just churns out a repeating sequence of characters forever?
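
For anyone wanting to try the kind of test suggested above, here is a
minimal sketch of such an artificial Reader. It is not the test that was
actually used in this thread. The class name RepeatingReader and the
limit parameter are hypothetical; the limit is an assumption added so
that a test can terminate, and passing Long.MAX_VALUE makes the stream
effectively endless.

```java
import java.io.Reader;

/** Hypothetical helper: repeats a fixed pattern until a character limit is reached. */
class RepeatingReader extends Reader {
    private final char[] pattern;
    private final long limit;   // total chars to emit (assumption: added so tests can end)
    private long emitted = 0;   // chars handed out so far

    RepeatingReader(String pattern, long limit) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        this.pattern = pattern.toCharArray();
        this.limit = limit;
    }

    @Override
    public int read(char[] buf, int off, int len) {
        if (emitted >= limit) {
            return -1;          // end of stream once the limit is reached
        }
        int n = (int) Math.min(len, limit - emitted);
        for (int i = 0; i < n; i++) {
            // Continue the pattern where the previous call stopped.
            buf[off + i] = pattern[(int) ((emitted + i) % pattern.length)];
        }
        emitted += n;
        return n;
    }

    @Override
    public void close() {
        // Nothing to release; all state lives in memory.
    }
}
```

Such a reader could presumably be passed to new Field("content", reader)
in place of the InputStreamReader, with heap usage watched while the
writer consumes it.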

Isolating the problem is exactly what I did. It took some time,
but the memory leak turned out to be somewhere else, not in
Lucene. (Memory leaks are slippery beasts!)

Just wanted to let y'all know.

Thanks and cheers,
