lucene-java-user mailing list archives

From Chris Hostetter
Subject Re: SV: OutOfMemoryError tokenizing a boring text file
Date Mon, 03 Sep 2007 17:22:48 GMT

: Setting writer.setMaxFieldLength(5000) (default is 10000)
: seems to eliminate the risk for an OutOfMemoryError,

that's because it now gives up after parsing 5000 tokens.
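As a rough illustration of what a token cap means, here is a sketch in plain Java. It is not Lucene's indexing code: the class TokenCapSketch, the method countTokens and the whitespace tokenization are all made up for the example. The point is only that the loop stops pulling from the Reader once the cap is hit, so the rest of the input is never consumed.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

// Hypothetical sketch, not Lucene code: consume at most maxTokens
// whitespace-separated tokens from a Reader, then stop reading.
final class TokenCapSketch {

    static int countTokens(Reader in, int maxTokens) {
        int count = 0;
        Scanner scanner = new Scanner(in);
        // The cap is checked first, so no further input is requested
        // once maxTokens tokens have been seen.
        while (count < maxTokens && scanner.hasNext()) {
            scanner.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Reader in = new StringReader("one two three four five");
        System.out.println(countTokens(in, 3));
    }
}
```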

: To me, it appears that simply calling
:    new Field("content", new InputStreamReader(in, "ISO-8859-1"))
: on a plain text file causes Lucene to buffer it *all*.

Looking at this purely from an outside-in perspective: how could that
be true?  If it were, then why would calling setMaxFieldLength(5000) 
solve your problem -- limiting the number of tokens wouldn't matter if the 
problem occurred because Lucene was buffering the entire reader.

It definitely seems like there is some room for improvement here ... it 
sounds almost like maybe there is a [HAND WAVEY AIR QUOTES] memory/object 
leakish [/HAND WAVEY AIR QUOTES] situation where even after a Token is 
read off the TokenStream the Token isn't being GCed.

Per: perhaps you could open a Jira issue and attach a unit test 
demonstrating the problem?  maybe something with an artificial Reader that 
just churns out a repeating sequence of characters forever?
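
Something along these lines might do as the artificial Reader. This is only a sketch: the class name RepeatingReader and the charsRead() counter are invented for illustration, and it uses nothing beyond java.io.Reader. It never returns -1, so whatever consumes it has to decide on its own when to stop.

```java
import java.io.Reader;

// Hypothetical test helper: a Reader that repeats a fixed pattern
// forever and never signals end of stream.
final class RepeatingReader extends Reader {
    private final char[] pattern;
    private long position = 0;   // total characters handed out so far

    RepeatingReader(String pattern) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        this.pattern = pattern.toCharArray();
    }

    @Override
    public int read(char[] buf, int off, int len) {
        // Always fill the requested range by cycling through the pattern.
        for (int i = 0; i < len; i++) {
            buf[off + i] = pattern[(int) ((position + i) % pattern.length)];
        }
        position += len;
        return len;   // never -1, so the stream never ends
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    // How many characters the consumer has pulled so far.
    long charsRead() {
        return position;
    }
}
```

In a unit test this could be passed as the Reader in a call like the new Field("content", ...) quoted above, with setMaxFieldLength bounding the number of tokens. If memory use still grows with the field length limit, that would point at tokens being retained rather than at the Reader being buffered.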

