lucene-java-user mailing list archives

From "Per Lindberg" <...@implior.com>
Subject Re: OutOfMemoryError tokenizing a boring text file
Date Mon, 03 Sep 2007 13:04:03 GMT
Aha, that's interesting. However...

Setting writer.setMaxFieldLength(5000) (the default is 10000)
seems to eliminate the risk of an OutOfMemoryError,
even with a JVM capped at 64 MB of memory.
(I have tried larger values for the JVM memory limit, too.)
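
For reference, a minimal sketch of how I apply that limit (against the
Lucene 2.x API; the index path and analyzer choice are only illustrative,
and exception handling is omitted):

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.index.IndexWriter;

   // "/tmp/index" is an illustrative path.
   IndexWriter writer =
       new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
   writer.setMaxFieldLength(5000);  // cap on terms indexed per field (default 10000)
   // ... add documents here, then writer.close();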
 
(The name is, IMHO, slightly misleading; I would have
called it setMaxFieldTerms or something like that.)

Still, 64 bits x 10000 is only about 78 KB. I can't see why that should
eat up 64 MB, unless the 100-char tokens are also multiplied.

(The 20 MB text file contains roughly 200 000 copies of the
same 100-character string.)

To me, it appears that simply calling

   new Field("content", new InputStreamReader(in, "ISO-8859-1"))

on a plain text file causes Lucene to buffer it *all*.
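
Here is a sketch of the indexing path I mean (the file name is hypothetical,
"writer" is the IndexWriter from the sketch above, and the Field(String, Reader)
constructor is the one I actually use):

   import java.io.FileInputStream;
   import java.io.InputStreamReader;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;

   // "big-boring.txt" stands in for the ~20 MB plain text file.
   FileInputStream in = new FileInputStream("big-boring.txt");
   Document doc = new Document();
   // Field(String, Reader): tokenized, not stored; the Reader is read
   // when the document is added to the index.
   doc.add(new Field("content", new InputStreamReader(in, "ISO-8859-1")));
   writer.addDocument(doc);  // the OutOfMemoryError shows up while this document is tokenized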


> -----Original Message-----
> From: Karl Wettin [mailto:karl.wettin@gmail.com] 
> Sent: 1 September 2007 22:00
> To: java-user@lucene.apache.org
> Subject: Re: OutOfMemoryError tokenizing a boring text file
> 
> I believe the problem is that the text value is not the only data  
> associated with a token; there is, for instance, the position offset.  
> Depending on your JVM, each instance reference consumes 64 bits or so,  
> so even if the text value is flyweighted by String.intern() there is  
> a cost. I doubt that a document is flushed to the segment before a  
> field's token stream has been exhausted.
> 
> -- 
> karl
> 
> 
> On 1 Sep 2007, at 21:50, Askar Zaidi wrote:
> 
> > I have indexed around 100 MB of data with 512 MB of JVM heap, so that
> > gives you an idea. If every token is the same word in one file,
> > shouldn't the tokenizer recognize that?
> >
> > Try using Luke. That helps solve lots of issues.
> >
> > -
> > AZ
> >
> > On 9/1/07, Erick Erickson <erickerickson@gmail.com> wrote:
> >>
> >> I can't answer the question of why the same token
> >> takes up memory, but I've indexed far more than
> >> 20 MB of data in a single document field, on the
> >> order of 150 MB. Of course, I allocated 1 GB or so to the
> >> JVM, so you might try that....
> >>
> >> Best
> >> Erick
> >>
> >> On 8/31/07, Per Lindberg <per@implior.com> wrote:
> >>>
> >>> I'm creating a tokenized "content" Field from a plain text file
> >>> using an InputStreamReader and new Field("content", in);
> >>>
> >>> The text file is large, 20 MB, and contains zillions of lines,
> >>> each with the same 100-character token.
> >>>
> >>> That causes an OutOfMemoryError.
> >>>
> >>> Given that all tokens are the *same*,
> >>> why should this cause an OutOfMemoryError?
> >>> Shouldn't StandardAnalyzer just chug along
> >>> and note "ho hum, this token is the same"?
> >>> That shouldn't take too much memory.
> >>>
> >>> Or have I missed something?
> >>>
> >>>
> >>>
> >>>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

