lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ben Litchfield" <...@benlitchfield.com>
Subject Re: Out of memory error
Date Thu, 13 Jul 2006 15:15:27 GMT
By 300MG I assume you mean 300MB.

You can also try extracting the text outside of lucene by using a 
PDFBox command line app.  

java org.pdfbox.ExtractText <pdffile>

you may need to increase the JRE memory like this

java -Xmx512m .pdfbox.ExtractText <pdffile>

OR

java -Xmx1024m .pdfbox.ExtractText <pdffile>


If this is still giving you an out of memory error then it is possibly 
an issue with PDFBox, if that is the case then please create an issue 
and attach/upload the PDF on the PDFBox site.


Ben



> Thanks.
> 
> I am using the getText(PDDocument) method of the PDFTextStripper. I 
will 
> try the other suggestion.
> 
> suba suresh.
> 
> Rob Staveley (Tom) wrote:
> > If you are using
> > 
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getTe
xt(o
> > rg.pdfbox.pdmodel.PDDocument), you are going to get a large String 
and may
> > need a 1G heap. 
> > 
> > If, however, you are using
> > 
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#write
Text
> > (org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a 
temporary
> > file, you will not need so much RAM, but you need to use
> > 
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.
html
> > #Field(java.lang.String,%20java.io.Reader) to construct your Lucene 
field
> > (rather than
> > 
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.
html
> > #Field(java.lang.String,%20java.lang.String,%
20org.apache.lucene.document.Fi
> > eld.Store,%20org.apache.lucene.document.Field.Index)).
> > 
> > -----Original Message-----
> > From: Suba Suresh [mailto:subas@wolfram.com] 
> > Sent: 13 July 2006 14:55
> > To: java-user@lucene.apache.org
> > Subject: Out of memory error
> > 
> > I am indexing different document formats with lucene 1.9. One of 
the pdf
> > file I am indexing is 300MG. Whenever the index writer hits that 
file it
> > stops the indexing with "Out of Memory" exception. I am using the 
pdf box
> > library to index. I have set the following merge factors in my code.
> > 
> > writer.setMergeFactor(1000);
> > writer.setMaxMergeDocs(9999999);
> > writer.setMaxBufferedDocs(1000);
> > writer.setMaxFieldLength(Integer.MAX_VALUE);
> > 
> > I would like any help and suggestions.
> > 
> > thanks,
> > suba suresh.
> > 
> > --------------------------------------------------------------------
-
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message