pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: PDFBOX-948
Date Thu, 01 Dec 2011 10:27:16 GMT
On 25.11.2011 18:30, P Williams wrote:
> Hi All,
>
> It appears that I am the victim of the caveat stated in
> PDFBOX-948 <https://issues.apache.org/jira/browse/PDFBOX-948>:
>
> *"For normal sized PDFs files, the in-memory implementation
> RandomAccessBuffer should not increase the memory usage too much, while
> providing faster IO as all access operations are only memory copies.
>
> Therefore, please consider switching the default to in-memory scratch
> buffers. Users with very large files can still pass a temporary directory."*
>
> I'm using the apache-solr-4.0-2011-10-14_08-56-59 snapshot, which uses Tika
> 0.10, which in turn uses PDFBox 1.6.0, and I am getting the following
> out-of-memory error:
>
> Caused by: java.lang.OutOfMemoryError: Java heap space
>          at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
>          at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
>          at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
>          at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>          at java.io.BufferedOutputStream.write(Unknown Source)
>          at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:294)
>          at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:391)
>          at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:363)
>          at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:337)
>          at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:177)
>          at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:257)
>          at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1325)
>          at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:796)
>          at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:84)
>          at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>          at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>          at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
>          at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
>          at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:671)
>
> At which stage should I be able to pass a temporary directory?  Would it
> need to be a Tika configuration?  Or does something need to change in
> PDFBox to even enable this option?
There are two possible ways to provide a temp directory/scratch file:

- pass an instance of RandomAccessFile to the PDDocument.load method you are
using (I'm not familiar with the Tika details, but I guess that is the one it uses)
- if the PDFParser is used directly, you should either pass an instance of
RandomAccessFile to the constructor or set the temp directory using setTempDirectory
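
To illustrate why a scratch file helps, here is a minimal sketch that uses only
JDK classes, not PDFBox itself. The class ScratchFileSketch and the method
fillScratchFile are hypothetical names made up for this example. It writes data
to a temp file in small chunks, so the heap only ever holds one 8 KB chunk
instead of a buffer that grows with the document, which is the growth that
fails in RandomAccessBuffer.expandBuffer in the trace above.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical illustration: a scratch buffer kept in a temp file, not on the heap. */
public class ScratchFileSketch {

    /**
     * Writes totalBytes to a file-backed scratch buffer in 8 KB chunks
     * and returns the final length of the scratch file.
     */
    static long fillScratchFile(File tempDir, long totalBytes) throws IOException {
        File scratch = File.createTempFile("scratch", ".tmp", tempDir);
        try (RandomAccessFile raf = new RandomAccessFile(scratch, "rw")) {
            byte[] chunk = new byte[8192]; // the only buffer held in memory
            long written = 0;
            while (written < totalBytes) {
                int n = (int) Math.min(chunk.length, totalBytes - written);
                raf.write(chunk, 0, n); // data lands on disk, not in a growing array
                written += n;
            }
            return raf.length();
        } finally {
            scratch.delete(); // scratch data is only needed while processing
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the JVM's default temp directory is writable.
        File tempDir = new File(System.getProperty("java.io.tmpdir"));
        long size = fillScratchFile(tempDir, 4L * 1024 * 1024);
        System.out.println("scratch file held " + size + " bytes");
    }
}
```

In PDFBox the corresponding step would be handing its own RandomAccessFile to
PDDocument.load or to the PDFParser constructor, as described in the two
options above; the sketch does not exercise those calls.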

> I have been eager to get Tika 0.10 with Solr because it fixes the full-text
> garbling issue tracked in TIKA-611. Unfortunately, it has introduced new
> issues with my indexing workflow that appear to originate in PDFBox 1.6.0.
>
> Thanks,
> Tricia


BR
Andreas Lehmkühler

