pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Eyebrow-raising memory consumption exporting PDXObjectImages in PDFBox 1.8
Date Sun, 25 May 2014 14:25:28 GMT
On 24.05.2014 12:23, Tilman Hausherr wrote:
> I looked at the source code of 1.8.5: image.write2OutputStream() results in the
> BufferedImage object being kept in memory for as long as the "PDXObjectImage
> image" object exists. So the question is whether these become orphans after a
> page is done, or not.
>
> The "resources" object, is it the one of a single page, or of the whole PDF file?
IMHO it has to be an issue with the Tika code. I ran a quick test using the
ExtractImages class with default memory settings (OpenJDK 1.7.0_55, 64-bit, on
Fedora 20). It extracted 2750 images without an OOM.
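
For illustration only, the retention pattern Tilman describes above can be sketched in plain Java. Nothing here is PDFBox code: FakeImage, writeTo(), dropCache() and the other names are hypothetical stand-ins, and the byte array merely plays the role of the cached BufferedImage. The sketch shows that if every image object stays reachable through a resources map, every decoded buffer stays reachable too, unless the cache is released or the objects are orphaned.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in classes; NOT part of PDFBox or Tika.
class CachedImageSketch {

    static final class FakeImage {
        private final int decodedSize;
        private byte[] decoded; // plays the role of the cached BufferedImage

        FakeImage(int decodedSize) {
            this.decodedSize = decodedSize;
        }

        // Decodes lazily and keeps the result for as long as this object lives.
        void writeTo(OutputStream os) throws IOException {
            if (decoded == null) {
                decoded = new byte[decodedSize];
            }
            os.write(decoded);
        }

        // Assumed release hook, added here only to show the contrast.
        void dropCache() {
            decoded = null;
        }

        int retainedBytes() {
            return decoded == null ? 0 : decoded.length;
        }
    }

    // Builds a map that keeps every image reachable, like a resources dictionary.
    static Map<String, FakeImage> makeResources(int count, int size) {
        Map<String, FakeImage> resources = new LinkedHashMap<String, FakeImage>();
        for (int i = 0; i < count; i++) {
            resources.put("Im" + i, new FakeImage(size));
        }
        return resources;
    }

    // Writes every image once and returns the bytes still held by the map afterwards.
    static long exportAll(Map<String, FakeImage> resources, boolean dropAfterWrite) {
        for (FakeImage image : resources.values()) {
            try (OutputStream os = new ByteArrayOutputStream()) {
                image.writeTo(os);
            } catch (IOException e) {
                throw new IllegalStateException(e);
            }
            if (dropAfterWrite) {
                image.dropCache();
            }
        }
        long retained = 0;
        for (FakeImage image : resources.values()) {
            retained += image.retainedBytes();
        }
        return retained;
    }

    public static void main(String[] args) {
        // Caches kept: all 100 buffers of 1024 bytes remain reachable.
        System.out.println(exportAll(makeResources(100, 1024), false)); // prints 102400
        // Caches dropped after each write: nothing remains.
        System.out.println(exportAll(makeResources(100, 1024), true));  // prints 0
    }
}
```

Whether the real objects behave like the first or the second call depends on how long the resources object itself stays reachable, which is exactly the question quoted above.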

> Tilman
>
>
> On 23.05.2014 18:18, Allison, Timothy B. wrote:
>> I get an OOM when trying to write the embedded images to disk with straight
>> PDFBox (no Tika) with -Xmx2g (tested on Java 1.7).  My write method is per
>> image; I'm not caching anything in memory.
>>
>> <snippet>
>>      for (PDXObject object : resources.getXObjects().values()) {
>>        if (object instanceof PDXObjectForm) {
>>          // Form XObjects carry their own resources, so recurse into them.
>>          extractImages(((PDXObjectForm) object).getResources());
>>        } else if (object instanceof PDXObjectImage) {
>>          PDXObjectImage image = (PDXObjectImage) object;
>>
>>          OutputStream os = new FileOutputStream(
>>              new File(outDir, total + "." + image.getSuffix()));
>>          try {
>>            image.write2OutputStream(os);
>>            os.flush();
>>          } finally {
>>            // Close the stream even if writing the image fails.
>>            os.close();
>>          }
>>        }
>>      }
>> </snippet>
>

BR
Andreas Lehmkühler
