pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Eyebrow-raising memory consumption exporting PDXObjectImages in PDFBox 1.8
Date Sat, 24 May 2014 06:05:15 GMT
Hi Tim,

I'd recommend opening new issues anyway, especially for the second part.
Please mention ONE file that is corrupt, attach it, and include the code
you are using.

Note that you will need jai_imageio to write TIFF images.
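
A quick way to see whether a TIFF writer is registered with ImageIO on a given JVM is sketched below. It uses only the standard javax.imageio API; the class and method names (TiffWriterCheck, hasWriter) are made up for illustration. On the Java 7 setup discussed in this thread a TIFF writer would normally only show up once jai_imageio is on the classpath; newer JDKs (9 and later) ship their own TIFF plugin, so the result depends on the runtime.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;

// Hypothetical helper, not part of PDFBox: reports whether ImageIO
// has at least one writer registered for a given format name.
public class TiffWriterCheck {

    static boolean hasWriter(String formatName) {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName(formatName);
        return writers.hasNext();
    }

    public static void main(String[] args) {
        // The result for "tiff" depends on the JVM and on whether
        // jai_imageio (or another TIFF plugin) is on the classpath.
        System.out.println("tiff writer available: " + hasWriter("tiff"));
        System.out.println("png writer available:  " + hasWriter("png"));
    }
}
```

If the first line reports false, writing TIFF output through ImageIO will not work until a TIFF plugin is added.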

Tilman

On 23.05.2014 18:18, Allison, Timothy B. wrote:
> All,
>    Over on Tika, we recently added the ability to export PDXObjectImages (TIKA-1268),
> as we do now with regular attachments.  Some users have noticed some eyebrow-raising
> memory consumption with some files after we made the change.  We're currently using
> PDFBox 1.8.5.
>
> This 4MB file shows the issue:
> http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf
>
> I get an OOM when trying to write the embedded images to disk with straight PDFBox
> (no Tika) with -Xmx2g (tested on Java 1.7).  My write method is per image; I'm not
> caching anything in memory.
>
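> <snippet>
>      // "total" is assumed to be a running image counter declared outside this snippet.
>      for (PDXObject object : resources.getXObjects().values()) {
>        if (object instanceof PDXObjectForm) {
>          // Form XObjects can contain nested images, so recurse into their resources.
>          extractImages(((PDXObjectForm) object).getResources());
>        } else if (object instanceof PDXObjectImage) {
>          PDXObjectImage image = (PDXObjectImage) object;
>
>          OutputStream os = new FileOutputStream(new File(outDir, total+"."+image.getSuffix()));
>          try {
>            image.write2OutputStream(os);
>            os.flush();
>          } finally {
>            // Close the stream even if writing the image throws.
>            os.close();
>          }
>        }
>      }
> </snippet>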
>
>
> From some exploration, I think the memory issue is triggered only by JPEG images,
> and only by those that have masks.
>
> Relevant chunks from HPROF:
>
> SITES BEGIN (ordered by live bytes) Fri May 23 12:11:47 2014
>            percent              live               alloc'ed         stack  class
> rank    self   accum      bytes     objs       bytes      objs   trace  name
>    1  59.53%  59.53%  559806264      340   656228560       546  305411  byte[]
>    2  11.44%  70.97%  107538248      298   107538248       298  302226  byte[]
>    3   6.41%  77.37%   60253600     3674    60253600      3674  302153  byte[]
>    4   4.56%  81.93%   42852888       26    42852888        26  304263  byte[]
>    5   3.61%  85.54%   33926664  1413611  2349553848  97898077  305437  byte[]
>    6   3.61%  89.15%   33926664  1413611  2349553848  97898077  305441  byte[]
>
>
> TRACE 305411:
>                  java.awt.image.DataBufferByte.<init>(DataBufferByte.java:76)
>                  java.awt.image.Raster.createInterleavedRaster(Raster.java:266)
>                  java.awt.image.Raster.createInterleavedRaster(Raster.java:212)
>                  java.awt.image.ComponentColorModel.createCompatibleWritableRaster(ComponentColorModel.java:2825)
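
The allocation site in TRACE 305411 is the byte[] behind a DataBufferByte, created through ComponentColorModel.createCompatibleWritableRaster. As a rough illustration of why such buffers get large, the sketch below shows that an 8-bit interleaved raster takes about width x height x bands bytes, however well the JPEG inside the PDF is compressed. The class and method names (RasterFootprint, estimateBytes, allocatedBytes) and the sRGB colour model are assumptions chosen for the example; this is not the actual PDFBox code path.

```java
import java.awt.Transparency;
import java.awt.color.ColorSpace;
import java.awt.image.ComponentColorModel;
import java.awt.image.DataBuffer;
import java.awt.image.DataBufferByte;
import java.awt.image.WritableRaster;

// Hypothetical helper for illustration only.
public class RasterFootprint {

    // Back-of-the-envelope size of an 8-bit interleaved raster.
    static long estimateBytes(int width, int height, int bands) {
        return (long) width * height * bands;
    }

    // Allocates a raster the same way the trace does and reports
    // the length of the backing byte[] (3 bands, no alpha, sRGB assumed).
    static int allocatedBytes(int width, int height) {
        ComponentColorModel cm = new ComponentColorModel(
                ColorSpace.getInstance(ColorSpace.CS_sRGB),
                false, false, Transparency.OPAQUE, DataBuffer.TYPE_BYTE);
        WritableRaster raster = cm.createCompatibleWritableRaster(width, height);
        return ((DataBufferByte) raster.getDataBuffer()).getData().length;
    }

    public static void main(String[] args) {
        System.out.println("estimated: " + estimateBytes(100, 50, 3) + " bytes");
        System.out.println("allocated: " + allocatedBytes(100, 50) + " bytes");
    }
}
```

For scale, the top HPROF row above works out to roughly 1.6 MB per live byte[] (559806264 bytes across 340 objects), which is in the range of a decoded image of modest dimensions.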
>
> I think this might be a memory issue in Java itself, not PDFBox.  However, has anyone
> else seen this?  Is there an open issue in PDFBox or Java for this?  See the P.P.S. for
> testing with PDFBox trunk.
>
>    Thank you.
>
>            Best,
>
>                    Tim
>
> P.S. Same behavior whether I use the old parser or the non-sequential parser.
>
> P.P.S. When I tried roughly the same code with PDFBox trunk, the memory profile was
> gorgeous (130m), but the files were all corrupt (user error?).  With that run there
> were 2,750 files totalling 2.6 GB.
>

