From: Tilman Hausherr
Date: Sat, 24 May 2014 08:05:15 +0200
To: users@pdfbox.apache.org
Subject: Re: Eyebrow-raising memory consumption exporting PDXObjectImages in PDFBox 1.8
Message-ID: <5380369B.6000703@t-online.de>
In-Reply-To: <1D06A081892ADF4589BD83EE24B9DC302593D342@IMCMBX02.MITRE.ORG>

Hi Tim,

I'd recommend opening new issues anyway, especially for the second part. Please mention ONE file that is corrupt, and attach it along with the code you are using.

Note that you will need jai_imageio to write TIFF images.

Tilman

On 23.05.2014 18:18, Allison, Timothy B. wrote:
> All,
> Over on Tika, we recently added the ability to export PDXObjectImages (TIKA-1268), as we do now with regular attachments. Since that change, some users have noticed eyebrow-raising memory consumption with some files. We're currently using PDFBox 1.8.5.
>
> This 4MB file shows the issue:
> http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf
>
> I get an OOM when trying to write the embedded images to disk with straight PDFBox (no Tika) with -Xmx2g (tested on Java 1.7). My write method is per image; I'm not caching anything in memory.
>
>
> for (PDXObject object : resources.getXObjects().values()) {
>     if (object instanceof PDXObjectForm) {
>         extractImages(((PDXObjectForm) object).getResources());
>     } else if (object instanceof PDXObjectImage) {
>         PDXObjectImage image = (PDXObjectImage) object;
>
>         OutputStream os = new FileOutputStream(new File(outDir, total+"."+image.getSuffix()));
>
>         image.write2OutputStream(os);
>         os.flush();
>         os.close();
>     }
> }
>
>
>
> From some exploration, I think the memory issue is only triggered by JPEG images and only by those that have masks.
>
> Relevant chunks from HPROF:
>
> SITES BEGIN (ordered by live bytes) Fri May 23 12:11:47 2014
>            percent              live            alloc'ed  stack class
> rank   self  accum     bytes    objs      bytes     objs  trace name
>    1 59.53% 59.53% 559806264     340  656228560      546 305411 byte[]
>    2 11.44% 70.97% 107538248     298  107538248      298 302226 byte[]
>    3  6.41% 77.37%  60253600    3674   60253600     3674 302153 byte[]
>    4  4.56% 81.93%  42852888      26   42852888       26 304263 byte[]
>    5  3.61% 85.54%  33926664 1413611 2349553848 97898077 305437 byte[]
>    6  3.61% 89.15%  33926664 1413611 2349553848 97898077 305441 byte[]
>
>
> TRACE 305411:
>     java.awt.image.DataBufferByte.<init>(DataBufferByte.java:76)
>     java.awt.image.Raster.createInterleavedRaster(Raster.java:266)
>     java.awt.image.Raster.createInterleavedRaster(Raster.java:212)
>     java.awt.image.ComponentColorModel.createCompatibleWritableRaster(ComponentColorModel.java:2825)
>
> I think this might be a memory issue in Java itself, not PDFBox. However, has anyone else seen this? Is there an open issue in PDFBox or Java for this? See P.P.S. for testing with PDFBox trunk.
>
> Thank you.
>
> Best,
>
> Tim
>
> P.S. Same behavior whether I use the old parser or the non-sequential parser.
>
> P.P.S. When I tried roughly the same code with PDFBox trunk, the memory profile was gorgeous (130m), but the files were all corrupt (user error?). With that run there were 2,750 files totalling 2.6gb.
>
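
For the jai_imageio point above, one way to check whether the running JVM has a TIFF writer registered is to ask ImageIO directly. This is only a sketch: the class name TiffWriterCheck and the helper hasWriterFor are made up for illustration, and it assumes that the jai_imageio plugin, when it is on the classpath, registers itself under the "tiff" file suffix.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;

// Hypothetical helper (not part of PDFBox): reports whether ImageIO
// can find a writer plugin for a given file suffix.
public class TiffWriterCheck {

    // Returns true if at least one ImageWriter is registered for the suffix.
    static boolean hasWriterFor(String suffix) {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersBySuffix(suffix);
        return writers.hasNext();
    }

    public static void main(String[] args) {
        // Assumption: jai_imageio registers its TIFF writer under "tiff".
        if (hasWriterFor("tiff")) {
            System.out.println("A TIFF writer is registered.");
        } else {
            System.out.println("No TIFF writer found; is jai_imageio on the classpath?");
        }
    }
}
```

Running this with the same classpath as the extraction code should show whether missing TIFF support could explain unusable output files for images whose suffix is "tiff".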