Date: Thu, 5 Jun 2014 12:44:01 +0000 (UTC)
From: "Tilman Hausherr (JIRA)"
To: dev@pdfbox.apache.org
Subject: [jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

    [ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018745#comment-14018745 ]

Tilman Hausherr commented on PDFBOX-2101:
-----------------------------------------

I don't think that this belongs in PDFRenderer. I'd rather add a clear() method to PDPage that does this call. [~lehmi] WDYT?

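For illustration only, here is a minimal sketch of the idea in the comment above: a page-like object that owns a cache of decoded image data and exposes a clear() method so the caller can release it once the page has been processed. PageSketch, imageCache, getImage and cachedImageCount are hypothetical names invented for this sketch. They are not PDFBox API, and the sketch does not show which call the proposed PDPage.clear() would actually make.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a page object; NOT the real PDPage class.
public class PageSketch {
    // Hypothetical per-page cache of decoded image data, keyed by resource name.
    private final Map<String, byte[]> imageCache = new HashMap<String, byte[]>();

    // Returns the cached data for a resource, allocating it on first use.
    // The byte array stands in for a decoded image held in memory.
    public byte[] getImage(String name, int sizeInBytes) {
        byte[] data = imageCache.get(name);
        if (data == null) {
            data = new byte[sizeInBytes];
            imageCache.put(name, data);
        }
        return data;
    }

    // Number of entries currently held by the cache.
    public int cachedImageCount() {
        return imageCache.size();
    }

    // Drops all cached entries so the garbage collector can reclaim them
    // once the caller no longer holds references of its own.
    public void clear() {
        imageCache.clear();
    }

    public static void main(String[] args) {
        PageSketch page = new PageSketch();
        page.getImage("Im0", 1024);
        page.getImage("Im1", 2048);
        System.out.println("cached before clear: " + page.cachedImageCount());
        page.clear();
        System.out.println("cached after clear: " + page.cachedImageCount());
    }
}
```

Putting clear() on the object that owns the cache, rather than on the renderer, means any caller that walks pages (a renderer, an extraction tool) can release the memory the same way.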
> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
>                 Key: PDFBOX-2101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.5
>         Environment: Windows 7
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>         Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg, java.hprof.zip
>
>
> ExtractImages seems to fail to release memory resources on some files in both PDFBox 1.8.5 and trunk.
> On this 4MB file [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if extracting every image on every page (and there are many, many duplicate images), there is an OOM with -Xmx1g. If -Xmx is not set and more than 2.5g is available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have masks. I'm not sure, though, whether the issue is with PDFBox or Java.
> Commandlines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.2#6252)