Date: Thu, 23 Feb 2017 18:54:19 +0000 (UTC)
From: Viraf Bankwalla
To: users@pdfbox.apache.org
Subject: Re: OutOfMemoryException converting PDF to TIFF Images

Yes, I saw that DefaultResourceCache uses SoftReference; however, for
some reason, when I look at the heap dump, XObjects are prevalent.

The TIFF images are saved to file using ImageIOUtil.writeImage, and I
simply keep the File reference for downstream processing.

The change made to PDResources.isAllowedCache is:

    // Do not cache image XObjects
    COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
    if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
    {
        return false;
    }

After sending my earlier e-mail, I realized that I may not be able to
add a user-defined filter to PDResources (I am still figuring out the
code), and that it might instead belong in DefaultResourceCache.
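
The user-defined filter idea mentioned above could look roughly like the
following self-contained Java sketch. It deliberately does not touch
PDFBox: Resource, FilteringCache and FilteringCacheDemo are hypothetical
stand-ins invented for illustration, and the string subtype check only
mimics the COSName.IMAGE test. Treat it as one possible shape under those
assumptions, not as the PDFBox design.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for a cacheable resource; not a PDFBox class.
final class Resource {
    final String subtype;   // e.g. "Image" or "Form"
    final byte[] payload;

    Resource(String subtype, byte[] payload) {
        this.subtype = subtype;
        this.payload = payload;
    }
}

// Hypothetical cache that asks a caller-supplied filter before storing.
final class FilteringCache {
    private final Map<String, SoftReference<Resource>> entries = new HashMap<>();
    private final Predicate<Resource> allowed;

    FilteringCache(Predicate<Resource> allowed) {
        this.allowed = allowed;
    }

    // Stores the resource only if the filter accepts it.
    // Returns true when the resource was stored.
    boolean put(String key, Resource resource) {
        if (!allowed.test(resource)) {
            return false;
        }
        entries.put(key, new SoftReference<>(resource));
        return true;
    }

    // Returns null if the key was never stored or the value was collected.
    Resource get(String key) {
        SoftReference<Resource> ref = entries.get(key);
        return ref == null ? null : ref.get();
    }
}

final class FilteringCacheDemo {
    public static void main(String[] args) {
        // Filter that rejects images, similar in spirit to the change above.
        FilteringCache cache = new FilteringCache(r -> !"Image".equals(r.subtype));
        System.out.println(cache.put("Im0", new Resource("Image", new byte[16]))); // false
        System.out.println(cache.put("Fm0", new Resource("Form", new byte[16])));  // true
    }
}
```

Passing r -> true as the filter would keep the current cache-everything
behaviour, which is how backwards compatibility could be preserved in
such a design.
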
I would like to get a fix for this issue into the next release. There
may be a better fix than the one I have provided above, as I am
unfamiliar with the code base.

Thanks - viraf

From: Tilman Hausherr
To: users@pdfbox.apache.org
Sent: Thursday, February 23, 2017 1:06 PM
Subject: Re: OutOfMemoryException converting PDF to TIFF Images

On 23.02.2017 at 15:07, viraf.bankwalla@yahoo.com.INVALID wrote:
> I am using PDFBox to convert PDF documents to a series of TIFF images
> (one for each page). The implementation uses PDFRenderer to render
> each page. Things work fine when I am processing a single document in
> a single thread; however, when I try to process multiple documents
> (each in its own thread) I get an OutOfMemoryError.
> In analyzing the heap dump, I see that this is caused by the images
> cached in DefaultResourceCache. Objects are added to the cache in
> PDResources, which includes a method
> private boolean isAllowedCache(PDXObject xobject) that is used to
> determine whether a PDXObject can be cached. I have extended this to
> filter out COSName.IMAGE, and am now able to process multiple
> documents in parallel.
> I'd like to contribute this change back to the community. However,
> before adding it, I thought some feedback on the filtering mechanism
> would be appropriate. Some options include:
>
>    - Always exclude images.
>    - Allow the user to specify whether images should be cached (add a
>      method to PDResources to toggle filtering of images). The default
>      would be to cache images, for backwards compatibility.
>    - Defer the image caching decision to the user through a callback.
>      The default callback would cache all images, for backwards
>      compatibility.
>
> I also wanted to know how best to submit my patch for inclusion.
> Thanks

In theory, the cache should dump its contents when memory becomes low,
thanks to the SoftReference. Thus I'm wondering whether it works at
all; that's the real question.

About your own code: are you "losing" any reference to the TIFF you
produce after each page? Or are they in an array of images?

Regarding submissions, see
https://pdfbox.apache.org/codingconventions.html , open an issue in JIRA
and submit a .diff / .patch file. Be aware that we don't accept every
submission. I'm skeptical about callbacks; we don't do such a thing
anywhere.

Tilman
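
The SoftReference point can be illustrated without PDFBox. In the
minimal sketch below, SoftCache and SoftCacheDemo are hypothetical names
and the class is not the DefaultResourceCache implementation. It shows
only the general Java rule: a softly cached value cannot be reclaimed
while some strong reference to it still exists, which is one reason
cached objects might still dominate a heap dump.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal soft-reference cache; not the PDFBox implementation.
final class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    // Returns null if the key is unknown or the value has been collected.
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }
}

final class SoftCacheDemo {
    public static void main(String[] args) {
        SoftCache<String, byte[]> cache = new SoftCache<>();

        byte[] pinned = new byte[1024];      // strong reference kept by the caller
        cache.put("pinned", pinned);
        cache.put("loose", new byte[1024]);  // only softly reachable from here on

        System.gc(); // a hint only; the JVM decides if and when to clear soft references

        // "pinned" cannot be cleared while the local variable keeps it alive.
        System.out.println(cache.get("pinned") == pinned);

        // "loose" may or may not still be present. The JVM only guarantees that
        // softly reachable objects are cleared before an OutOfMemoryError is thrown.
        System.out.println(cache.get("loose") != null);
    }
}
```

If a heap dump shows many cached objects, checking whether they are also
strongly reachable from elsewhere (for example from collections kept by
the calling code) is a reasonable first step.
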