pdfbox-users mailing list archives

From Viraf Bankwalla <viraf_bankwa...@yahoo.com.INVALID>
Subject Re: OutOfMemoryException converting PDF to TIFF Images
Date Thu, 23 Feb 2017 18:54:19 GMT
Yes, I saw that DefaultResourceCache uses SoftReference; however, for some reason XObjects
are still prevalent when I look at the heap dump.
The TIFF images are saved to file using ImageIOUtil.write, and I keep only the File reference
for downstream processing.
The change made to PDResources.isAllowedCache is:
        COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
        if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
        {
            return false;
        }
After sending my earlier e-mail, I realized that I may not be able to add a user-defined filter
to PDResources (I am still figuring out the code), and that it may belong in DefaultResourceCache instead.
I would like to get a fix for the issue into the next release. There may be a better
fix than the one I have provided above, as I am unfamiliar with the code base.
- viraf

From: Tilman Hausherr <THausherr@t-online.de>
To: users@pdfbox.apache.org
Sent: Thursday, February 23, 2017 1:06 PM
Subject: Re: OutOfMemoryException converting PDF to TIFF Images
On 23.02.2017 at 15:07, viraf.bankwalla@yahoo.com.INVALID wrote:
> I am using PDFBox to convert PDF documents to a series of TIFF images (one for each page).
> The implementation uses PDFRenderer to render each page.  Things work fine when I am processing
> a single document in a single thread; however, when I try to process multiple documents (each
> in its own thread) I get an OutOfMemoryException.
> In analyzing the heap dump, I see that this is caused by the images cached in DefaultResourceCache.
> Objects are added to the cache in PDResources, which includes a method private boolean
> isAllowedCache(PDXObject xobject) that is used to determine whether a PDXObject can be cached.
> I have extended this to filter out COSName.IMAGE, and am now able to process multiple documents
> in parallel.
> I'd like to contribute this change back to the community.  However, prior to adding this,
> I thought some feedback on the filtering mechanism may be appropriate.  Some options include:
>    - Always exclude images
>    - Allow the user to specify whether images should be cached or not (add a method to
>      PDResources to toggle filtering of images).  The default would include caching of images,
>      to be backwards compatible.
>    - Defer the image caching decision to the user through a callback.  The default callback
>      would cache all images to provide backwards compatibility.
> I also wanted to know how best to submit my patch for inclusion.
> Thanks
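
To make the third option in the quoted list concrete, here is a minimal, self-contained sketch of a cache whose caching decision is deferred to a caller-supplied predicate. It does not use PDFBox: XObjectStub, FilteringCache and the string subtypes are hypothetical stand-ins invented for illustration, and nothing here reflects how PDResources or DefaultResourceCache are actually implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class CachePolicySketch {
    // Hypothetical stand-in for an XObject; only its subtype matters here.
    static final class XObjectStub {
        final String subtype;

        XObjectStub(String subtype) {
            this.subtype = subtype;
        }
    }

    // Hypothetical cache that asks a predicate before storing anything.
    static final class FilteringCache {
        private final Map<String, XObjectStub> entries = new HashMap<>();
        private final Predicate<XObjectStub> allowed;

        // Default policy caches everything, mirroring the
        // backwards-compatible default proposed in the message.
        FilteringCache() {
            this(x -> true);
        }

        FilteringCache(Predicate<XObjectStub> allowed) {
            this.allowed = allowed;
        }

        void put(String key, XObjectStub xobject) {
            if (allowed.test(xobject)) {
                entries.put(key, xobject);
            }
        }

        // Returns null when the key was never stored or was filtered out.
        XObjectStub get(String key) {
            return entries.get(key);
        }
    }

    public static void main(String[] args) {
        // Policy that excludes images, analogous to the change described above.
        FilteringCache cache = new FilteringCache(x -> !"Image".equals(x.subtype));
        cache.put("Im1", new XObjectStub("Image"));
        cache.put("Fm1", new XObjectStub("Form"));
        System.out.println("image cached: " + (cache.get("Im1") != null));
        System.out.println("form cached: " + (cache.get("Fm1") != null));
    }
}
```

With a predicate in place, "always exclude images" and "toggle image caching" become particular choices of policy rather than separate mechanisms.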

In theory, the cache should dump its contents when memory becomes low,
thanks to the SoftReference. So the real question is whether that
mechanism is working at all.
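
For readers unfamiliar with the mechanism, the following is a minimal sketch of a SoftReference-backed cache using only the JDK. SoftCache is a hypothetical class written for illustration, not PDFBox's DefaultResourceCache. The point it shows is that a lookup must tolerate a null result, because the garbage collector may clear softly reachable values before throwing an OutOfMemoryError, while a value that is still strongly referenced elsewhere cannot be cleared.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCacheSketch {
    // Hypothetical cache that holds its values only through soft references.
    static final class SoftCache<K, V> {
        private final Map<K, SoftReference<V>> entries = new HashMap<>();

        void put(K key, V value) {
            entries.put(key, new SoftReference<>(value));
        }

        // Returns null if the key was never cached or the GC cleared the value.
        V get(K key) {
            SoftReference<V> ref = entries.get(key);
            return ref == null ? null : ref.get();
        }
    }

    public static void main(String[] args) {
        SoftCache<String, byte[]> cache = new SoftCache<>();
        byte[] image = new byte[1024];
        cache.put("Im1", image);
        // 'image' is still strongly reachable here, so the entry cannot
        // have been cleared and the lookup returns the same array.
        System.out.println(cache.get("Im1") == image);
        // A key that was never stored simply yields null.
        System.out.println(cache.get("Im2") == null);
    }
}
```

This also suggests one thing to check in a heap dump: if the cached objects are strongly reachable through some other path, the soft references alone will not release them.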

About your own code - are you "losing" any reference to the TIFF you 
produce after each page? Or are they in an array of images?

Regarding submitting a patch, see
https://pdfbox.apache.org/codingconventions.html , open an issue in JIRA
and submit a .diff / .patch file. Be aware that we don't accept every
submission. I'm skeptical about callbacks; we don't do such a thing 

