pdfbox-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
Date Thu, 27 Dec 2018 09:43:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729482#comment-16729482 ]

ASF subversion and git services commented on PDFBOX-4188:

Commit 1849793 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1849793 ]

PDFBOX-4182, PDFBOX-4188: remove unused parameter

>  "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
> --------------------------------------------------------------------------------------------------
>                 Key: PDFBOX-4188
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4188
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9, 3.0.0 PDFBox
>            Reporter: Gary Potagal
>            Priority: Major
>         Attachments: PDFBOX-4188-MemoryManagerPatch.zip, PDFBOX-4188-breakingTest.zip, PDFBOX-4188_memory_diagram.png, PDFMergerUtility.java-20180412.patch
> On 06.04.2018 at 23:10, Gary Potagal wrote:
> We wanted to address one more merge issue in org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory and disk,
> for the cache.  Initially, we would often get "Maximum allowed scratch file memory exceeded."
> unless we turned off the check by passing "-1" to org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.
> I believe this is what the users who opened PDFBOX-3721 were running into.
> Our research indicates that the core issue with the memory model is that, instead of sharing
> a single cache, it breaks the cache up into equal-sized fixed partitions based on the number of
> input + output files being merged.  This means that each partition must be big enough to hold
> the final output file.  When 400 1-page files are merged, this creates 401 partitions, each
> of which needs to be big enough to hold the final 400 pages.  Even worse, the merge algorithm
> needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page input files
> and 1 x 400-page output file, or 800 pages.
> However, with the partitioned cache, we need to declare room for 401 x 400 pages, or
> 160,400 pages in total, when specifying "maxStorageBytes".  This would be a very high number,
> usually in the gigabytes.
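
The arithmetic in the two paragraphs above can be checked with a small sketch. It only counts pages, in the same loose sense the message uses; the class and method names are made up for illustration and are not part of PDFBox.

```java
/**
 * Illustrative only: compares the page budget a partitioned cache must
 * declare with the pages actually live near the end of a merge.
 * All names here are hypothetical, not PDFBox API.
 */
public class PartitionedCacheEstimate {

    /** Pages to declare when every partition must be able to hold the whole output. */
    static long partitionedPages(int inputFiles, int pagesPerInput) {
        long partitions = inputFiles + 1L;                    // inputs + one output
        long outputPages = (long) inputFiles * pagesPerInput; // size of the merged result
        return partitions * outputPages;
    }

    /** Pages actually cached with a single shared cache: all inputs plus the output. */
    static long sharedPages(int inputFiles, int pagesPerInput) {
        long inputPages = (long) inputFiles * pagesPerInput;
        return inputPages + inputPages; // the output holds as many pages as the inputs combined
    }

    public static void main(String[] args) {
        int files = 400;
        int pagesEach = 1;
        System.out.println("partitioned budget: " + partitionedPages(files, pagesEach) + " pages");
        System.out.println("shared cache need:  " + sharedPages(files, pagesEach) + " pages");
    }
}
```

For 400 one-page inputs the partitioned budget comes out at 401 x 400 = 160,400 pages, while the live data is roughly two copies of the 400 pages, which is the gap the message describes.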
> Given the current limitation (a huge one) that we need to keep all the input files open
> until the output file is written, we came up with 2 options.  (See PDFBOX-4182)
> 1.  Good: Split the cache in half, give one half to the output file, and segment the other
> half across the input files (still keeping them open until the end).
> 2.  Better: Dynamically allocate 16-page (64K) chunks from memory or disk on demand, and
> release cache as documents are closed after the merge.  This is our current implementation
> until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are addressed.
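
To make option 2 concrete, here is a minimal sketch of a shared pool that hands out fixed-size chunks on demand and takes them back when a document is closed. It is not the submitted patch and not PDFBox code: the class, its methods, and the document ids are hypothetical, and the only figure taken from the message is the 64K chunk size (16 pages, which implies 4 KiB per page).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical shared pool of fixed-size cache chunks; illustration only. */
public class SharedChunkPool {
    /** 64K per chunk, i.e. 16 pages of 4 KiB, as stated in the message. */
    static final int CHUNK_BYTES = 64 * 1024;

    private final long maxChunks;
    private long usedChunks = 0;
    // Chunks currently held by each open document, keyed by an arbitrary id.
    private final Map<String, Long> chunksByDocument = new HashMap<>();

    public SharedChunkPool(long maxBytes) {
        this.maxChunks = maxBytes / CHUNK_BYTES;
    }

    /** Reserve one more chunk for a document; fails only when the whole pool is exhausted. */
    public void allocateChunk(String documentId) {
        if (usedChunks >= maxChunks) {
            throw new IllegalStateException("shared pool exhausted");
        }
        usedChunks++;
        chunksByDocument.merge(documentId, 1L, Long::sum);
    }

    /** Return every chunk held by a document once it has been closed. */
    public void release(String documentId) {
        Long held = chunksByDocument.remove(documentId);
        if (held != null) {
            usedChunks -= held;
        }
    }

    public long usedChunks() {
        return usedChunks;
    }

    public long freeChunks() {
        return maxChunks - usedChunks;
    }
}
```

Because the limit applies to the pool as a whole rather than to per-file partitions, a small input only ever consumes the chunks it touches, and closing it makes those chunks available to the output document.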
> We would like to submit our current implementation as a patch to 2.0.10 and 3.0.0, unless
> this is already addressed.
>  Thank you

This message was sent by Atlassian JIRA