pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Questions about PDFMergerUtility
Date Thu, 07 Dec 2017 18:51:26 GMT
There's a bug in merging:
https://stackoverflow.com/questions/47140209/files-flattened-and-merged-with-pdfbox-are-sharing-common-cosstream
https://issues.apache.org/jira/browse/PDFBOX-3999

If your documents don't have a structure tree, then you can close each input file early.

Tilman

On 07.12.2017 at 19:24, David Fertig wrote:
> I'm looking into merging multiple PDF files using more realistic memory/disk limits.
> For example, when merging 400 1-page files, PdfBox thinks it needs 30G of space. This is
> due to the way it segments the cache limits across all the input sources plus the output file,
> with the output cache limited to the same size as each input file. I've experimented with
> two easy modifications and one more involved modification.
>
>    1.  Good: Split the cache in ½, give ½ to the output file, and segment the other
>        ½ across the input files. (Still keeping them open until the end.)
>    2.  Better: Split the cache in ½, give ½ to the output file and ½ to the input
>        file, and close each input file after merging.
>    3.  Best: Dynamically allocate in 16-page (64K) chunks from memory or disk on demand,
>        and release cache as documents are closed after merge.
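
To make the effect of the split on the configured limit concrete, here is a rough arithmetic sketch in plain Java. It is not PDFBox code: the class and method names are hypothetical, and the sizes (64K of cache per input file, output roughly the sum of the inputs) are assumptions chosen only for illustration. It computes how large the overall limit would have to be so that no partition overflows, first for the even split across all inputs plus the output, then for approaches 1 and 2.

```java
public class MergeCacheBudget {

    /**
     * Even split as described in the message: the limit is divided equally
     * among all inputs plus the output, so every partition must be as large
     * as the largest consumer.
     */
    static long requiredEvenSplit(int inputs, long inputBytes, long outputBytes) {
        long perPartition = Math.max(inputBytes, outputBytes);
        return perPartition * (inputs + 1L);
    }

    /** Approach 1: half for the output, the other half shared by all open inputs. */
    static long requiredHalfOutputHalfAllInputs(int inputs, long inputBytes, long outputBytes) {
        return 2L * Math.max(outputBytes, inputBytes * inputs);
    }

    /** Approach 2: half for the output, half for the single input currently open. */
    static long requiredHalfOutputHalfOneInput(long inputBytes, long outputBytes) {
        return 2L * Math.max(outputBytes, inputBytes);
    }

    public static void main(String[] args) {
        int inputs = 400;                       // from the example in the message
        long inputBytes = 64L * 1024;           // assumed cache need per input file
        long outputBytes = inputs * inputBytes; // assumed: output is about the sum of the inputs

        System.out.println("even split : " + requiredEvenSplit(inputs, inputBytes, outputBytes));
        System.out.println("approach 1 : " + requiredHalfOutputHalfAllInputs(inputs, inputBytes, outputBytes));
        System.out.println("approach 2 : " + requiredHalfOutputHalfOneInput(inputBytes, outputBytes));
    }
}
```

With these assumed sizes the even split needs a limit about 200 times larger than either half-and-half scheme. Real numbers depend on the actual documents.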
>
> All these approaches have reduced the memory limit requirements by 1-2 orders of magnitude.
> While I realize this doesn't change the actual memory and disk space used, it allows the
> limits to be a reasonable expectation of the space used during the merge process.
>
> I have one question. Approaches #2 and #3 both close each input file right after it is
> merged and have no issues (in limited testing). Is there a reason the current merge utility
> keeps all the input files open during the merge and only closes them all at the end? Closing
> each one after it is merged would save considerable cache space and reduce the need for so
> many file handles as well.
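
The cache saving from closing inputs early can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the PDFBox scratch-file implementation: `ChunkPoolSketch` and its methods are hypothetical, the 4096-byte page size is inferred from "16 page (64K)", and the output is assumed to grow by the size of each input as it is copied. The pool hands out 64K chunks on demand and records the peak number in use, once with each input released right after its merge and once with all inputs held until the end.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ChunkPoolSketch {
    static final int PAGE_SIZE = 4096;                          // inferred from "16 page (64K)"
    static final int PAGES_PER_CHUNK = 16;
    static final int CHUNK_BYTES = PAGE_SIZE * PAGES_PER_CHUNK; // 64K

    private final int maxChunks;
    private int inUse;
    private int peak;

    ChunkPoolSketch(int maxChunks) {
        this.maxChunks = maxChunks;
    }

    /** Reserves enough whole chunks for the given byte count and returns how many were taken. */
    int allocate(long bytes) {
        int chunks = (int) ((bytes + CHUNK_BYTES - 1) / CHUNK_BYTES);
        if ((long) inUse + chunks > maxChunks) {
            throw new IllegalStateException("cache limit exceeded");
        }
        inUse += chunks;
        peak = Math.max(peak, inUse);
        return chunks;
    }

    /** Returns chunks to the pool, as would happen when a document is closed. */
    void release(int chunks) {
        inUse -= chunks;
    }

    int peak() {
        return peak;
    }

    /** Simulates merging equally sized inputs and returns the peak number of chunks in use. */
    static int simulateMerge(int inputs, long inputBytes, boolean closeEarly) {
        ChunkPoolSketch pool = new ChunkPoolSketch(Integer.MAX_VALUE);
        Deque<Integer> stillOpen = new ArrayDeque<>();
        for (int i = 0; i < inputs; i++) {
            int source = pool.allocate(inputBytes); // cache held by the input document
            pool.allocate(inputBytes);              // output grows by the copied content
            if (closeEarly) {
                pool.release(source);               // input closed right after its merge
            } else {
                stillOpen.push(source);             // input stays open until the end
            }
        }
        while (!stillOpen.isEmpty()) {
            pool.release(stillOpen.pop());
        }
        return pool.peak();
    }

    public static void main(String[] args) {
        System.out.println("close early: " + simulateMerge(400, CHUNK_BYTES, true));
        System.out.println("close last : " + simulateMerge(400, CHUNK_BYTES, false));
    }
}
```

In this model the peak with early closing is the output plus one open input, while keeping everything open roughly doubles it. Whether early closing is safe in practice still depends on the shared-stream issue referenced at the top of this reply.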
>
> Thank you,
> David



