Subject: Re: Questions about PDFMergerUtility
To: users@pdfbox.apache.org
From: Tilman Hausherr
Message-ID: <407bc659-11bf-1184-61d0-7b18c5c4d4bc@t-online.de>
Date: Thu, 7 Dec 2017 19:51:26 +0100

There's a bug in merging:
https://stackoverflow.com/questions/47140209/files-flattened-and-merged-with-pdfbox-are-sharing-common-cosstream
https://issues.apache.org/jira/browse/PDFBOX-3999

If you don't have a structure tree, then you can close each input file early.

Tilman

On 07.12.2017 at 19:24, David Fertig wrote:
> I'm looking into merging multiple PDF files using more realistic memory/disk limits. For example, when merging 400 1-page files, PDFBox thinks it needs 30G of space. This is due to the way it segments the cache limits across all the input sources plus the output file, with the output cache limited to the same size as each input file. I've experimented with two easy modifications and one more involved modification.
>
> 1. Good: Split the cache in ½, give ½ to the output file, and segment the other ½ across the input files. (Still keeping them open until the end.)
> 2. Better: Split the cache in ½, give ½ to the output file and ½ to the input file, and close each input file after merging.
> 3. Best: Dynamically allocate in 16-page (64K) chunks from memory or disk on demand, and release cache as documents are closed after merge.
>
> All these approaches have reduced the memory limit requirements by 1-2 orders of magnitude. While I realize this doesn't change the actual memory and disk space used, it allows the limits to be a reasonable expectation of the space used during the merge process.
>
> I have one question. Approaches #2 and #3 both close the input files right after they are merged and have no issues (in limited testing). Is there a reason the current merge utility keeps all the input files open during the merge and only closes them all at the end? Closing them after they are merged would save considerable cache space and reduce the need for so many file handles as well.
>
> Thank you,
> David
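
For anyone comparing the options in the quoted message, the budget arithmetic can be sketched roughly as follows. This is not PDFBox code: the class and method names are hypothetical, the 1 GiB budget is only an illustrative figure, and the 4 KiB page size is inferred from "16 page (64K)". The "equal share" method follows the current behaviour as the quoted message describes it, not a reading of the PDFBox source.

```java
/** Hypothetical helper (not part of PDFBox) comparing ways to divide a cache budget during a merge. */
final class CacheBudgetSketch {

    // Assumption: "16 page (64K)" in the quoted message implies 4 KiB pages.
    static final long PAGE_BYTES = 4 * 1024;
    static final long CHUNK_BYTES = 16 * PAGE_BYTES;

    /** Current behaviour as described in the message: every input and the output get the same share. */
    static long equalShare(long totalBytes, int inputCount) {
        return totalBytes / (inputCount + 1);
    }

    /** Options 1 and 2: the output file gets half of the budget. */
    static long outputShare(long totalBytes) {
        return totalBytes / 2;
    }

    /** Option 1: the other half is divided across all inputs, which stay open until the end. */
    static long inputShareAllOpen(long totalBytes, int inputCount) {
        return (totalBytes - outputShare(totalBytes)) / inputCount;
    }

    /** Option 2: only one input is open at a time, so it can use the whole other half. */
    static long inputShareOneOpen(long totalBytes) {
        return totalBytes - outputShare(totalBytes);
    }

    /** Option 3: number of 64 KiB chunks to allocate on demand for a given amount of data. */
    static long chunksNeeded(long byteCount) {
        return (byteCount + CHUNK_BYTES - 1) / CHUNK_BYTES;
    }

    public static void main(String[] args) {
        long budget = 1L << 30; // illustrative 1 GiB budget, not a figure from the thread
        int inputs = 400;       // the 400-file example from the quoted message

        System.out.println("equal share per document: " + equalShare(budget, inputs));
        System.out.println("option 1, output share:    " + outputShare(budget));
        System.out.println("option 1, share per input: " + inputShareAllOpen(budget, inputs));
        System.out.println("option 2, share per input: " + inputShareOneOpen(budget));
        System.out.println("option 3, chunks for 1 MiB: " + chunksNeeded(1L << 20));
    }
}
```

With an equal split, the output file is held to the same small share as each input, which is why the overall limit has to be inflated when there are many inputs. Giving the output a fixed half removes that dependence on the input count.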