pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Limit PDF size
Date Tue, 01 Aug 2017 20:42:22 GMT
On 01.08.2017 at 22:09, Christopher Schultz wrote:
> Tilman,
> On 8/1/17 3:22 PM, Tilman Hausherr wrote:
>> The only thing that comes close to what you want is to create your
>> PDDocument with MemoryUsageSetting.setupMixed(...) as parameter.
> So that we can buffer to disk if the in-memory representation gets too
> big? That sounds like a good approach, and probably the most useful to
> me.
> It also appears that I can set a maximum in-memory limit like this:
> MemoryUsageSetting mus =
>     MemoryUsageSetting.setupMainMemoryOnly(1 * 1024 * 1024);
> PDDocument doc = new PDDocument(mus);

Yes, although this means you'd get an exception as soon as you use more.
That's why I recommend the mixed one. You could use the memory limit for
stress tests, i.e. create the "worst" possible file and see how much
memory it needs.

Note that only streams are cached. Ordinary Java structures (e.g. maps,
numbers, strings) are not.

> ... and then this should enforce a 1MiB size limit, no? I think that's
> all I want... there shouldn't be any reason for me to have to touch
> the disk: my files are really quite small. I just don't want something
> to go wrong with my client code and inadvertently go into an infinite
> loop adding "Hello World" to the document over and over until I have
> 50k pages in the PDF and an OOME on my hands.
>> What you should do is take care not to duplicate anything. So if
>> you have a company logo on every page, create your object
>> only once. Same for fonts.
> We have something like:
> private Font _theFont;
> ...
> contentStream.setFont(_theFont);
> contentStream.newLineAtOffset(x,y);
> contentStream.showText("Hello, world");
> ...
> Many many times. The Font object reference stays the same, so I'm
> guessing that's okay and the font is used once and referenced many
> times, right?


To create small PDF files, use PDType0Font.load() instead of
PDTrueTypeFont.load(). This will subset the fonts when the document is
saved.

>> And try to have only one content stream per page. (We recently had
>> a guy who had a huge number of content streams and wondered why his
>> PDF was so big).
> Check: we have only one PDPageContentStream per page.
> We have a single logo on the first page and nothing repeated.
> Our PDFs are almost 100% plain-text with lots of whitespace (which
> doesn't count, I know). When base64 encoded, they are typically only a
> few kb in size.
> I'm mostly operating from a position of borderline unhealthy paranoia,
> but I'd rather have a bit of code added to ensure that I don't have to
> get paged in the middle of the night to restart a service that has
> suffered an OOME.

This all sounds harmless. All the memory problems I can think of were 
related to rendering, not PDF creation.

We've had at least one speed complaint, but that one is solved in the
current version.


> Thanks for the pointers.
> -chris
>> On 01.08.2017 at 20:04, Christopher Schultz wrote:
>> All,
>> We use PDFBox on a server that must handle many transactions with
>> (somewhat) limited memory. I'd like to limit the amount of memory
>> used to generate our PDFs, which we then serialize to byte-array,
>> base64-encode, etc. for ultimate delivery to some endpoint.
>> I can obviously limit the number of bytes produced by using a
>> size-limited OutputStream passed into
>> PDDocument.save(OutputStream), but I'm wondering if PDFBox has any
>> facilities within it to limit the size of the object-tree in memory
>> (or estimate its size, and we can stop operations when it reaches a
>> certain size) so that we don't end up with a multi-GB object-tree
>> that then fails to serialize to byte[] because it is too big.
>> We are building our PDF documents from scratch, starting with the
>> page definitions, fonts, etc. then adding titles, paragraphs of
>> text, etc. It's all fairly straightforward, and we have full
>> control over the whole process up to and including the call to
>> PDDocument.save(OutputStream).
>> We are manually constructing our pages as well, so I suppose we
>> could simply limit the number of pages, but I'm more concerned
>> about the size of the memory used and not the number of pages.
>> Is there anything in PDFBox that can help us with this? We can
>> always count e.g. the number of bytes/characters we have written to
>> the PDF, but that seems less important than what is going on inside
>> of the PDF structure itself.
>> -chris
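
For reference, the size-limited OutputStream mentioned in the original question could look roughly like the sketch below. It uses only java.io; the class name LimitedOutputStream and its getWritten() accessor are hypothetical and are not part of PDFBox. It caps only the serialized output and does nothing about the size of the in-memory object tree.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical helper (not a PDFBox class): fails once a byte budget is exceeded. */
class LimitedOutputStream extends FilterOutputStream {
    private final long maxBytes;
    private long written;

    LimitedOutputStream(OutputStream out, long maxBytes) {
        super(out);
        this.maxBytes = maxBytes;
    }

    @Override
    public void write(int b) throws IOException {
        ensureCapacity(1);
        out.write(b);
        written++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        ensureCapacity(len);
        // Write to the wrapped stream directly; FilterOutputStream's default
        // implementation would loop over write(int) one byte at a time.
        out.write(buf, off, len);
        written += len;
    }

    /** Rejects the whole write if it would push the total past the limit. */
    private void ensureCapacity(int len) throws IOException {
        if (written + len > maxBytes) {
            throw new IOException("Output limit of " + maxBytes + " bytes exceeded");
        }
    }

    /** Number of bytes passed through to the wrapped stream so far. */
    long getWritten() {
        return written;
    }
}
```

Assuming a ByteArrayOutputStream named buffer, it would be used as doc.save(new LimitedOutputStream(buffer, 1024 * 1024)), so that a runaway document fails with an IOException during save instead of producing an oversized byte array.
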
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
