pdfbox-users mailing list archives

From Christopher Schultz <ch...@christopherschultz.net>
Subject Re: Limit PDF size
Date Tue, 01 Aug 2017 23:17:32 GMT


On 8/1/17 4:42 PM, Tilman Hausherr wrote:
> On 01.08.2017 at 22:09, Christopher Schultz wrote:
> Tilman,
> On 8/1/17 3:22 PM, Tilman Hausherr wrote:
>>>> The only thing that comes close to what you want is to create
>>>> your PDDocument with MemoryUsageSetting.setupMixed(...) as
>>>> parameter.
> So that we can buffer to disk if the in-memory representation gets
> too big? That sounds like a good approach, and probably the most
> useful to me.
> It also appears that I can set a maximum in-memory limit like
> this:
> MemoryUsageSetting mus =
>     MemoryUsageSetting.setupMainMemoryOnly(1 * 1024 * 1024);
> PDDocument doc = new PDDocument(mus);
>> Yes. Although this would mean you'd get an exception if you use
>> more. That's why I recommend the mixed one. You could use the
>> memory limit for stress tests, i.e. create the "worst" possible
>> file and see what you need.

I think I'm okay with an exception in these cases. As I said, our PDFs
only end up being a few kiB in size, so I've put a 1MiB cap on the
memory-only memory usage strategy for the time being.
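
Separately from the in-memory cap, the size of the serialized file could
be bounded by wrapping whatever OutputStream the document is saved to.
This is only a sketch using plain java.io: CappedOutputStream is a
hypothetical helper of my own naming, not part of PDFBox, and the demo
writes dummy bytes where a real caller would pass the stream to
PDDocument.save(OutputStream).

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper (not a PDFBox class): throws once more than
// maxBytes would be written through it.
class CappedOutputStream extends FilterOutputStream {
    private final long maxBytes;
    private long written;

    CappedOutputStream(OutputStream out, long maxBytes) {
        super(out);
        this.maxBytes = maxBytes;
    }

    @Override
    public void write(int b) throws IOException {
        reserve(1);
        out.write(b);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        reserve(len);
        out.write(buf, off, len);
    }

    long getWritten() {
        return written;
    }

    // Rejects the write before any of its bytes reach the wrapped stream.
    private void reserve(long n) throws IOException {
        if (written + n > maxBytes) {
            throw new IOException("Output exceeds cap of " + maxBytes + " bytes");
        }
        written += n;
    }
}

class CappedOutputStreamDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (CappedOutputStream capped = new CappedOutputStream(sink, 1024 * 1024)) {
            // With PDFBox this line would be: doc.save(capped);
            capped.write(new byte[512]);
            System.out.println("bytes written: " + capped.getWritten());
        }
    }
}
```

Note that this only limits what gets written out; it does nothing about
how large the document grows in memory before save() is called.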

I'm curious about what's being constrained, here... does PDFBox
estimate its current memory-usage of various PD* objects in memory and
push to disk when that's exceeded, or does it just limit the amount of
memory that gets used when serializing out to a stream?

>> Note that only streams are cached. Ordinary java structures (e.g.
>> maps, numbers, strings) are not.

Can you tell me a little more about that? When you say "streams are
cached", what does that mean exactly?

Or have I essentially already asked that question above?

> ... and then this should enforce a 1MiB size limit, no? I think
> that's all I want... there shouldn't be any reason for me to have
> to touch the disk: my files are really quite small. I just don't
> want something to go wrong with my client code and inadvertently go
> into an infinite loop adding "Hello World" to the document over and
> over until I have 50k pages in the PDF and an OOME on my hands.
>>>> What you should do is to care to not have anything duplicate.
>>>> So if you have a company logo on every page, create your
>>>> object only once. Same for fonts.
> We have something like:
> private Font _theFont;
> ...
> contentStream.setFont(_theFont);
> contentStream.newLineAtOffset(x,y);
> contentStream.showText("Hello, world");
> ...
> Many many times. The Font object reference stays the same, so I'm 
> guessing that's okay and the font is used once and referenced many 
> times, right?
>> Yes!
>> To create small PDF files, use PDType0Font.load() instead of 
>> PDTrueTypeFont.load(), this will subset the fonts after saving.

We are using PDType1Font.FONTNAME for everything, so we aren't calling
.load for anything at all.

>>>> And try to have only one content stream per page. (We
>>>> recently had a guy who had a huge number of content streams
>>>> and wondered why his PDF was so big).
> Check: we have only one PDPageContentStream per page.
> We have a single logo on the first page and nothing repeated.
> Our PDFs are almost 100% plain-text with lots of whitespace (which 
> doesn't count, I know). When base64 encoded, they are typically
> only a few kb in size.
> I'm mostly operating from a position of borderline unhealthy
> paranoia, but I'd rather have a bit of code added to ensure that I
> don't have to get paged in the middle of the night to restart a
> service that has suffered an OOME.
>> This all sounds harmless. All the memory problems I can think of
>> were related to rendering, not PDF creation.

Sounds good.
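
For the runaway-loop worry quoted above, a guard that does not depend on
PDFBox at all might be the simplest safety net. A minimal sketch follows;
PageBudget is a hypothetical class of my own, and the limit of 3 pages is
just for the demo. The idea is to claim a page from the budget before each
page is added, so a buggy loop fails fast instead of growing the document
until the heap is exhausted.

```java
// Hypothetical guard (not a PDFBox class): counts pages and fails fast
// once a fixed budget is used up.
class PageBudget {
    private final int maxPages;
    private int used;

    PageBudget(int maxPages) {
        if (maxPages < 1) {
            throw new IllegalArgumentException("maxPages must be at least 1");
        }
        this.maxPages = maxPages;
    }

    // Call once before each page is added to the document.
    void claimPage() {
        if (used >= maxPages) {
            throw new IllegalStateException("Page budget of " + maxPages + " exceeded");
        }
        used++;
    }

    int used() {
        return used;
    }
}

class PageBudgetDemo {
    public static void main(String[] args) {
        PageBudget budget = new PageBudget(3);
        try {
            while (true) { // simulated runaway loop in client code
                budget.claimPage();
                // With PDFBox, the page would be created and added here.
            }
        } catch (IllegalStateException e) {
            System.out.println("stopped after " + budget.used() + " pages");
        }
    }
}
```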

>> We've had at least one speed complaint, but that one is solved in
>> the current version.

I'll make sure we are up-to-date.

-chris

