pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Determining why a PDF is large
Date Fri, 17 May 2019 03:23:25 GMT
On 16.05.2019 at 22:29, Christopher Schultz wrote:
> All,
>
> A simple tweak to the getFullUnicodeFont method to cache the loaded
> font made a huge difference. The resulting file is now only 20% of the
> original size when not embedding the same font over and over again.
>
> Just so I have things sorted in my own mind: each font used will still
> show on each page where it's used, right?


Yes!


> In the "smaller" file, I can
> still see the font mentioned on more than one page, but it's got the
> same "CID" and the same font name ("AAAROV+ArialUnicodeMS" -- no more
> "AAA???+ArialUnicodeMS" coming up multiple times with slightly
> different names).


Yes, that and the object number (e.g. "[29 0 R]").


>
> Of course, I'm also seeing the Type1 fonts show up repeated on
> multiple pages as well -- that's normal, right?

Yes. These are still the same object.
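
To make that check concrete, here is a minimal sketch in plain Java (no PDFBox calls) that applies the same rule to the listing quoted further down: strip the six-letter subset tag from each font name, group by the remaining base name, and count the distinct object references. The class FontObjectAudit and its methods are hypothetical names made up for this illustration, and the entries are copied by hand from the PDFDebugger output in this thread.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: flags fonts that appear under more than one object. */
final class FontObjectAudit {

    /** Strips a subset tag such as "AAAGXI+" (six uppercase letters and a plus sign). */
    static String baseName(String fontName) {
        return fontName.replaceFirst("^[A-Z]{6}\\+", "");
    }

    /**
     * Maps each base font name to the distinct object references it appears under.
     * Each entry is {object reference, font name} as displayed by PDFDebugger.
     */
    static Map<String, Set<String>> distinctObjects(String[][] entries) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String[] entry : entries) {
            result.computeIfAbsent(baseName(entry[1]), k -> new LinkedHashSet<>())
                  .add(entry[0]);
        }
        return result;
    }

    /** Entries copied by hand from the listing in this thread (only those shown). */
    static String[][] listingFromThread() {
        return new String[][] {
            {"19 0 R", "AAAGXI+ArialUnicodeMS"}, // page 1, F1
            {"28 0 R", "Times-Italic"},          // page 1, F10
            {"29 0 R", "AAABJI+ArialUnicodeMS"}, // page 1, F11
            {"20 0 R", "Times-Roman"},           // page 2, F1
            {"31 0 R", "AAAYGI+ArialUnicodeMS"}, // page 2, F2
            {"28 0 R", "Times-Italic"},          // page 2, F3
            {"20 0 R", "Times-Roman"},           // page 3, F1
            {"28 0 R", "Times-Italic"},          // page 3, F2
            {"20 0 R", "Times-Roman"},           // page 4, F1
            {"28 0 R", "Times-Italic"},          // page 4, F2
        };
    }

    public static void main(String[] args) {
        distinctObjects(listingFromThread()).forEach((name, refs) ->
            System.out.println(name + ": " + refs.size() + " object(s) " + refs
                + (refs.size() > 1 ? "  <-- more than one object" : "")));
    }
}
```

On the old listing, ArialUnicodeMS ends up with three distinct objects (19, 29 and 31), while Times-Roman and Times-Italic each map to a single object no matter how many pages mention them. Treat this as a heuristic only: it compares names and object numbers, not the embedded font data.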



>
> Thanks,
> -chris
>
> On 5/16/19 16:06, Christopher Schultz wrote:
>> Tilman,
>>
>> On 5/16/19 12:17, Tilman Hausherr wrote:
>>> PDFDebugger.
>>> Look at the resources. If the same font occurs several times,
>>> then you did something wrong. It should occur only once in a
>>> document.
>> Okay, it looks like it is indeed showing multiple times. Here's
>> what I can see in the document:
>>
>> Page 1
>>   Contents
>>   MediaBox
>>   Parent
>>   Resources (1) [8 0 R]
>>     Font (12) [15 0 R]
>>       F1 (6) [19 0 R] /T:Font /S:Type0 (AAAGXI+ArialUnicodeMS)
>>       F10 (4) [28 0 R] /T:Font /S:Type1 (Times-Italic)
>>       F11 (6) [29 0 R] /T:Font /S:Type0 (AAABJI+ArialUnicodeMS)
>>       (9 more listed: 3 total type 1 fonts, 9 total type 0 fonts
>>       including those above)
>>
>> The font AAA???I+ArialUnicodeMS shows up for all of the "type 0"
>> entries.
>>
>> Page 2 [...]
>>   Resources
>>     Font (3)
>>       F1 (4) [20 0 R] /T:Font /S:Type1 (Times-Roman)
>>       F2 (6) [31 0 R] /T:Font /S:Type0 (AAAYGI+ArialUnicodeMS)
>>       F3 (4) [28 0 R] /T:Font /S:Type1 (Times-Italic)
>>
>> Page 3 [...]
>>   Resources
>>     Font (2)
>>       F1 (4) [20 0 R] /T:Font /S:Type1 (Times-Roman)
>>       F2 (4) [28 0 R] /T:Font /S:Type1 (Times-Italic)
>>
>> Page 4 [...]
>>   Resources
>>     Font (2)
>>       F1 (4) [20 0 R] /T:Font /S:Type1 (Times-Roman)
>>       F2 (4) [28 0 R] /T:Font /S:Type1 (Times-Italic)
>>
>> So perhaps I am even using the built-in fonts incorrectly if they
>> are being mentioned on every page. Or is each page which uses a
>> font expected to have its own Font entry in the resources?
>>
>> Does this mean I am "adding" the font too many times somehow?
>>
>> My code looks like this:
>>
>> private void writeWrappedText(PDFont font, int fontSize, String text,
>>                               Color color) throws IOException {
>>     int paragraphWidth = 500;
>>     boolean indented = false;
>>
>>     String strippedText = sanitizeString(text);
>>     int start = 0;
>>     int end = 0;
>>     int wrappedLineCnt = 1;
>>
>>     if(!isAnsiEncoding(strippedText)) {
>>         if(logger.isDebugEnabled())
>>             logger.debug("Text contains non-ansi characters: " + text);
>>
>>         font = getFullUnicodeFont();
>>     }
>>
>>     for ( int i : getPossibleWrapPoints(strippedText) ) {
>>         float width = font.getStringWidth(strippedText.substring(start,i))
>>                       / 1000 * fontSize;
>>         if ( start < end && width > paragraphWidth ) {
>>             if (wrappedLineCnt == 1)
>>                 setOffsetX(getOffsetXforMargin());
>>             printSanitizedLine(font, fontSize,
>>                 strippedText.substring(start,end),
>>                 indented ? _pageIndent : 0, color);
>>             wrappedLineCnt++;
>>             start = end;
>>         }
>>         end = i;
>>     }
>>     if (wrappedLineCnt == 1)
>>         setOffsetX(getOffsetXforMargin());
>>     // Last piece of text
>>     printSanitizedLine(font, fontSize, strippedText.substring(start),
>>         indented ? _pageIndent : 0, color);
>> }
>>
>> The getFullUnicodeFont method is:
>>
>> private PDFont getFullUnicodeFont() {
>>     if(null == _doc)
>>         throw new IllegalStateException("Document has not yet been created; cannot load a new font");
>>
>>     InputStream in = null;
>>     try {
>>         String fullUnicodeFontFile = "/resources/fonts/ARIALUNI.TTF";
>>         in = getClass().getResourceAsStream(fullUnicodeFontFile);
>>         if(null == in)
>>             throw new MissingResourceException(
>>                 "Cannot load font file " + fullUnicodeFontFile,
>>                 this.getClass().getName(), fullUnicodeFontFile);
>>
>>         PDFont font = PDType0Font.load(_doc, in);
>>
>>         return font;
>>     } catch (IOException ioe) {
>>         throw new RuntimeException("Cannot load font", ioe);
>>     }
>> }
>>
>> Re-reading that code, it's obvious that I should be storing the
>> font once loaded and re-using it. I'm guessing that
>> PDType0Font.load(PDDocument,InputStream) doesn't recognize that
>> the font has already been loaded and just adds it a second (or
>> third, etc.) time. Can anyone confirm that?


Yes!
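
For the archive: the fix you describe comes down to loading once per document and reusing the result. Below is a minimal sketch of that pattern in plain Java. CachedLoader is a hypothetical helper, not part of PDFBox, and the Callable is only a stand-in for the real call, which in your code would be PDType0Font.load(_doc, in).

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: runs an expensive loader at most once and reuses the result. */
final class CachedLoader<T> {
    private final Callable<T> loader; // stand-in for e.g. () -> PDType0Font.load(doc, in)
    private T cached;                 // stays null until the first successful load

    CachedLoader(Callable<T> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, loading it on first use; synchronized so two callers cannot both load. */
    synchronized T get() {
        if (cached == null) {
            try {
                cached = loader.call();
            } catch (Exception e) {
                throw new RuntimeException("Cannot load font", e);
            }
        }
        return cached;
    }

    public static void main(String[] args) {
        int[] loads = {0};
        // This loader only counts its calls; a real one would read the TTF.
        CachedLoader<String> font = new CachedLoader<>(() -> "font#" + (++loads[0]));
        font.get();
        font.get();
        System.out.println("loader ran " + loads[0] + " time(s)");
    }
}
```

Keep one such cache per PDDocument rather than a static one, because a loaded font belongs to the document it was loaded for.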


Tilman


>>
>> I know that my code isn't the best in terms of only choosing to
>> render certain glyphs in this "full" font. I am working to improve
>> that, and I know there is example code for choosing the "best" font
>> for each character in a string, which I'll be reviewing
>> separately.
>>
>> Thanks, -chris
>>
>>> On 16.05.2019 at 18:09, Christopher Schultz wrote:
>>>
>>> All,
>>>
>>> We have a process that generates PDF documents usually using the
>>> default Type-1 built-in fonts, so the documents do not embed
>>> the font information.
>>>
>>> We recently added the ability for the documents to include font
>>> information if certain glyphs were not available in the default
>>> font(s) and, as expected, the file sizes end up being bigger
>>> when that happens.
>>>
>>> What is the best tool to look at a particular document to see
>>> why it ended up being so large? I'm not sure I can visually tell
>>> by looking at the document which character triggered the
>>> inclusion of the font, and then why that font was used for what I
>>> can only assume was a lot of text. By inspecting the file, I'm
>>> sure I can improve my code so that we have fewer uses of this
>>> additional font and therefore keep the file sizes to a minimum.
>>>
>>> Thanks, -chris



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

