pdfbox-users mailing list archives

From mehdi houshmand <med1...@gmail.com>
Subject Re: Merging multiple different subsets of the same font, or re-embedding font
Date Sun, 11 Dec 2011 20:05:43 GMT
Hi Craig,

We're looking into this exact same problem. I'll let you know if
anything comes of it.


On 11 December 2011 13:31, Craig Ringer <craig@postnewspapers.com.au> wrote:
> Hi folks
> I'm new here and to PDFBox - I've been looking at it mainly because it's
> used by Apache FOP to embed PDFs in other PDFs during XSL-FO typesetting.
> I'm using it to produce classified advertising pages, and I've run into a
> bit of a roadblock that Google and searches of the FOP and PDFBox mailing
> lists haven't helped with.
> My documents contain 500-1000 small PDF files embedded as form XObjects into
> the master PDF file. Each of the original files has its fonts included as an
> embedded subset. Since many of the documents use different sets of glyphs
> from the same fonts, and the whole PDF is copied into the new document, I
> end up with hundreds of copies of common fonts like "Helvetica Bold
> (subset)" in the final document. A check with Acrobat Pro suggests that over
> 90% of the document's size is embedded fonts.
> What I'm looking for is a way to more intelligently merge the documents to
> reduce or eliminate this font duplication. I'd like to:
> - Embed a whole, non-subset copy of a font if it's available locally, and
> then change all references in documents I'm including as XObjects so they
> refer to the new copy I've embedded (so long as the encodings match); or
> even better
> - As each document is embedded as an XObject into the main document, build a
> list of which glyphs its embedded fonts define. Don't import the font
> embeds, instead leave a dangling indirect reference to a font we're yet to
> define. When all documents are embedded, produce and embed a new subset
> using a local copy of the complete font, including only the glyphs that're
> actually used.
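
The glyph bookkeeping in the second option above can be sketched in plain Java. This is only an illustration under stated assumptions: GlyphUsageCollector and its methods are hypothetical names, not PDFBox API, and the glyph names are assumed to have been extracted from each embedded subset by some other means. The one PDF-level fact it relies on is that subset font names carry a six-uppercase-letter tag followed by "+" (for example "ABCDEF+Helvetica-Bold").

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper (not part of PDFBox): unions the glyphs used per font. */
final class GlyphUsageCollector {
    // Base font name -> union of all glyph names recorded so far.
    private final Map<String, SortedSet<String>> glyphsByFont = new TreeMap<>();

    /** Strips a subset tag such as "ABCDEF+" from a font name, if present. */
    static String baseName(String fontName) {
        if (fontName.matches("[A-Z]{6}\\+.+")) {
            return fontName.substring(7); // six letters plus the '+'
        }
        return fontName;
    }

    /** Records the glyphs that one embedded subset defines. */
    void record(String fontName, Collection<String> glyphNames) {
        glyphsByFont
            .computeIfAbsent(baseName(fontName), k -> new TreeSet<>())
            .addAll(glyphNames);
    }

    /** Returns the merged glyph set for a font (empty if never recorded). */
    SortedSet<String> glyphsFor(String fontName) {
        return glyphsByFont.getOrDefault(baseName(fontName), new TreeSet<>());
    }

    public static void main(String[] args) {
        GlyphUsageCollector collector = new GlyphUsageCollector();
        // Two different subsets of the same font, taken from two source PDFs.
        collector.record("ABCDEF+Helvetica-Bold", Arrays.asList("A", "B", "space"));
        collector.record("GHIJKL+Helvetica-Bold", Arrays.asList("B", "C"));
        // The merged set is what a single new subset would need to cover.
        System.out.println(collector.glyphsFor("Helvetica-Bold"));
    }
}
```

Once every document has been embedded, the merged set for each font is the glyph list a single new subset would have to contain. Producing that subset from a local copy of the font is a separate step that this sketch does not attempt.
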
> Better again would be to extract all the embedded subsets and *combine*
> them, so I wouldn't need a local copy of the font. That's probably way too
> hard, though.
> I realise that I can never de-duplicate embedded subsets with different
> encodings. If there's "Helvetica Black" embedded 3 times, once each in
> WinAnsi, MacRoman and a custom encoding, there's no possible reduction
> without re-encoding the content streams, which is WAY beyond what I want to
> tackle. All I'm interested in is improving the case of 100 copies of
> "Helvetica Black (subset)" in WinAnsi, which I want to reduce to one
> slightly bigger embedded subset covering all the same glyphs or, failing
> that, a complete copy of the font.
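
The constraint described above (only subsets that share both a font and an encoding are candidates for merging) amounts to grouping on a two-part key. A minimal Java sketch follows. SubsetGrouping is a hypothetical name, not PDFBox API, and it assumes each embedded font has already been reduced to a name and an encoding name. A custom encoding would in practice need its full differences compared rather than a label, which this sketch glosses over.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of PDFBox): groups subsets that could merge. */
final class SubsetGrouping {
    /** Builds a key from the untagged font name and the encoding name. */
    static String mergeKey(String fontName, String encoding) {
        // Drop a subset tag such as "ABCDEF+" if the name carries one.
        String base = fontName.matches("[A-Z]{6}\\+.+")
                ? fontName.substring(7)
                : fontName;
        return base + " / " + encoding;
    }

    /** Counts embedded copies per key; each entry is {fontName, encodingName}. */
    static Map<String, Integer> countMergeable(List<String[]> embeddedFonts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] font : embeddedFonts) {
            counts.merge(mergeKey(font[0], font[1]), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> fonts = new ArrayList<>();
        fonts.add(new String[] {"ABCDEF+Helvetica-Black", "WinAnsiEncoding"});
        fonts.add(new String[] {"GHIJKL+Helvetica-Black", "WinAnsiEncoding"});
        fonts.add(new String[] {"MNOPQR+Helvetica-Black", "MacRomanEncoding"});
        // Keys with a count above 1 are the copies worth collapsing into one.
        countMergeable(fonts).forEach((key, n) -> System.out.println(key + ": " + n));
    }
}
```

Any key with a count above one marks a group of copies that could in principle collapse into a single embedded font. Groups that differ only in encoding stay separate, matching the limitation noted in the message.
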
> Ideas? Is this completely insane, or possibly practical?
> The docs for PDFBox offer nearly zero information on its font APIs, so I
> presume I need to go delving directly into the PDF font data structures to
> do any of this. I know the PDF format's low level structure quite well, but
> know nearly nothing about the embedded font formats or their encodings, so
> I'm *really* hoping PDFBox offers some helpers for fonts that just aren't
> referenced in the docs. Any tips?
> Is there anything built-in for creating custom font subsets given a glyph
> list? For unembedding fonts?
> Anybody tried anything like this already?
> Tips/suggestions?
> --
> Craig Ringer
> POST Newspapers
> 276 Onslow Rd, Shenton Park
> Ph: 08 9381 3088     Fax: 08 9388 2258
> ABN: 50 008 917 717
> http://www.postnewspapers.com.au/
