pdfbox-users mailing list archives

From Nicklas Karlsson <nicka...@gmail.com>
Subject Re: Fonts in pdf to image conversion
Date Wed, 04 Apr 2012 07:18:57 GMT
Thanks for the information. I continued my search for libraries and
stumbled on ICEpdf from ICEsoft, where it works, so you could check for
hints in their source code while improving PDFBox ;-)

On Wed, Apr 4, 2012 at 9:57 AM, Hamed Iravanchi <iravanchi@gmail.com> wrote:

> Hi Nicklas,
>
> I've been working on this issue for a while.
> Right now, PDFBox cannot correctly convert PDF files created by OpenOffice
> or LibreOffice to images.
> In my tests, PDF files created by Microsoft Word do not have this problem
> in the latest Trunk code.
>
> This is due to using extracted text to render the image, rather than using
> code points.
> Andreas used to reply to my emails so we could collaborate and resolve such
> issues faster, but I haven't received any reply lately.
> I don't know if I'm posting in the right place or not, though...
>
> Anyway, to fix this issue for TrueType fonts (which are typically used in
> your case), PDFBox should do the following:
> - Use code points for all TrueType fonts, instead of extracted text.
> - Map the code points to glyph codes using the font's CMAP.
> - Use the glyph codes to draw text on the image.
>
> I just managed to fix this yesterday in my code for my sample PDF files, by
> modifying the trunk code.
> But I'm waiting for the developer team to collaborate so that I can make
> sure what I'm doing is right and doesn't break other parts of PDFBox.
>
> -Hamed
>
>
> On Wed, Mar 28, 2012 at 11:15 AM, Nicklas Karlsson <nickarls@gmail.com> wrote:
>
> > Hi,
> >
> >  I'm using the latest LibreOffice to produce a PDF and the latest PDFBox
> > to extract the pages as images, but I'm having some problems with the
> > fonts. If I use Times New Roman I get a
> >
> > org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
> > Changing font on <test> from <Times New Roman> to the default font
> >
> >  If I embed some more exotic fonts in the PDF, I get a
> >
> > org.apache.pdfbox.util.PDFStreamEngine processOperator
> > unsupported/disabled operation: BMC
> > org.apache.pdfbox.util.PDFStreamEngine processOperator
> > unsupported/disabled operation: EMC
> > org.apache.pdfbox.util.PDFStreamEngine processOperator
> > unsupported/disabled operation: BDC
> > org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
> > Changing font on <test> from <Algerian> to the default font
> >
> > This is all on the same machine. Is there a special trick in getting the
> > fonts working?
> >
> > The extraction is done with something like
> >
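> > PDDocument doc = PDDocument.load(pdf);
> > try
> > {
> >     List pages = doc.getDocumentCatalog().getAllPages();
> >     for (int i = 0; i < pages.size(); i++)
> >     {
> >         PDPage page = (PDPage) pages.get(i);
> >         pics.add(page.convertToImage()); // render the page to an image
> >     }
> > }
> > finally
> > {
> >     doc.close(); // release the document's resources
> > }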
> >
> >
> > Thanks in advance,
> >  Nik
> >
> > --
> > ---
> > Nik
> >
>
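
The three steps Hamed lists above (code points, CMAP lookup, glyph drawing) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PDFBox implementation or Hamed's patch: the class `GlyphMappingSketch`, its methods, and the code-to-glyph table are hypothetical, and a real table would have to be read from the embedded font. Only standard `java.awt` calls are used for the drawing step.

```java
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.font.GlyphVector;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these names come from PDFBox.
class GlyphMappingSketch
{
    // Steps 1 and 2: take the raw character codes from the content stream
    // (not the extracted text) and look each one up in a hypothetical
    // code -> glyph ID table built from the font's cmap.
    // Codes without an entry fall back to glyph 0, the TrueType ".notdef" glyph.
    static int[] toGlyphIds(int[] codes, Map<Integer, Integer> cmap)
    {
        int[] glyphIds = new int[codes.length];
        for (int i = 0; i < codes.length; i++)
        {
            Integer glyphId = cmap.get(codes[i]);
            glyphIds[i] = (glyphId != null) ? glyphId : 0;
        }
        return glyphIds;
    }

    // Step 3: draw by glyph ID, so AWT does no char -> glyph lookup of its own.
    static void drawGlyphs(Graphics2D g, Font font, int[] glyphIds, float x, float y)
    {
        GlyphVector gv = font.createGlyphVector(g.getFontRenderContext(), glyphIds);
        g.drawGlyphVector(gv, x, y);
    }

    public static void main(String[] args)
    {
        // Made-up table for demonstration; a real one comes from the embedded font.
        Map<Integer, Integer> cmap = new HashMap<Integer, Integer>();
        cmap.put(0x74, 87);
        cmap.put(0x65, 72);
        int[] ids = toGlyphIds(new int[] { 0x74, 0x65, 0x99 }, cmap);
        System.out.println(Arrays.toString(ids));
    }
}
```

The mapping is kept separate from the drawing so it can be checked without a font file; `drawGlyphs` would be called with a `Font` created from the embedded TrueType data and the `Graphics2D` of the page image.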



-- 
---
Nik
