From "Flynn, Peter" <pfl...@ucc.ie>
Subject Mismatch between XeLaTeX fontspec and Apache PDFBox
Date Thu, 25 Jan 2018 15:47:09 GMT
I have a very large number of bibliographic references in BibTeX format which we need to make
available individually in formal reference formats within web pages (as HTML, not as embedded
PDFs).

I experimented a couple of years ago with Apache PDFBox and found that it could extract the
text from a PDF and preserve bold and italics. This would let us use LaTeX to typeset each
reference as a PDF in the required format, and then have PDFBox extract the text with bold and
italics in
all the right places.

Regular pdflatex with old-style bibtex is insufficient, as it doesn't handle all the UTF-8
characters we need, and the reference formats supported are out of date; XeLaTeX with biblatex
and biber do all this just fine...but...

...if I do this using the fontspec package (the standard way to provide XeLaTeX with the font
data for handling UTF-8 diacritics), the output has all the accented characters, but PDFBox doesn't
recognise the bold or italic. If I omit the fontspec package, PDFBox can get the bold and
italics, but XeLaTeX will omit the diacritics.
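
In outline, the two test cases differ by a single line. The following is only a minimal sketch of that setup, not our production file: the file name refs.bib, the key somekey, and the authoryear style are placeholders I have assumed for illustration.

```latex
% Minimal sketch of the two variants described above.
% Assumptions (not our real data): refs.bib, the key "somekey",
% and style=authoryear are placeholders.
\documentclass{article}

% Variant A: with fontspec, diacritics are typeset but PDFBox
% does not report bold or italic.
% Variant B: comment out the next line; PDFBox then finds bold
% and italic, but XeLaTeX drops the diacritics.
\usepackage{fontspec}

\usepackage[backend=biber,style=authoryear]{biblatex}
\addbibresource{refs.bib}

\pagestyle{empty} % no page number, so only the reference is extracted

\begin{document}
\nocite{somekey}                  % one reference per PDF
\printbibliography[heading=none]  % suppress the "References" heading
\end{document}
```

Each variant is compiled with xelatex, then biber, then xelatex again, and the resulting PDF is passed to PDFBox.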

Examples of both PDFs and both HTML files are at http://epu.ucc.ie/latex/pdfbox-xelatex-fontspec-error.zip

As I don't know the internals either of fontspec or of PDFBox, I am hoping that someone on
the pdfbox mailing list or the comp.text.tex newsgroup may have a lead.


