pdfbox-users mailing list archives

From Adam Retter <adam.ret...@googlemail.com>
Subject Mangled diacritic characters in metadata
Date Mon, 18 Jul 2016 12:15:29 GMT
Using pdf-box-2.0.2:

I am trying to set dc:publisher to "Çâmára Münícìpål de Matelâñdia" in
the metadata of my PDF; however, the diacritical characters get
mangled when I read the PDF back.

My writing code looks like:

PDDocument doc = ...
PDDocumentCatalog catalog = ...

PDMetadata metadataStream = Optional.ofNullable(catalog.getMetadata())
  .orElseGet(() -> new PDMetadata(doc));
XMPMetadata xmpMetadata = null;
try(COSInputStream is = metadataStream.createInputStream()) {
  xmpMetadata = new DomXmpParser().parse(is);
} catch(XmpParsingException e) {
  // no parsable XMP in the stream, so start from empty metadata
  xmpMetadata = XMPMetadata.createXMPMetadata();
}
DublinCoreSchema dcMetadata = xmpMetadata.createAndAddDublinCoreSchema();
dcMetadata.addPublisher("Çâmára Münícìpål de Matelâñdia");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
XmpSerializer serializer = new XmpSerializer();
serializer.serialize(xmpMetadata, baos, false);

My reading code looks like:

PDDocument doc = PDDocument.load(is);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
PDMetadata metadata = catalog.getMetadata();
try(InputStream metadataIs = metadata.createInputStream()) {
   Files.copy(metadataIs, Paths.get("/tmp/metadata.xml"));
}

However, in the output XML I see this:

        <rdf:li>??m?ra M?n?c?p?l de Matel??dia</rdf:li>

So I guess something is wrong with the character encoding somewhere. Am
I doing something incorrectly (perhaps I need to specify UTF-8, my
character set, somewhere), or is this a bug in pdf-box?
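
The encoding theory can be checked outside of pdf-box. Below is a minimal
sketch in plain Java, assuming the text is converted to bytes somewhere with a
charset that cannot represent the accented characters. US-ASCII is used purely
as an assumption; nothing here confirms that this is what happens in the code
above or inside pdf-box. The class and method names are hypothetical, and the
publisher string is written with Unicode escapes so the source file itself is
pure ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of pdf-box: shows what happens to the
// publisher string when it passes through a given charset.
final class EncodingCheck {

    // The dc:publisher value from the message, as Unicode escapes.
    static final String PUBLISHER =
        "\u00C7\u00E2m\u00E1ra M\u00FCn\u00EDc\u00ECp\u00E5l de Matel\u00E2\u00F1dia";

    // Encode to bytes with the given charset, then decode with the same one.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption under test: an ASCII-only conversion loses every diacritic.
        // prints: ??m?ra M?n?c?p?l de Matel??dia
        System.out.println(roundTrip(PUBLISHER, StandardCharsets.US_ASCII));

        // UTF-8 can represent every character, so the round trip is lossless.
        // prints: true
        System.out.println(roundTrip(PUBLISHER, StandardCharsets.UTF_8).equals(PUBLISHER));

        // The platform default is what charset-less conversions fall back on.
        System.out.println(Charset.defaultCharset());
    }
}
```

The US-ASCII round trip gives exactly the mangled text shown in the output
XML, which is consistent with (but does not prove) a non-UTF-8 conversion
somewhere between serializing the XMP and writing it into the PDF.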

Cheers Adam.

Adam Retter

skype: adam.retter
tweet: adamretter
