pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Mangled diacritic characters in metadata
Date Tue, 19 Jul 2016 07:13:46 GMT
Hi,

> On 18.07.2016 at 14:15, Adam Retter <adam.retter@googlemail.com> wrote:
> 
> Using pdf-box-2.0.2:
> 
> I am trying to set dc:publisher to "Çâmára Münícìpål de Matelâñdia" in
> the metadata of my PDF however my diacritical characters seem to get
> mangled when I try and read the PDF back.
> 
> My writing code looks like:
> 
> PDDocument doc = ...
> PDDocumentCatalog catalog = ...
> 
> PDMetadata metadataStream = Optional.ofNullable(catalog.getMetadata())
>  .orElseGet(() -> new PDMetadata(doc));
> XMPMetadata xmpMetadata = null;
> try(COSInputStream is = metadataStream.createInputStream()) {
>  xmpMetadata = new DomXmpParser().parse(is);
> } catch(XmpParsingException e) {
>  LOG.warn(e);
>  xmpMetadata = XMPMetadata.createXMPMetadata();
> }
> DublinCoreSchema dcMetadata = xmpMetadata.createAndAddDublinCoreSchema();
> dcMetadata.addPublisher("Çâmára Münícìpål de Matelâñdia");
> catalog.setMetadata(metadataStream); // setMetadata expects the PDMetadata stream, not the XMPMetadata object
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> XmpSerializer serializer = new XmpSerializer();
> serializer.serialize(xmpMetadata, baos, false);
> metadataStream.importXMPMetadata(baos.toByteArray());
> 
> 
> My reading code looks like:
> 
> PDDocument doc = PDDocument.load(is);
> PDDocumentCatalog catalog = doc.getDocumentCatalog();
> PDMetadata metadata = catalog.getMetadata();
> // separate name so it does not clash with the PDF input stream `is` above
> try(InputStream metadataIn = metadata.createInputStream()) {
>   Files.copy(metadataIn, Paths.get("/tmp/metadata.xml"));
> }
> 
> 
> However in the output XML I am seeing this:
> 
> <dc:publisher>
>    <rdf:Bag>
>        <rdf:li>??m?ra M?n?c?p?l de Matel??dia</rdf:li>
>    </rdf:Bag>
> </dc:publisher>
> 
> 

I've tested various ways of saving the file (yours, serializing to a FileOutputStream, …)
and all of them work when viewing the content in a browser or a text editor.


<dc:publisher>
  <rdf:Bag>
    <rdf:li>Çâmára Münícìpål de Matelâñdia</rdf:li>
  </rdf:Bag>
</dc:publisher>

Where do you see that string?
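
To help narrow that down, here is a minimal sketch that does not involve PDFBox at all. It only shows that encoding this string with a charset that cannot represent the accented characters puts a `?` in place of each of them, while a UTF-8 round trip keeps them intact. The class name `EncodingCheck` and the helper `roundTrip` are made up for illustration, and it is only an assumption that something similar happens wherever you read or display the XML.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    // Hypothetical helper: encodes the string with the given charset and
    // decodes the resulting bytes again with the same charset.
    static String roundTrip(String s, Charset charset) {
        return new String(s.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "Çâmára Münícìpål de Matelâñdia", written with Unicode escapes so the
        // encoding of the source file itself cannot influence the result.
        String publisher =
            "\u00C7\u00E2m\u00E1ra M\u00FCn\u00EDc\u00ECp\u00E5l de Matel\u00E2\u00F1dia";

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(publisher.equals(roundTrip(publisher, StandardCharsets.UTF_8))); // true

        // US-ASCII cannot; String.getBytes(Charset) substitutes '?' for each
        // unmappable character.
        System.out.println(roundTrip(publisher, StandardCharsets.US_ASCII)); // ??m?ra M?n?c?p?l de Matel??dia
    }
}
```

If the second line looks like what you are seeing, it may be worth checking which charset is in effect at the point where the string is written, read, or displayed.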

BR
Maruan



> So I guess something is up with the character encoding somewhere? Is
> this something I am doing incorrectly, perhaps I need to specify UTF-8
> somewhere (my character set)? or is this a bug in pdf-box?
> 
> Cheers Adam.
> 
> 
> 
> 
> 
> -- 
> Adam Retter
> 
> skype: adam.retter
> tweet: adamretter
> http://www.adamretter.org.uk
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
> 

