pdfbox-users mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject TIKA-1678 PDF metadata extraction and UTF-16 encodings in the xmp
Date Wed, 15 Jul 2015 11:46:56 GMT

  Andrew Jackson recently opened TIKA-1678.  Tika tries to use the Dublin Core items from the
XMP, and if those don't exist, it falls back to whatever it can find in the "regular" metadata.

Andrew found that for ~200k out of 21 million files, the UTF-16 is incorrectly (possibly doubly?)
encoded in the XMP: \376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P

  Should we add a handler at the Tika level to deal with obviously BOM-marked strings that we're
getting from the XMP, or should that be handled by PDFBox?  We're still using JempBox... will
XMPBox handle these correctly?
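
  For discussion, here is a rough sketch of what a Tika-level handler might look like. It is
only an illustration under assumptions: the class and method names (XmpBomFix, unescapeOctal,
repair) are made up and are not part of Tika, PDFBox, JempBox, or XMPBox, and I'm assuming the
bad values arrive either as literal octal escapes (as in the sample above) or as one char per
raw byte, starting with the UTF-16BE BOM (0xFE 0xFF).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for discussion; not an existing Tika, PDFBox, JempBox, or XMPBox class.
final class XmpBomFix {

    // Turn literal octal escapes (backslash + three octal digits, max \377) into single chars.
    static String unescapeOctal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '\\' && i + 3 < s.length()
                    && isOctal(s.charAt(i + 1), '3')
                    && isOctal(s.charAt(i + 2), '7')
                    && isOctal(s.charAt(i + 3), '7')) {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                sb.append(s.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    private static boolean isOctal(char c, char max) {
        return c >= '0' && c <= max;
    }

    // If the value looks like UTF-16BE bytes stored one byte per char (BOM first),
    // decode it; otherwise return the original value untouched.
    static String repair(String value) {
        String s = unescapeOctal(value);
        if (s.length() < 2 || s.length() % 2 != 0
                || s.charAt(0) != '\u00FE' || s.charAt(1) != '\u00FF') {
            return value;
        }
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return value; // not a byte-per-char string, so leave it alone
            }
        }
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        // Skip the two BOM bytes and decode the rest as big-endian UTF-16.
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
    }
}
```

  With the sample above, XmpBomFix.repair(...) would give "Bullzip P". The sketch deliberately
leaves alone anything that doesn't start with the BOM or that has an odd number of bytes, so
ordinary values pass through unchanged.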

  Thank you!


