pdfbox-users mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Issues with extraction content of PDF files
Date Fri, 18 Dec 2015 17:57:44 GMT

I'm indexing some PDF documents in Solr. Certain PDF files contain Chinese
text, but after indexing, the indexed content is either a series of
"??????" or empty.

I've also tried the Tika app, and I get the same results.

What could be causing this?

I've shared one of the affected files on Dropbox, which you can access
via the link here:

