pdfbox-users mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Issues with extraction content of PDF files
Date Fri, 18 Dec 2015 18:40:18 GMT
Some context, so that you at least don't have to do the initial diagnosis.  From [0]:

>>That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode mapping for CID+71 (71) in font 505Eddc6Arial
>>So, if the file has no Unicode mapping for the font, I doubt they'll be able to fix
>>pdftotext is also unable to extract anything useful from the file.
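
Since a font with no Unicode mapping yields either nothing or placeholder characters, one option is to flag such documents before they reach the index. The following is only a rough sketch under my own assumptions: `ExtractionCheck`, `placeholderRatio`, `looksUnextractable`, and the 0.5 threshold are hypothetical names and values, not part of PDFBox, Tika, or Solr. It treats text that is empty, or that consists mostly of '?' or U+FFFD, as a likely failed extraction.

```java
// Hypothetical helper for illustration; not part of PDFBox, Tika, or Solr.
final class ExtractionCheck {

    // Fraction of non-whitespace code points that are '?' or U+FFFD.
    // Text with no non-whitespace content counts as fully failed (1.0).
    static double placeholderRatio(String text) {
        int total = 0;
        int placeholders = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            total++;
            if (cp == '?' || cp == 0xFFFD) {
                placeholders++;
            }
        }
        return total == 0 ? 1.0 : (double) placeholders / total;
    }

    // True when the text is null, empty, or dominated by placeholders.
    // The threshold is an assumed tuning knob, not a known-good value.
    static boolean looksUnextractable(String text, double threshold) {
        return text == null || placeholderRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        String[] samples = { "??????", "", "Hello, world?", "中文内容" };
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + looksUnextractable(s, 0.5));
        }
    }
}
```

A check like this only detects the symptom. It cannot recover text that the PDF does not map to Unicode, and it will misjudge documents that legitimately contain many question marks.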

 [0]  http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3CBY2PR09MB11297223E13E266CFB2A5FFC7E00@BY2PR09MB112.namprd09.prod.outlook.com%3E

-----Original Message-----
From: Zheng Lin Edwin Yeo [mailto:edwinyeozl@gmail.com] 
Sent: Friday, December 18, 2015 12:58 PM
To: users@pdfbox.apache.org
Subject: Issues with extraction content of PDF files


I'm indexing some PDF documents in Solr. However, certain PDF files contain Chinese
text, and after indexing, the indexed content is either a series
of "??????" or empty.

I've also tried the Tika app, and I get the same results.

What could be causing this?

I've shared one of the files with the issue on Dropbox, which you can access via the link here:
