pdfbox-users mailing list archives

From Stephen Haggai <Stephen.Hag...@swanninsurance.com.au>
Subject RE: Can't extract text Adobe-WinCharSetFFFF-UCS2
Date Fri, 20 Jul 2012 03:44:36 GMT



I have looked at the PDF file. It looks as if the text on all the pages was scanned in as
images. I am certain that one cannot extract text from (text scanned as) images using PDFBox,
since it does not perform OCR. Could someone correct me if I am wrong?
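
One rough way to check for scanned pages is to look at what text extraction returns for each page. The sketch below is a hypothetical helper, not part of PDFBox: it assumes you have already collected one string per page (for example from PDFTextStripper.getText with setStartPage and setEndPage set to the same page) and flags the pages whose text contains no letters or digits. It is only a heuristic. A page with a broken font mapping can still return gibberish rather than nothing, so a page that is not flagged is not guaranteed to be readable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a PDFBox class): flags pages with no usable extracted text.
final class ScannedPageCheck {

    // True when the extracted text is null or contains no letters or digits,
    // which is what an image-only (scanned) page would typically yield.
    static boolean looksImageOnly(String extractedText) {
        if (extractedText == null) {
            return true;
        }
        return extractedText.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    // Returns the 1-based numbers of the pages whose text looks image-only.
    static List<Integer> imageOnlyPages(List<String> pageTexts) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (looksImageOnly(pageTexts.get(i))) {
                pages.add(i + 1);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Made-up sample input: page 2 has only whitespace, page 3 has Hangul text.
        List<String> sample = Arrays.asList("Hello", "  \n", "\uD55C\uAD6D\uC5B4");
        System.out.println(imageOnlyPages(sample)); // prints [2]
    }
}
```

If every page is flagged, the file most likely needs OCR from a separate tool before any text can be recovered.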


-----Original Message-----
From: Big Donkeys [mailto:big.donkeys@yahoo.com] 
Sent: Friday, 20 July 2012 6:09 AM
To: users@pdfbox.apache.org
Subject: Can't extract text Adobe-WinCharSetFFFF-UCS2

Hi, I'm having some trouble extracting text from some South Korean PDF files using
PDFTextStripper.  When I try, I get a "severe error could not parse predefined CMAP file for
'Adobe-WinCharSetFFFF-UCS2'" message, and then it gives me some gibberish.  The file
opens and displays fine in Adobe Reader.  I'm using pdfbox-app-1.7.0.jar.

Here is a link to an example PDF that gives me trouble:


Any ideas?  


