pdfbox-users mailing list archives

From "Andreas Lehmkühler" <andr...@lehmi.de>
Subject RE: Can't extract text Adobe-WinCharSetFFFF-UCS2
Date Fri, 20 Jul 2012 08:02:25 GMT

Stephen Haggai <Stephen.Haggai@swanninsurance.com.au> wrote on 20 July 2012 at
05:44:

> Hi,
> I have looked at the PDF file. It looks as if text in all the pages were
> scanned as images. I am certain that one cannot extract text from (text
> scanned as) images using PDFBox. Could someone correct me if I am wrong.

You are correct. The PDFs consist of scanned text, and PDFBox can't extract
that text, only the images. Those images could be fed to OCR software to get
the text. I haven't tried that, but it should work, with more or less precision.

BTW: It is always a good idea to first try extracting the text with Adobe
Reader. Just select the text, then copy and paste it into an editor. If that
doesn't work, it most likely won't work with PDFBox either.
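
The same sanity check can be roughly automated on whatever string the
extraction returns (for example, the result of PDFTextStripper). The following
is only a sketch under my own assumptions: the class name, the method names and
the 0.5 threshold are made up for illustration and are not part of PDFBox. It
flags output that is empty or consists mostly of non-letter debris, which is
what a scanned page typically yields. It will not catch text that was decoded
with the wrong CMap but still consists of valid letters.

```java
// Hypothetical helper, not part of PDFBox: decides whether extracted text
// looks unusable (empty or mostly non-letter debris) so that an OCR
// fallback on the page images is worth trying.
final class ExtractedTextCheck {

    // Assumed threshold: below this share of letters/digits the text is
    // treated as unusable. Tune it for your own documents.
    private static final double MIN_READABLE_RATIO = 0.5;

    // Share of non-whitespace code points that are letters or digits.
    // Returns 0.0 when there is no non-whitespace content at all.
    static double readableRatio(String text) {
        int total = 0;
        int readable = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            total++;
            if (Character.isLetterOrDigit(cp)) {
                readable++;
            }
        }
        return total == 0 ? 0.0 : (double) readable / total;
    }

    // True when the text is null, blank, or mostly unreadable characters.
    static boolean needsOcr(String text) {
        return text == null || readableRatio(text) < MIN_READABLE_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(needsOcr("   \n\n  "));
        System.out.println(needsOcr("Tide tables for 2012"));
    }
}
```

Letters are counted per Unicode code point, so Korean text counts as readable
just like Latin text does.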

> Thanks,
> Stephen
> -----Original Message-----
> From: Big Donkeys [mailto:big.donkeys@yahoo.com]
> Sent: Friday, 20 July 2012 6:09 AM
> To: users@pdfbox.apache.org
> Subject: Can't extract text Adobe-WinCharSetFFFF-UCS2
> Hi, I'm having some troubles extracting text from some South Korean PDF
> files using PDFTextStripper.  When I try I get a "severe error could not parse
> predefined CMAP file for 'Adobe-WinCharSetFFFF-UCS2'" message and then
> gives me some gibberish.  File opens and displays fine in Adobe reader.
>   I'm using pdfbox-app-1.7.0.jar.
> Here is a link to an example PDF that gives me trouble:
> http://eng.khoa.go.kr/inc/func/fileDownloadBlob_nori.asp?cmsCd=CM0237&ntNo=626&fNo=4
> Any ideas?

Andreas Lehmkühler