pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Can't extract text Adobe-WinCharSetFFFF-UCS2
Date Sat, 21 Jul 2012 17:40:36 GMT
Hi,

On 20.07.2012 10:02, "Andreas Lehmkühler" wrote:
> Hi,
>
>
> Stephen Haggai <Stephen.Haggai@swanninsurance.com.au> wrote on 20 July 2012 at
> 05:44:
>
>>
>> Hi,
>>
>> I have looked at the PDF file. It looks as if the text on all pages was
>> scanned as images. I am certain that one cannot extract text from (text
>> scanned as) images using PDFBox. Could someone correct me if I am wrong?
>
>
> You are correct. The PDFs consist of scanned text, and yes, PDFBox can't extract
I have to correct myself. There is no single image containing the scanned text.
The text consists of thousands of small lines, like the following three:


319.5 3175.84 m
319.5 3175.84 l
S
353.5 3175.84 m
353.5 3175.84 l
S
376.5 3175.84 m
376.5 3175.84 l
S
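
To check how much of a page is drawn this way, one could count such paths in a decoded content stream. The following is only a minimal sketch in plain Java, not PDFBox code: the class name ZeroLengthStrokes and the method count are hypothetical, and it assumes the stream has already been decompressed to text and that operators and operands are separated by whitespace, as in the sample above. A real content stream needs a proper tokenizer.

```java
/** Hypothetical helper: counts "x y m / x y l / S" paths whose two points coincide. */
final class ZeroLengthStrokes {

    /** Returns the number of move-to/line-to/stroke triples with identical endpoints. */
    static int count(String contentStream) {
        // Assumption: operands and operators are plain whitespace-separated tokens.
        String[] t = contentStream.trim().split("\\s+");
        int hits = 0;
        // Pattern to match, seven tokens long: x1 y1 m x2 y2 l S
        for (int i = 0; i + 6 < t.length; i++) {
            boolean shape = t[i + 2].equals("m")
                    && t[i + 5].equals("l")
                    && t[i + 6].equals("S");
            if (!shape) {
                continue;
            }
            try {
                double x1 = Double.parseDouble(t[i]);
                double y1 = Double.parseDouble(t[i + 1]);
                double x2 = Double.parseDouble(t[i + 3]);
                double y2 = Double.parseDouble(t[i + 4]);
                // Start and end point are the same: a zero-length stroke.
                if (x1 == x2 && y1 == y2) {
                    hits++;
                }
            } catch (NumberFormatException e) {
                // Operands were not numeric, so this is not the pattern we want.
            }
        }
        return hits;
    }
}
```

Passing the three sample paths above as one string to ZeroLengthStrokes.count should report each of them, since every path starts and ends at the same point.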

So if you want to pass an image to OCR software, you have to use PDFToImage.


BR
Andreas Lehmkühler

> that text, only the images. Those could be fed to OCR software to get the
> text. I didn't try that, but it should work more or less precisely.
>
> BTW: It is always a good idea to first try extracting the text with Adobe
> Reader. Just select the text, then copy and paste it into an editor. If that
> doesn't work, it most likely won't work using PDFBox either.
>
>
>>
>> Thanks,
>> Stephen
>>
>> -----Original Message-----
>> From: Big Donkeys [mailto:big.donkeys@yahoo.com]
>> Sent: Friday, 20 July 2012 6:09 AM
>> To: users@pdfbox.apache.org
>> Subject: Can't extract text Adobe-WinCharSetFFFF-UCS2
>>
>> Hi, I'm having some trouble extracting text from some South Korean PDF
>> files using PDFTextStripper.  When I try, I get a "severe error could not parse
>> predefined CMAP file for'Adobe-WinCharSetFFFF-UCS2'" message and it then
>> gives me some gibberish.  The file opens and displays fine in Adobe Reader.
>> I'm using pdfbox-app-1.7.0.jar.
>>
>> Here is a link to an example PDF that gives me trouble:
>>
>> http://eng.khoa.go.kr/inc/func/fileDownloadBlob_nori.asp?cmsCd=CM0237&ntNo=626&fNo=4
>>
>> Any ideas?
>>
>
> Br
> Andreas Lehmkühler

