pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Issues with extraction content of PDF files
Date Fri, 01 Jan 2016 02:43:17 GMT

> On 29 Dec 2015, at 00:34, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> 
> Thanks for your reply Tilman.
> 
> I would like to find out: is the content extraction issue with this file
> caused by the Identity-H encoding?

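Most likely. Identity-H is basically just "no encoding": the two-byte character
codes are used directly as CIDs, which by themselves do not say which characters
they represent. So there needs to be a ToUnicode map in order to extract the
text, and this file doesn't have one.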
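
To illustrate the point, here is a minimal standalone sketch in plain Java. It
does not use the PDFBox API. The class name, the method names, the byte values
and the mapping table are all hypothetical, and the table only stands in for a
font's ToUnicode CMap. It shows roughly why the same bytes are extractable with
a mapping and not without one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not how PDFBox is implemented.
class IdentityHSketch {

    // Under Identity-H each character code is two bytes, big-endian,
    // and the code is used as the CID unchanged.
    static int[] decodeCids(byte[] shown) {
        int[] cids = new int[shown.length / 2];
        for (int i = 0; i < cids.length; i++) {
            cids[i] = ((shown[2 * i] & 0xFF) << 8) | (shown[2 * i + 1] & 0xFF);
        }
        return cids;
    }

    // toUnicode stands in for the font's ToUnicode map; null means the font
    // has none. Unmapped codes become U+FFFD (the replacement character).
    static String extractText(byte[] shown, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int cid : decodeCids(shown)) {
            String s = (toUnicode == null) ? null : toUnicode.get(cid);
            sb.append(s != null ? s : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two made-up glyph codes as they might appear in a content stream.
        byte[] shown = {0x00, 0x2B, 0x00, 0x4C};

        // Made-up mapping table, assumed for this example only.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x002B, "H");
        toUnicode.put(0x004C, "i");

        System.out.println(extractText(shown, toUnicode)); // with a mapping
        System.out.println(extractText(shown, null));      // without one
    }
}
```

With the table, the two codes come out as readable text. Without it, the
extractor only knows glyph numbers, which is the situation with this file.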

-- John

> Regards,
> Edwin
> 
> 
>> On 21 December 2015 at 16:12, Tilman Hausherr <THausherr@t-online.de> wrote:
>>> On 21.12.2015 at 04:08, Zheng Lin Edwin Yeo wrote:
>>> Thanks for your reply.
>>> 
>>> I tried Adobe Acrobat Pro DC and it is able to open the file, but Adobe
>>> Reader is not able to extract all the text properly.
>>> 
>>> Is there any way we can check what type of encoding is used for the
>>> PDF files?
>> 
>> Yes, in the font dictionaries, as you can see from this screenshot:
>> 
>> [screenshot not preserved in the archive]
>> 
>> However this won't get you the text, obviously.
>> 
>> Tilman
>> 
>>> Regards,
>>> Edwin
>>> 
>>> 
>>> 
>>> 
>>> On 19 December 2015 at 03:07, Tilman Hausherr <THausherr@t-online.de> wrote:
>>> 
>>>>> On 18.12.2015 at 18:57, Zheng Lin Edwin Yeo wrote:
>>>>> 
>>>>> I've shared one of the files with the issue on Dropbox, which you can
>>>>> access via the link here:
>>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>>>>> 
>>>> Adobe Reader is also unable to extract text.