pdfbox-users mailing list archives

From adues...@mail.uni-paderborn.de
Subject Re: Problems converting Special characters from PDF to text
Date Tue, 11 Mar 2014 13:17:28 GMT
I found a way to locate the brackets.

The only remaining problem is this:
I only get the correct output text file with PDFlib, not with PDFBox.
I can't use PDFlib for documents with more than one page.

I have attached the two different outputs.

Quoting aduester@mail.uni-paderborn.de:

> When I open the converted txt file in a browser, it is displayed correctly.
> Is there a way to convert the character in the PDF to its Unicode notation in the txt file?
>
> Like
>
> text U+2308 text
>
>
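The notation asked about here ("text U+2308 text") can be sketched in plain Java. This is only a minimal sketch under assumptions: it uses the Java standard library, not any PDFBox API, the class and method names are hypothetical, and it only helps if the extracted text already contains the correct code point (which, per this thread, was not the case with PDFBox for this file).

```java
// Hypothetical helper, not part of PDFBox: rewrites every character outside
// printable ASCII (except tab and line breaks) as U+ followed by its hex code point.
class UnicodeNotation {

    static String toUnicodeNotation(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            boolean keep = (cp >= 0x20 && cp <= 0x7E)
                    || cp == '\n' || cp == '\r' || cp == '\t';
            if (keep) {
                out.appendCodePoint(cp);
            } else {
                // At least four uppercase hex digits, e.g. U+2308
                out.append(String.format("U+%04X", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "text U+2308 text"
        System.out.println(toUnicodeNotation("text \u2308 text"));
    }
}
```

Note that this rewrites all non-ASCII characters, so German umlauts would be replaced as well; a real filter would probably restrict itself to the bracket code points of interest.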
> Quoting aduester@mail.uni-paderborn.de:
>
>> Thanks for the advice. I tried copying the character into Notepad++
>> and got the same behavior. If I use PDFlib, I get blank squares
>> for the special characters.
>>
>> The characters are these:
>> http://www.fileformat.info/info/unicode/char/2308/index.htm
>> http://www.fileformat.info/info/unicode/char/230b/index.htm
>>
>> I don't have the Pro version of Acrobat.
>>
>> I have added an attachment.
>>
>> Thank you
>>
>> Quoting Olaf Drümmer <olaflist@callassoftware.com>:
>>
>>> If the text encoding or ToUnicode table for that character does not
>>> make the connection to the right Unicode value, then it can't be
>>> extracted properly.
>>>
>>> There are two quick ways to double check:
>>> - use a recent version of Adobe Reader or Adobe Acrobat, copy the
>>> piece of text in question and paste it into a Unicode-enabled text
>>> window or control
>>> - try text extraction with PDFlib TET (cf.
>>> http://www.pdflib.com/download/tet/ )
>>>
>>> If neither of these gets the right Unicode values, you are probably
>>> out of luck. If they show the right character, report back.
>>>
>>> You could also use a low-level inspection tool (for example in
>>> Acrobat Pro, use Preflight and from the Preflight window's options
>>> menu choose "Explore PDF structure") and drill down to the respective
>>> font resource and find out whether it has a decent ToUnicode entry
>>> or not.
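To illustrate why a missing ToUnicode entry can turn a bracket into a plain letter, here is a heavily simplified Java sketch. It is not how PDFBox or any real extractor is implemented; it only models a lookup from font character codes to Unicode with a naive fallback. The code values 0x64 and 0x63 are assumptions chosen to match the "d" and "c" reported in this thread, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model: a ToUnicode-like table maps font character codes to Unicode text.
// Without an entry, this sketch falls back to reading the code as a Latin character.
class ToUnicodeSketch {

    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            out.append(mapped != null ? mapped : String.valueOf((char) code));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed codes under which the symbol font stores the two brackets
        int[] codes = {0x64, 0x63};

        // No mapping available: the codes come out as ordinary letters
        System.out.println(decode(codes, new HashMap<>())); // prints "dc"

        // With a mapping, the same codes resolve to U+2308 and U+230B
        Map<Integer, String> table = new HashMap<>();
        table.put(0x64, "\u2308");
        table.put(0x63, "\u230B");
        System.out.println(decode(codes, table));
    }
}
```

In a real PDF the font's encoding and glyph names also play a role, so the fallback shown here is only one possible way the wrong letters could arise.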
>>>
>>> Olaf
>>>
>>>
>>> ---
>>>
>>> Olaf Druemmer | Managing Director | callas software GmbH |   
>>> Schoenhauser Allee 6/7 | 10119 Berlin
>>> Tel +49.30.4439031-0 | Fax +49.30.4416402 |   
>>> o.druemmer@callassoftware.com | www.callassoftware.com
>>>
>>>
>>>
>>>
>>> On 10 Mar 2014 at 19:56, Andreas Düster <aduester@mail.upb.de> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using PDFBox 1.7.0 (the unofficial .NET conversion from
>>>> http://pdfbox.lehmi.de/) to convert a PDF to text. It works fine
>>>> for me except for one thing.
>>>> My problem is that the PDF contains Gaussian brackets, which are
>>>> converted to single letters (the right ceil is converted to a "d"
>>>> and the left floor one to a "c"). I need at least something unique
>>>> because I want to parse the text later and need to locate the
>>>> brackets.
>>>> I am not sure whether that's a PDFBox problem at all. If I copy the
>>>> bracket out of the PDF manually, it's the same behavior. Any ideas
>>>> that could help me?
>>>>
>>>> Thanks!!
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
>

