pdfbox-users mailing list archives

From adues...@mail.uni-paderborn.de
Subject Re: Problems converting Special characters from PDF to text
Date Tue, 11 Mar 2014 09:26:27 GMT
I see that attachments don't make it through to the list.

here is my pdf sample:
http://www.file-upload.net/download-8701725/sample.pdf.html

and here is the text as converted with PDFlib:
http://www.file-upload.net/download-8701726/sampleConvertedWithPdflib.txt.html


Quoting Olaf Drümmer <olaflist@callassoftware.com>:

> If the text encoding or ToUnicode table for that character does not
> map it to the right Unicode value, then it can't be
> extracted properly.
>
> There are two quick ways to double check:
> - use a recent version of Adobe Reader or Adobe Acrobat, copy the  
> piece of text in question and paste it into a Unicode enabled text  
> window or control
> - try text extraction with PDFlib TET (cf.  
> http://www.pdflib.com/download/tet/ )
>
> If neither of these gets the right Unicode values, you are probably
> out of luck. If they are showing the right character, report back.
>
> You could also use a low level inspection tool (for example in  
> Acrobat Pro, use Preflight and from the Preflight window's options  
> menu choose "Explore PDF structure") and drill down to the respective
> font resource and find out whether it has a decent ToUnicode entry
> or not.
>
> Olaf
>
>
> ---
>
> Olaf Druemmer | Managing Director | callas software GmbH |  
> Schoenhauser Allee 6/7 | 10119 Berlin
> Tel +49.30.4439031-0 | Fax +49.30.4416402 |  
> o.druemmer@callassoftware.com | www.callassoftware.com
>
> PDF Days Europe 2014 - June 16-17, 2014 · Cologne
> Two days packed with PDF. Register now at:
> http://pdfa.org/pdf-days-europe-2014
>
>
>
> On 10 Mar 2014 at 19:56, Andreas Düster <aduester@mail.upb.de> wrote:
>
>> Hi,
>>
>> I am using PDFBox 1.7.0 (the unofficial .NET conversion from
>> http://pdfbox.lehmi.de/) to convert a PDF to text. It works fine
>> for me except for one thing.
>> My problem is that the PDF contains Gaussian brackets, which are
>> converted to single letters (the right ceiling bracket is converted
>> to a "d" and the left floor bracket to a "c"). I need at least
>> something unique, because I want to parse the text later and I need
>> to locate the brackets.
>> I am not sure whether that's a PDFBox problem at all. If I copy the
>> bracket out of the PDF manually, I get the same behavior. Any idea
>> how to solve this?
>>
>> Thanks!!
>>
>>
>
>
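
To illustrate the point about the ToUnicode table, here is a minimal Python sketch of how such a table turns character codes into Unicode text, and what comes out when the entry is missing. Everything in it is an assumption made for illustration: the CMap fragment, the codes 0x62 to 0x65, and the helper names `parse_bfchar` and `extract` are hypothetical and are not taken from the sample PDF or from PDFBox. The Latin-1 fallback is also a simplification; a real extractor would consult the font's encoding first.

```python
import re

# Hypothetical ToUnicode CMap fragment, written for illustration only.
# It maps the single-byte codes 0x62..0x65 to floor and ceiling brackets.
SAMPLE_CMAP = """
4 beginbfchar
<62> <230A>
<63> <230B>
<64> <2308>
<65> <2309>
endbfchar
"""

BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for all bfchar entries."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code_hex, uni_hex in BFCHAR_ENTRY.findall(block):
            # The destination value is hex-encoded UTF-16BE.
            mapping[int(code_hex, 16)] = bytes.fromhex(uni_hex).decode("utf-16-be")
    return mapping


def extract(codes, to_unicode):
    """Decode single-byte codes; codes without an entry fall back to Latin-1."""
    return "".join(to_unicode.get(code, chr(code)) for code in codes)


# Character codes as they might appear in a page's content stream (assumed).
shown_text = b"dxe"

with_map = extract(shown_text, parse_bfchar(SAMPLE_CMAP))
without_map = extract(shown_text, {})

print(with_map)
print(without_map)
```

With the table present, the codes 0x64 and 0x65 come out as ceiling brackets; with an empty table the same bytes come out as the plain letters "d" and "e". That is the kind of result described in the original question, and it would happen in the extractor and in a manual copy and paste alike.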



