pdfbox-users mailing list archives

From Andreas Düster <adues...@mail.upb.de>
Subject Re: Problems converting Special characters from PDF to text
Date Wed, 12 Mar 2014 07:07:51 GMT
Hi,

The PDF comes from a specification, so I can't control how it is
created. The result I get with PDFlib is OK for me, but I want to use
PDFBox for it. Is this maybe a configuration issue in PDFBox?
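
A possible post-processing workaround, as a minimal sketch only: if the extraction step can report which font each piece of text was set in, the letters that stand in for the brackets can be mapped back to Unicode, or written as "U+2308"-style markers as asked further down in the thread. The class name, the method names and the "CMSY" font-name hint are hypothetical, and the letter-to-bracket mapping is an assumption based on the "d"/"c" substitution and the two code points (U+2308, U+230B) mentioned in this thread. No PDFBox call is used or implied here.

```java
// Hypothetical helper, not part of PDFBox: remaps stand-in letters to the
// bracket characters when the text run comes from an assumed symbol font.
final class BracketRemapSketch {

    // Assumption: the brackets are set in a font whose name contains "CMSY".
    static final String SYMBOL_FONT_HINT = "CMSY";

    // Replaces 'd' with U+2308 and 'c' with U+230B, but only for text that
    // was extracted from the symbol font; all other text is left untouched.
    static String remap(String text, String fontName) {
        if (fontName == null || !fontName.contains(SYMBOL_FONT_HINT)) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case 'd': out.append('\u2308'); break; // left ceiling
                case 'c': out.append('\u230B'); break; // right floor
                default:  out.append(ch);
            }
        }
        return out.toString();
    }

    // Writes every non-ASCII code point as a "U+XXXX" marker so that a later
    // parser can find the brackets in a plain ASCII text file.
    static String toMarkers(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append(String.format("U+%04X", cp));
            }
        });
        return out.toString();
    }
}
```

The font check matters because an ordinary "d" or "c" in body text must not be rewritten; only runs from the symbol font are touched.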


-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de]
Sent: Tuesday, 11 March 2014 15:42
To: users@pdfbox.apache.org
Subject: Re: Problems converting Special characters from PDF to text

Hi,

I took a quick look, but at this point I can’t think of an easy
solution. Adobe Reader gives the same result when copying and pasting the
text. I know that is not very helpful in your case. As the internal
structure of your PDF uses what’s called a Differences array, one would need
more time to inspect the characters, the encoding and the glyphs to find out
whether the result could be improved. I’m not very optimistic though.

Is there a way to control how the PDF is created? Could that be changed?

BR
Maruan Sahyoun
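
To illustrate the Differences array mentioned above, here is a minimal sketch of the idea, not PDFBox code. A Differences array overrides single character codes of a base encoding with glyph names; an extractor then has to turn the glyph name into Unicode. The glyph names used here (ceilingleft, floorright), the code values and the fallback rule are assumptions chosen for illustration, since the actual contents of the sample PDF are not shown in this thread. The sketch shows one way a bracket can end up as a plain letter: the glyph name is not resolved, so the code is read through the base encoding.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Differences array on top of an ASCII-like base
// encoding, and a decoder that falls back to the base encoding.
final class DifferencesSketch {

    // Differences entries: character code -> glyph name (names are assumed).
    static final Map<Integer, String> DIFFERENCES = new HashMap<>();

    static {
        DIFFERENCES.put(0x64, "ceilingleft"); // 0x64 is 'd' in ASCII
        DIFFERENCES.put(0x63, "floorright");  // 0x63 is 'c' in ASCII
    }

    // Decodes one character code. glyphToUnicode is the glyph-name table the
    // extractor knows; if the name is missing there, the code is interpreted
    // as a base-encoding character instead.
    static String decode(int code, Map<String, String> glyphToUnicode) {
        String glyphName = DIFFERENCES.get(code);
        if (glyphName != null) {
            String unicode = glyphToUnicode.get(glyphName);
            if (unicode != null) {
                return unicode;
            }
        }
        return String.valueOf((char) code); // fallback: base encoding
    }

    public static void main(String[] args) {
        Map<String, String> known = new HashMap<>();
        System.out.println(decode(0x64, known)); // glyph name not resolved
        known.put("ceilingleft", "\u2308");
        System.out.println(decode(0x64, known)); // glyph name resolved
    }
}
```

Whether the sample PDF fails in exactly this way would have to be checked by inspecting its font resource, as suggested earlier in the thread.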

On 11.03.2014 at 14:17, aduester@mail.uni-paderborn.de wrote:

> I found a solution to locate the brackets.
> 
> The only problem I have is:
> I only get the right output text file with PDFlib and not with PDFBox.
> I can't use PDFlib for documents with more than one page.
> 
> I attach the 2 different outputs
> 
> Quoting aduester@mail.uni-paderborn.de:
> 
>> When I open the converted txt file in a browser, it is displayed correctly.
>> Is there a way to convert the character in the PDF to Unicode in the txt?
>> 
>> like
>> 
>> text U+2308 text
>> 
>> 
>> Quoting aduester@mail.uni-paderborn.de:
>> 
>>> Thanks for the advice. I tried copying the character to Notepad++ and
>>> it's the same behavior. If I use PDFlib, I get blank squares for the
>>> special characters.
>>> 
>>> the characters are these:
>>> http://www.fileformat.info/info/unicode/char/2308/index.htm
>>> http://www.fileformat.info/info/unicode/char/230b/index.htm
>>> 
>>> I don't have the Acrobat Pro version.
>>> 
>>> Attachment added.
>>> 
>>> thank you
>>> 
>>> Quoting Olaf Drümmer <olaflist@callassoftware.com>:
>>> 
>>>> If the text encoding or ToUnicode table for that character does not
>>>> make the connection to the right Unicode value, then it can't be
>>>> extracted properly.
>>>> 
>>>> There are two quick ways to double check:
>>>> - use a recent version of Adobe Reader or Adobe Acrobat, copy the  
>>>> piece of text in question and paste it into a Unicode enabled text  
>>>> window or control
>>>> - try text extraction with PDFlib TET (cf.  
>>>> http://www.pdflib.com/download/tet/ )
>>>> 
>>>> If neither of these gets the right Unicode values, you are probably
>>>> out of luck. If they are showing the right character, report back.
>>>> 
>>>> You could also use a low-level inspection tool (for example, in Acrobat
>>>> Pro, use Preflight and from the Preflight window's options menu choose
>>>> "Explore PDF structure") and drill down to the respective font resource
>>>> to find out whether it has a decent ToUnicode entry or not.
>>>> 
>>>> Olaf
>>>> 
>>>> 
>>>> ---
>>>> 
>>>> Olaf Druemmer | Managing Director | callas software GmbH |  
>>>> Schoenhauser Allee 6/7 | 10119 Berlin Tel +49.30.4439031-0 | Fax 
>>>> +49.30.4416402 |  o.druemmer@callassoftware.com | 
>>>> www.callassoftware.com
>>>> 
>>>> PDF Days Europe 2014 - June 16-17, 2014 · Cologne
>>>> Two days packed with PDF. Register now at:
>>>> http://pdfa.org/pdf-days-europe-2014
>>>> 
>>>> 
>>>> 
>>>> On 10 Mar 2014 at 19:56, Andreas Düster <aduester@mail.upb.de> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am using PDFBox 1.7.0 (the unofficial .NET conversion from
>>>>> http://pdfbox.lehmi.de/) to convert a PDF to text. It works fine for
>>>>> me except for one thing.
>>>>> My problem is that the PDF contains Gaussian brackets, which are
>>>>> converted to single letters (the right ceil is converted to a "d" and
>>>>> the left floor one to a "c"). I need at least something unique,
>>>>> because I want to parse the text later and need to locate the brackets.
>>>>> I am not sure whether this is a PDFBox problem at all. If I copy the
>>>>> bracket out of the PDF manually, it's the same behavior. Any idea how
>>>>> to help me?
>>>>> 
>>>>> Thanks!!
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
> <sampleConvertedWithPdflib.txt><sampleConvertedByPDFBox.txt><sample.pdf>


