pdfbox-users mailing list archives

From "Ahmet Aker" <ahmet.a...@sheffield.ac.uk>
Subject Arabic compound characters not recognized by pdfbox
Date Sun, 21 Sep 2014 20:34:31 GMT
Hi,

I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not
images of text) to HTML. PDFBox works really well in most cases;
however, it has problems recognizing compound characters. I am
attaching a sample PDF file. In it, for example, I get
&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610; but I should
be getting
&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610; (الأفغاني).
PDFBox misses the third character, &#1571; (أ). The same applies to:

&#1575; (PDFBox output) instead of &#1575;&#1604;&#1604;&#1607; (الله)

Could this have something to do with the encodings? I hope you can help
me with this matter.
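
To pin down exactly which code points go missing, the extracted and the
expected strings can be decoded and compared. Below is a minimal sketch in
plain Java. It makes no PDFBox calls; the class EntityDiff and its methods
decode and missing are hypothetical names introduced only for this
illustration, and the comparison assumes the extracted text is a
subsequence of the expected text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: decodes decimal character
// references and reports which code points of the expected text are
// absent from the extracted text.
class EntityDiff {

    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d+);");

    // Turn "&#1575;&#1604;" into the corresponding Unicode string.
    public static String decode(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(html.substring(last));
        return out.toString();
    }

    // Walk the expected text and collect the code points that the actual
    // text skips (assumption: actual is a subsequence of expected).
    public static List<String> missing(String expected, String actual) {
        List<String> result = new ArrayList<>();
        int j = 0;
        for (int i = 0; i < expected.length(); ) {
            int cp = expected.codePointAt(i);
            if (j < actual.length() && actual.codePointAt(j) == cp) {
                j += Character.charCount(cp);
            } else {
                result.add(String.format("U+%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        String actual = decode("&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610;");
        String expected = decode("&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610;");
        System.out.println(missing(expected, actual));
    }
}
```

For the first example this prints [U+0623], which is ARABIC LETTER ALEF
WITH HAMZA ABOVE. For the second example the missing code points are
U+0644, U+0644 and U+0647.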

 

Many thanks,

ahmet