pdfbox-users mailing list archives

From Yushuang Hao <yushuang....@codean.com>
Subject PDFBox issues
Date Wed, 11 Jul 2012 10:08:39 GMT
Dear Sir/Madam,

I ran into two issues when using PDFBox 1.7.0 to convert a PDF to
text:

Firstly, the PDF is purely in English, but after conversion I get random CJK
characters in it. I have worked out that Latin characters lie in the range
0x0000 to 0x00FF in Unicode and so fit in 1 byte each; somehow the conversion
randomly merged two Latin characters into a single 2-byte CJK character. For
example, I got "卥" (0x5365) instead of "S" (0x0053) followed by "e"
(0x0065). I don't know how this happened, but I managed to convert these
back to the right characters.
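
For anyone hitting the same symptom, here is a minimal sketch of the kind of
post-processing described above. It is not the exact code from this message
and uses no PDFBox API; the class name MergedLatinFix and the method unmerge
are hypothetical. It assumes the text is purely English, so any character
whose high byte and low byte are both printable ASCII can be treated as two
merged Latin characters. That assumption would corrupt genuine CJK text.

```java
public class MergedLatinFix {
    // True if b is a printable ASCII byte (space through tilde).
    private static boolean isPrintableAscii(int b) {
        return b >= 0x20 && b <= 0x7E;
    }

    // Splits every char whose high and low bytes are both printable ASCII
    // into the two Latin characters it was presumably merged from.
    // Assumption: the source document contains no real CJK text.
    public static String unmerge(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int hi = (c >> 8) & 0xFF; // e.g. 0x53 for 0x5365
            int lo = c & 0xFF;        // e.g. 0x65 for 0x5365
            if (isPrintableAscii(hi) && isPrintableAscii(lo)) {
                out.append((char) hi).append((char) lo);
            } else {
                out.append(c); // ordinary character, keep unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x5365 is the merged form of "S" (0x53) and "e" (0x65).
        System.out.println(unmerge("\u5365t"));
    }
}
```

Plain Latin characters have a zero high byte, so they pass through
unchanged; only the merged 2-byte values are split.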

My second issue is that, in the same document, a "?" is produced where there
should be 3, 4, 6, 7, 8, 9, ), * or %; see the example below. Can you give
me some hints on how to solve this? Many thanks.

In PDF:
TERM C1 EUR 591736DB6 LX038684 07-Jun-2016 Shadow Shadow 450.0 0.00 0.404
4.9040 0.00 0.00 462,025.59 462,025.59

Conversion:
07-Jun-201?TERM C1 EUR  462,025.5?Shadow Shadow  0.00 0.40? 450.0
0.00591736DB?  4.9040  0.00  462,025.5?LX03868?

Kind regards,
Yushuang

-- 

*Yushuang Hao*
Codean
King's Gate
1 Bravingtons Walk
London, N1 9AE, UK
yushuang.hao@codean.com

tel. +44 (0)20 3475 3548
mob. +44 (0)7973 816 879

www.codean.com
