pdfbox-users mailing list archives

From Stan Ioan-Eugen <stan.ieu...@gmail.com>
Subject help ReplaceString.java
Date Fri, 18 Dec 2009 23:09:31 GMT
Hello,

I'm having some difficulties using PDFBox. It does not behave as I expect,
and I can't tell what the problem is. I'm trying to build a PDF translation
app using a translation engine. The idea is: upload a PDF, click a button, get
the PDF translated. The problem is that PDFBox messes up the characters. I
tried the ReplaceString.java example on a Romanian newspaper PDF, trying to
replace a string. PDFBox seems to mess up the diacritics. After the
replacement, the newly created PDF file shows the following:

 ́„ instead of „
 ́” instead of ”
(the leading quote should not be there; Romanian quotation marks look like
„quoted text”)
^fi instead of î (i circumflex)
~ and another character which did not display (shown as an empty box)
instead of ă (a with breve).

If I replace string A with string B and string B contains diacritics, none of
string B's diacritics are displayed correctly. But the same diacritics, like
ă, ș and ț, in other parts of the document are displayed correctly, apart
from the exceptions above.

What can I do to get a correct PDF as output? My guess is that I have to
supply the correct characters myself because the PDF standard, AFAIK, does not
support Romanian diacritics (which are ă â î ș ț). The correct characters are:

Character  Unicode name                              Unicode code  Glyph name
Ă          Latin capital letter A with breve         0102          Abreve
ă          Latin small letter A with breve           0103          abreve
Â          Latin capital letter A with circumflex    00C2          Acircumflex
â          Latin small letter A with circumflex      00E2          acircumflex
Î          Latin capital letter I with circumflex    00CE          Icircumflex
î          Latin small letter I with circumflex      00EE          icircumflex
Ș          Latin capital letter S with comma below   0218          Scommaaccent
ș          Latin small letter S with comma below     0219          scommaaccent
Ț          Latin capital letter T with comma below   021A          uni021A
ț          Latin small letter T with comma below     021B          uni021B
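
To make the table above easier to reason about, here is a minimal Java sketch that reports which characters of a string cannot be encoded in ISO-8859-1. It uses only the standard library. That a single-byte encoding is involved in the broken output at all is an assumption on my part, not something I have confirmed in PDFBox, and the class and method names (Latin1Check, unencodable) are made up for illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Latin1Check {

    /** Returns "U+XXXX" for every code point in text that ISO-8859-1 cannot encode. */
    static List<String> unencodable(String text) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        List<String> result = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!latin1.canEncode(ch)) {
                result.add(String.format("U+%04X", cp));
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // The ten Romanian letters from the table, in table order.
        String romanian = "\u0102\u0103\u00C2\u00E2\u00CE\u00EE\u0218\u0219\u021A\u021B";
        System.out.println(unencodable(romanian));
    }
}
```

Of the letters in the table, only Â, â, Î and î (00C2, 00E2, 00CE, 00EE) fall inside ISO-8859-1; the breve and comma-below letters do not.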

Windows operating systems (up to and including Windows XP) use a wrong
mapping for Romanian characters by default, which is:

Character  Unicode name                              Unicode code  Glyph name
Ş          Latin capital letter S with cedilla       015E          Scedilla
ş          Latin small letter S with cedilla         015F          scedilla
Ţ          Latin capital letter T with cedilla       0162          uni0162
ţ          Latin small letter T with cedilla         0163          uni0163
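
The two tables can be combined into a small Java sketch that maps the cedilla forms to the comma-below forms before any text is processed. It uses only the standard library and says nothing about how PDFBox or the embedded fonts handle these characters. The class and method names (RomanianDiacritics, toCommaBelow) are hypothetical, and whether a given font actually contains the comma-below glyphs is a separate question this does not answer.

```java
final class RomanianDiacritics {

    /** Replaces S/T with cedilla by S/T with comma below, per the tables above. */
    static String toCommaBelow(String text) {
        return text
            .replace('\u015E', '\u0218')  // Ş -> Ș
            .replace('\u015F', '\u0219')  // ş -> ș
            .replace('\u0162', '\u021A')  // Ţ -> Ț
            .replace('\u0163', '\u021B'); // ţ -> ț
    }

    public static void main(String[] args) {
        // "şi ţară" written with the cedilla forms.
        String legacy = "\u015Fi \u0163ar\u0103";
        System.out.println(toCommaBelow(legacy));
    }
}
```

All other characters, including ă, â and î, pass through unchanged.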

I hope this can be done easily and that it is documented somewhere.
Thanks, and happy holidays!

-- 
-stan ioan-eugen
