pdfbox-users mailing list archives

From Thomas Fischer <fischer...@aon.at>
Subject Re: help ReplaceString.java
Date Mon, 21 Dec 2009 10:50:52 GMT
Hello Stan,

I'm trying to evaluate different options to extract text from PDF files, PDFBox (v.8) being
one of them.
My experience is:
– what you get depends on the way the PDF file was created (mine are TeX based, and there
is a wide variety),
– you may need to post-process the extracted text,
– PDFBox is not perfect, but among the best for this job.

Here is an example (not Romanian though, but similar I suppose):
Białynicki (Polish ł) is extracted correctly
Świȩcicka comes out as S´wie¸cicka
Here "´" and "¸" are the spacing (non-combining) equivalents of the combining diacritics;
each one appears after the letter it belongs to.
To get it right, one would have to replace the non-combining diacritics with their combining
counterparts across the whole text (and probably apply Unicode normalisation afterwards, to
merge each base-plus-mark combination into a single character). By the way, you might also
have to look out for ligatures (e.g. ff, ffi, fi, fl).
And beware: these are the best possible results I found. With other PDFs, you might lose diacritic
characters completely (both base and decoration), get the diacritic signs reversed (probably
only some of them), or scattered over the respective line with no reference to the decorated
character (you might have picked up one of those before your "„").
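
The post-processing described above could be sketched roughly as follows. This is only an illustration using the JDK's java.text.Normalizer, not anything PDFBox provides; the class name DiacriticRepair, its method names and the particular set of marks handled are my own assumptions. It also assumes the spacing mark directly follows its base letter, as in the S´wie¸cicka example, so it will not help with the reversed or scattered cases.

```java
import java.text.Normalizer;

// Hypothetical helper (not part of PDFBox): post-processes extracted text.
class DiacriticRepair {

    // Map a spacing diacritic to its combining counterpart.
    // Only a handful of marks are covered; the selection is an assumption.
    static char toCombining(char c) {
        switch (c) {
            case '\u00B4': return '\u0301'; // acute accent
            case '\u00B8': return '\u0327'; // cedilla
            case '\u00A8': return '\u0308'; // diaeresis
            case '\u02D8': return '\u0306'; // breve
            case '\u02DB': return '\u0328'; // ogonek
            case '\u02C7': return '\u030C'; // caron
            default:       return c;        // everything else is left alone
        }
    }

    // Replace spacing marks with combining marks, then compose with NFC so
    // that base letter + combining mark becomes one code point where possible.
    static String repair(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            sb.append(toCombining(extracted.charAt(i)));
        }
        return Normalizer.normalize(sb, Normalizer.Form.NFC);
    }

    // NFKC replaces compatibility characters such as the Latin ligatures
    // (U+FB00..U+FB04) with their plain-letter sequences. Note that it also
    // rewrites other compatibility characters, which may not be wanted.
    static String expandLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(repair("S\u00B4wie\u00B8cicka"));
        System.out.println(expandLigatures("e\uFB03cient \uFB01le"));
    }
}
```

A blanket replacement like this will also turn a spacing accent that was really meant as a standalone character into a combining mark on whatever precedes it, so it is probably best applied only to text known to come from such PDFs.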


On 19.12.2009 at 00:09, Stan Ioan-Eugen wrote:

> Hello,
> I'm having some difficulties using PDFBox. It does not behave as I expect
> and I can't identify the problem. I'm trying to build a PDF translation app using
> a translation engine. The idea is: upload a PDF, click a button, get the PDF
> translated. The problem is that PDFBox messes up the characters. I tried the
> ReplaceString.java application on a Romanian newspaper PDF, trying to replace
> a string. PDFBox seems to mess up the diacritics. After the replacement, the newly
> created PDF file shows the following:
> ́„ instead of „
> ́” instead of ”
> (the leading quote should not be there; Romanian quotation looks like „quoted
> text”)
> ^fi instead of î (i circumflex)
> ~ and another character which did not display (displayed as an empty box)
> instead of ă (a breve, I guess).
> -- 
> -stan ioan-eugen
