pdfbox-users mailing list archives

From Thomas Fischer <fischer...@aon.at>
Subject Re: help ReplaceString.java
Date Mon, 21 Dec 2009 18:26:51 GMT
Hi Villu,


> Hello there,
> 
>> To get it right one would have to use a general replacement of
>> non-combining into combining diacritics (and probably a normalisation
>> process for unicode to replace combinations by single characters).
>> By the way, you might also have to look out for ligatures
>> (e.g. ff ffi fi fl).
> 
> The need for text post-processing depends on the class you're using for the job.
> 
> Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
> all texts are filtered through
> org.apache.pdfbox.util.TextNormalize#normalizeDiac(String)/#normalizePres(String)
> before they are exposed to the application programmer via methods like
> PDFTextStripper#writeString(String). However, it must be borne in mind
> that TextNormalize relies on the external ICU4J dependency: if ICU4J is
> not properly installed, the original string is returned unchanged.
> 
> Other classes such as org.apache.pdfbox.pdfviewer.PageDrawer do not do
> it for you. For example, when overriding
> PageDrawer#processTextPosition(TextPosition) with the intent of
> capturing the text before it is painted, you must filter it through
> TextNormalize manually to get the "correct" characters.
> 
This is interesting. I use PDFBox as a command-line tool on my Mac:

    java org.apache.pdfbox.ExtractText -encoding UTF-8 file.pdf file.txt

Is there a way to activate some post-processing if I do it this way?
Or shouldn't it be included automatically?
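
To make concrete what kind of post-processing is meant, here is a minimal sketch that could be run over the extracted text afterwards. It is only an illustration under stated assumptions: it uses the JDK's own java.text.Normalizer (Java 6 and later), not PDFBox's TextNormalize or ICU4J, and the class and method names (NormalizeSketch, normalizeExtracted) are made up for this example. NFKC normalisation expands presentation-form ligatures such as "fi" and "ffi" into plain letters and composes a base letter plus combining diacritic into a single character. It does not attach a stand-alone (non-combining) accent to the preceding letter; that would need an extra replacement step.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: post-processes text that has
// already been extracted, using only the JDK's Unicode normalizer.
final class NormalizeSketch {

    // NFKC = compatibility decomposition followed by canonical composition.
    // Ligature code points (U+FB00..U+FB04 etc.) become plain letters, and
    // "e" + U+0301 (combining acute) becomes the single character U+00E9.
    static String normalizeExtracted(String extracted) {
        if (extracted == null) {
            return null;
        }
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "o" + U+FB03 (ffi ligature) + "ce"
        System.out.println(normalizeExtracted("o\uFB03ce"));
        // "cafe" + U+0301 (combining acute accent)
        System.out.println(normalizeExtracted("cafe\u0301"));
    }
}
```

Such a step could be applied to the contents of file.txt after extraction, independently of whether the extractor normalises anything itself.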

All the best
Thomas


