pdfbox-users mailing list archives

From Villu Ruusmann <villu.ruusm...@gmail.com>
Subject Re: help ReplaceString.java
Date Mon, 21 Dec 2009 16:56:18 GMT
Hello there,

> To get it right one would have to use a general replacement of non-combining into combining
> diacritics (and probably a normalisation process for unicode to replace combinations by single
> characters). By the way, you might also have to look out for ligatures (e.g. ff ffi fi fl).

The need for text post-processing depends on the class you're using for the job.

Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
all text is filtered through TextNormalize
before it is exposed to the application programmer via methods like
PDFTextStripper#writeString(String). Bear in mind, however,
that TextNormalize relies on the external ICU4J dependency - if ICU4J is
not properly installed, the original string is returned unchanged.
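
Because that fallback is silent, it can be worth checking up front whether
ICU4J is visible on the classpath at all. A minimal sketch follows; the
class Icu4jCheck and its method are hypothetical helpers, not part of
PDFBox, and probing for com.ibm.icu.text.Normalizer (a class that ICU4J
ships) is only an assumed proxy for "ICU4J is properly installed".

```java
// Hypothetical helper, not part of PDFBox: reports whether a class can be
// loaded, which serves here as a rough "is ICU4J on the classpath?" probe.
final class Icu4jCheck {

    // Returns true if the named class is loadable by this class's loader.
    // The class is looked up without being initialized.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, Icu4jCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the presence of this ICU4J class is a fair proxy for
        // ICU4J being installed; PDFBox itself may test something else.
        boolean found = isClassAvailable("com.ibm.icu.text.Normalizer");
        System.out.println(found
                ? "ICU4J found on the classpath"
                : "ICU4J not found; expect strings to come back unchanged");
    }
}
```

Running this with the same classpath as the application gives a quick hint
as to whether unnormalized output is caused by a missing jar.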

Other classes such as org.apache.pdfbox.pdfviewer.PageDrawer do not do
it for you. For example, when overriding
PageDrawer#processTextPosition(TextPosition) with the intent of
capturing the text before it is painted, you must filter it through
TextNormalize manually to get the "correct" characters.
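
To give an idea of what such a filtering step does to the characters, here
is a small sketch that uses only the JDK's java.text.Normalizer. It is not
PDFBox's TextNormalize and does not claim to reproduce its behaviour; the
class NormalizeSketch and the choice of the NFKC form are my assumptions
for illustration.

```java
import java.text.Normalizer;

// Hypothetical stand-in for a text post-processing step; uses the JDK
// normalizer, not PDFBox's TextNormalize.
final class NormalizeSketch {

    // NFKC applies compatibility decomposition (which splits ligatures
    // such as U+FB01 into plain letters) followed by canonical composition
    // (which merges a base letter and a combining mark into one character
    // where a precomposed form exists).
    static String normalize(String raw) {
        return Normalizer.normalize(raw, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01le";   // LATIN SMALL LIGATURE FI + "le"
        String combining = "e\u0301";   // "e" + COMBINING ACUTE ACCENT

        System.out.println(normalize(ligature));            // prints "file"
        System.out.println(normalize(combining).length());  // prints 1
    }
}
```

In an overridden PageDrawer#processTextPosition(TextPosition), a step of
this kind would be applied to the captured text before it is used further.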

