pdfbox-users mailing list archives

From Villu Ruusmann <villu.ruusm...@gmail.com>
Subject Re: help ReplaceString.java
Date Tue, 22 Dec 2009 08:33:20 GMT
Hello there,

>>
>> The need for text post-processing depends on the class you're using for the job.
>>
>> Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
>> all texts are filtered through
>> org.apache.pdfbox.util.TextNormalize#normalizeDiac(String)/#normalizePres(String)
>> before they are exposed to the application programmer via methods like
>> PDFTextStripper#writeString(String). However, it must be borne in mind
>> that TextNormalize relies on the external ICU4J dependency - if it is
>> not properly installed, then the original string is returned unchanged.
>>
> This is interesting. I use PDFBox as a command line tool on my Mac:
> java org.apache.pdfbox.ExtractText -encoding UTF-8 file.pdf file.txt
> Is there a way to activate some post-processing if I do it this way?
> Or shouldn't it be included automatically?
>

The command-line application org.apache.pdfbox.ExtractText uses class
org.apache.pdfbox.util.PDFTextStripper internally. So, in principle,
there shouldn't be any need for text post-processing if the ICU4J
dependency is properly installed.

Since the PDFBox JAR comes in many flavours, it is very hard for me to
tell whether you have everything set up correctly. I guess the easiest
solution would be to download the ICU4J 3.8 JAR manually and append it
to your command-line application's classpath. You can find the said JAR,
for example, here:
http://www.jarvana.com/jarvana/browse/com/ibm/icu/icu4j/3.8/
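
To see whether ICU4J is actually visible at runtime, a small probe along
these lines may help. This is only a minimal sketch: the class name
Icu4jCheck and the helper isClassAvailable are made up for illustration,
and I am assuming that com.ibm.icu.text.Normalizer (a class shipped in
the ICU4J JAR) is a reasonable stand-in for what TextNormalize needs.

```java
public class Icu4jCheck {

    // Returns true if the named class can be located on the current classpath.
    public static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, Icu4jCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // the class was found but could not be linked (e.g. a broken JAR)
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this ICU4J class is representative of the whole dependency.
        String probe = "com.ibm.icu.text.Normalizer";
        if (isClassAvailable(probe)) {
            System.out.println("ICU4J found on the classpath");
        } else {
            System.out.println("ICU4J NOT found on the classpath");
        }
    }
}
```

Compile it and run it with exactly the same classpath that you use for
org.apache.pdfbox.ExtractText. If it reports that ICU4J is not found,
then the text is most likely coming out un-normalized.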


VR
