pdfbox-users mailing list archives

From Thomas Fischer <fischer...@aon.at>
Subject Re: help ReplaceString.java
Date Wed, 23 Dec 2009 00:04:25 GMT
Hi Villu,

I added the ICU4J 3.8 JAR you recommended to my classpath, and it works very well
on standard PDF files.
Ligatures are resolved and diacritics come out correctly (although as combining character
sequences rather than precomposed characters) for files created with Acrobat Distiller.
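
If precomposed characters are preferred over combining sequences, the extracted text can be composed afterwards. The following is only a minimal sketch using the JDK's own java.text.Normalizer (NFC form); it is not the PDFBox TextNormalize code, and the class and method names ComposeDiacritics and compose are made up for illustration. It only helps where the output really contains a base letter followed by a combining mark; it cannot repair cases where the accent was lost or replaced by a space.

```java
import java.text.Normalizer;

class ComposeDiacritics {
    // Compose "base letter + combining mark" sequences into single
    // precomposed characters (Unicode normalization form NFC).
    static String compose(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'u' followed by U+0308 COMBINING DIAERESIS, as a text extractor might emit it
        String decomposed = "Zu\u0308rich";
        String composed = compose(decomposed);
        // The composed form is one char shorter: prints "7 -> 6"
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```

The same call could be applied line by line to the text file written by ExtractText.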
Files created with pdftex don't work as well: diacritics are lost completely (at
least in my test file). With dvipdfmx I get "FRÉDÉRIC" right, but also "D epartement
de Math ematiques" for "Département de Mathématiques" and "Z urich" for "Zürich" (the original
is in small capitals, which is probably what causes the problem).
One file that claims to have been created with "LaTeX with hyperref package" using "dvips + distiller"
crashes:
Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
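
Both the ICU4J JAR and the class named in the exception above are classpath questions, so a quick way to see what the JVM can actually find may help. This is a rough sketch, not part of PDFBox: the class name ClasspathCheck and its method are invented here, and com.ibm.icu.text.Normalizer is only assumed to be a representative ICU4J class (which class PDFBox itself loads is not confirmed by this thread). The BouncyCastle class name is taken from the stack trace.

```java
class ClasspathCheck {
    // Returns true if the named class can be located on the current classpath.
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "com.ibm.icu.text.Normalizer",                        // assumed ICU4J class
            "org.bouncycastle.jce.provider.BouncyCastleProvider"  // from the stack trace
        };
        for (String name : candidates) {
            System.out.println(name + ": " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath that is used for org.apache.pdfbox.ExtractText; a "MISSING" line means the corresponding JAR still has to be added.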

Thanks for the advice
Thomas


On 22.12.2009 at 09:33, Villu Ruusmann wrote:

> Hello there,
> 
>>> 
>>> The need for text post-processing depends on the class you're using for the job.
>>> 
>>> Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
>>> all texts are filtered through
>>> org.apache.pdfbox.util.TextNormalize#normalizeDiac(String)/#normalizePres(String)
>>> before they are exposed to the application programmer via methods like
>>> PDFTextStripper#writeString(String). However, it must be borne in mind
>>> that TextNormalize relies on the external ICU4J dependency: if it is not
>>> properly installed, the original string is returned unchanged.
>>> 
>> This is interesting. I use PDFBox as a command line tool on my Mac:
>> java org.apache.pdfbox.ExtractText -encoding UTF-8 file.pdf file.txt
>> Is there a way to activate some post-processing if I do it this way?
>> Or shouldn't it be included automatically?
>> 
> 
> The command-line application org.apache.pdfbox.ExtractText uses class
> org.apache.pdfbox.util.PDFTextStripper internally. So, in principle,
> there shouldn't be any need for text post-processing if the ICU4J
> dependency is properly installed.
> 
> Since the PDFBox JAR comes in many flavours, it is very hard for me to
> tell whether you have it all set up correctly. I guess the easiest solution
> would be to download the ICU4J 3.8 JAR manually and append it to your
> command-line application's classpath. You can find the said JAR, for
> example, here:
> http://www.jarvana.com/jarvana/browse/com/ibm/icu/icu4j/3.8/
> 
> 
> VR

