pdfbox-users mailing list archives

From A...@swmc.com
Subject Re: help ReplaceString.java
Date Wed, 23 Dec 2009 00:29:33 GMT
BouncyCastleProvider is the class which deals with PDF encryption.  I 
think you want bcprov-jdk14-132.jar and maybe bcmail-jdk14-132.jar to 
resolve that.  There's an ant command which will download these for you, 
but I don't remember what it is.  The other option is to find it online, 
download it and make sure it's in your classpath.
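To check whether the jars actually ended up on the classpath, a small probe along these lines may help. This is only a sketch: the ClasspathCheck class and its isAvailable helper are made-up names for illustration, and the two class names it looks for (org.bouncycastle.jce.provider.BouncyCastleProvider and com.ibm.icu.text.Normalizer) are assumed to be the ones of interest here; the truncated exception message does not say which class was missing.

```java
class ClasspathCheck {
    /** Returns true if the named class can be loaded from the current classpath. */
    public static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names for the BouncyCastle and ICU4J jars
        String[] names = {
            "org.bouncycastle.jce.provider.BouncyCastleProvider",
            "com.ibm.icu.text.Normalizer"
        };
        for (String name : names) {
            System.out.println(name + ": " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath you pass to ExtractText; any class reported as MISSING points to a jar that still needs to be added.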


Thomas Fischer <fischer.th@aon.at>
12/22/2009 16:05
Re: help ReplaceString.java

Hi Villu,

I installed your recommended version of ICU4J 3.8 JAR in my classpath and 
it works very well on standard PDF files.
Ligatures are resolved and diacritics are displayed correctly (although as 
combined characters) for files created with Acrobat Distiller.
Files created with pdftex don't work as well: diacritical characters are 
completely lost (at least for my test file); with dvipdfmx I get 
"FRÉDÉRIC" correct, but also "D epartement de Math ematiques" for 
"Département de Mathématiques" and "Z urich" for "Zürich" (the original is 
in small capitals, probably this creates problems).
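On the combined characters: if precomposed characters are preferred in the extracted text, the output could be run through Unicode NFC normalization afterwards. The following is a minimal sketch using the JDK's own java.text.Normalizer, not anything PDFBox does itself; ComposeDemo and compose are hypothetical names. It only helps where the combining mark is still present after the base letter, so it would probably not repair cases like "Z urich".

```java
import java.text.Normalizer;

class ComposeDemo {
    /** Converts base letter + combining mark sequences to precomposed characters (NFC). */
    public static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'u' followed by U+0308 COMBINING DIAERESIS, as a text extractor might emit it
        String combined = "Zu\u0308rich";
        String composed = compose(combined);
        // The composed form is one char shorter because u + U+0308 becomes U+00FC
        System.out.println(combined.length() + " -> " + composed.length());
    }
}
```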
One file claiming to be created with "LaTeX with hyperref package" using 
"dvips + distiller" crashes:
Exception in thread "main" java.lang.NoClassDefFoundError: 

Thanks for the advice

Am 22.12.2009 um 09:33 schrieb Villu Ruusmann:

> Hello there,
>>> The need for text post-processing depends on the class you're using
>>> for the job.
>>> Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
>>> all texts are filtered through
>>> before they are exposed to the application programmer via methods like
>>> PDFTextStripper#writeString(String). However, it must be borne in mind
>>> that TextNormalize relies on external ICU4J dependency - if it is not
>>> properly installed, then the original string is returned unchanged.
>> This is interesting. I use PDFBox as a command line tool on my Mac:
>> java org.apache.pdfbox.ExtractText -encoding UTF-8 file.pdf file.txt
>> Is there a way to activate some post-processing if I do it this way?
>> Or shouldn't it be included automatically?
> The command-line application org.apache.pdfbox.ExtractText uses class
> org.apache.pdfbox.util.PDFTextStripper internally. So, in principle,
> there shouldn't be any need for text post-processing if the ICU4J
> dependency is properly installed.
> Since PDFBox JAR comes in many flavours, it is very hard for me to
> tell if you have it all right or not. I guess the easiest solution
> would be to download ICU4J 3.8 JAR manually and append it to your
> command-line application's classpath. You can find the said JAR for
> example here:
> http://www.jarvana.com/jarvana/browse/com/ibm/icu/icu4j/3.8/
> VR
