lucene-java-user mailing list archives

From Paul <>
Subject experiences with PDF files
Date Tue, 23 Nov 2004 20:25:23 GMT
I have read a lot of mails about time-consuming PDF parsing and tried
a few solutions myself. My example PDF file has 181 pages in 1.5 MB
(mostly text, almost no graphics).
-with's toolkit it took 17m32s to parse and read its content
-after installing Ghostscript and ps2text / ps2ascii, my parsing failed
after page 54 and 2m51s because of irregular fonts
-installing XPDF and using its tool pdftotext, parsing completed after

My machine is a Celeron 1700 with VMware Workstation 3.2 (128 MB
assigned) and SuSE Linux 7.3.

I will parse my PDF files with XPDF, using something like
Runtime.getRuntime().exec("pdftotext -nopgbrk -raw "+pdfFileName+"
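
A slightly fuller sketch of the same idea, in case it helps someone. It
is only an assumption of how the call could be wrapped, not tested against
the setup above: the class and method names (PdfTextExtractor,
buildCommand, extractText) are made up for illustration, the "-enc UTF-8"
option and the "-" output argument (write to stdout) are additions that
are not in the command line above, and it needs Java 9 or later. Passing
the arguments as a list instead of one concatenated string keeps file
names with spaces intact.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper class; the name is for illustration only.
final class PdfTextExtractor {

    // Builds the pdftotext command line. -nopgbrk and -raw come from the
    // mail above; "-enc UTF-8" and the trailing "-" (write the text to
    // stdout) are assumptions added for this sketch.
    static List<String> buildCommand(String pdfFileName) {
        return List.of("pdftotext", "-nopgbrk", "-raw",
                       "-enc", "UTF-8", pdfFileName, "-");
    }

    // Runs pdftotext and returns the extracted text.
    // Assumes pdftotext is installed and on the PATH.
    static String extractText(String pdfFileName)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdfFileName));
        // Send pdftotext's error messages to our own stderr so the child
        // process cannot block on a full stderr pipe.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        String text;
        try (InputStream in = process.getInputStream()) {
            // Read everything before waiting, otherwise a large document
            // could fill the pipe buffer and stall the child process.
            text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode
                                  + " for " + pdfFileName);
        }
        return text;
    }
}
```

The returned string could then go into a Lucene field, e.g. by calling
PdfTextExtractor.extractText(pdfFileName) once per file before indexing.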


P.S. look at for links and tips
