lucene-java-user mailing list archives

From Paul <paul.fuehr...@gmail.com>
Subject experiences with PDF files
Date Tue, 23 Nov 2004 20:25:23 GMT
Hi,
I have read a lot of mails about time-consuming PDF parsing and tried
some solutions myself. My example PDF file has 181 pages in 1.5 MB
(mostly text, nearly no graphics).
-with pdfbox.org's toolkit it took 17m32s to parse and read its content
-after installing ghostscript and ps2text / ps2ascii, parsing failed
after page 54 and 2m51s because of irregular fonts
-after installing XPDF and using its tool pdftotext, parsing completed
in 7-10 seconds

My machine is a Celeron 1700 running VMware Workstation 3.2 (128 MB
assigned) with SuSE Linux 7.3.

I will parse my PDF files with xpdf, using something like
Runtime.getRuntime().exec(new String[] {"pdftotext", "-nopgbrk", "-raw", pdfFileName, txtFileName});

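A slightly fuller sketch of the same idea, in case it helps: it passes the arguments as a list (so file names with spaces survive), waits for pdftotext to finish, and checks the exit status before the text file is used. The class name PdfToText and the helper names buildCommand and extract are made up for illustration, and it assumes pdftotext from xpdf is on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names are illustrative, not part of xpdf or Lucene.
class PdfToText {

    // Builds the pdftotext command line with the same flags as above:
    // -nopgbrk drops page-break characters, -raw keeps content-stream order.
    static List<String> buildCommand(String pdfFileName, String txtFileName) {
        return Arrays.asList("pdftotext", "-nopgbrk", "-raw", pdfFileName, txtFileName);
    }

    // Runs pdftotext and blocks until it exits; throws if the exit status is non-zero.
    static void extract(String pdfFileName, String txtFileName)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfFileName, txtFileName))
                .inheritIO() // let pdftotext's warnings show up on our console
                .start();
        int exitStatus = process.waitFor();
        if (exitStatus != 0) {
            throw new IOException("pdftotext exited with status " + exitStatus);
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PdfToText input.pdf output.txt
        extract(args[0], args[1]);
    }
}
```

Once extract returns without an exception, the text file can be read and handed to the indexer like any other plain-text document.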

Paul

P.S. See http://www.jguru.com/faq/view.jsp?EID=1074237 for links and tips.
