lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki ...@getopt.org>
Subject Re: Index entire filesystem
Date Wed, 05 Nov 2003 12:41:45 GMT
Stefan Groschupf wrote:
> There is some ongoing work for nutch.org.
> May be we can bundle all work together?! <open source>
> Nutch has alraeady a java *.doc, *.pdf parser as well .
> 
> Stefan
> 

I can give you a short account of my experiences with various PDF 
parsers out there, for the purpose of PDF -> text conversion:

* JPedal: previous versions used LGPL, as opposed to GPL it uses now. In 
my view, this practically excludes JPedal from commercial applications 
(which seems to be the point of this change). The library was able to 
correctly parse ca. 70% of my PDF collection. It is also clear how to 
extend some of the core algorithms for text grouping so that it can 
handle more complex layouts. It has a nice image extractor, and 
rasterizer, which means you can easily provide a preview or create 
thumbnails. However, it's excruciatingly slow and memory hungry.

* Etymon PJ: GPL :-( I have little experience with this library, because 
it lacks higher-level API for e.g. text extraction, and this task is 
non-trivial... There is also a commercial version, but I don't know the 
details.

* PDFBox: by far the best Java parser around, IMHO. It doesn't handle 
graphics at all (which might be a disadvantage for some), but does a 
very good job on text objects. It was able to correctly handle ca. 90% 
of my PDFs - the ones that failed either used font subsetting or were 
not quite up to the PDF spec. It can handle encrypted documents as well. 
However, it is relatively slow and memory hungry.

* Xpdf: not a Java library at all but native executable / library, and 
also GPL - however, as I understand GPL you are allowed to bundle it 
with your Java application so long as it is a separate executable (so 
using JNI is probably out of question...). It is my favorite solution 
right now for PDF-text conversion - it's very fast, very accurate and 
able to handle even multi-column layouts. Its Pdf-HTML converter is 
breath-taking, you have to see it to believe. For large PDFs (around 
10MB) PDFBox would take ca. 120 sec for processing, and Xpdf took 
(including overheads for forking and reading from stdout) ca. 10 secs.

So, as usually there is no ideal solution ... but rather than start yet 
another PDF parser effort I would encourage you to join efforts with 
e.g. PDFBox project.

Just my 0.02 PLN on the subject :-)

Regarding the format detection and conversion, I'm still looking for 
Java implementation of file(1) ... For now, I'm using the following 
steps to determine the file format:

* by extension - but you can burn your fingers with this. I had a case 
of a broken web ripper that saved everything (html, images, zips etc) as 
*.asp files, and obviously my application fell flat when it tried to 
swallow such files...

* by a subset of tests borrowed from file(1).

* by using fall-back converters when the primary converter fails. 
Obviously quite expensive. Fall-back converters register themselves for 
various formats they can handle to varying degrees (including Java 
implementation of strings(1) command as the last resort).

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message