lucene-java-user mailing list archives

From Ben Litchfield <...@csh.rit.edu>
Subject Re: Need advice: what pdf lib to use?
Date Fri, 22 Oct 2004 12:40:24 GMT

Please post any PDFBox issues you notice on the PDFBox SourceForge bug
list and, if possible, attach or email any problem PDFs that you encounter.

There are some efforts underway to improve the speed of PDFBox; you can
monitor the progress at
http://sourceforge.net/tracker/index.php?func=detail&aid=1046300&group_id=78314&atid=552832

As for other suggestions, I know some people have used xpdf (open
source, but not Java) to extract the text.
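
For anyone trying the xpdf route from a Java indexer, here is a minimal
sketch of calling xpdf's pdftotext tool as an external process with a time
limit. Running the parser outside the JVM means a PDF that hangs or
exhausts memory only takes down the child process. The class and method
names (PdfToTextRunner, buildCommand, extract) are made up for
illustration, the code assumes pdftotext is on the PATH, and it uses
Java 9+ APIs, so treat it as a starting point rather than a tested
integration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs xpdf's pdftotext in a separate process.
class PdfToTextRunner {

    /** Builds the command line: pdftotext -enc UTF-8 <pdf> <txt>. */
    static List<String> buildCommand(String executable, Path pdf, Path txt) {
        return List.of(executable, "-enc", "UTF-8", pdf.toString(), txt.toString());
    }

    /**
     * Extracts the text of one PDF, giving up after timeoutSeconds.
     * Output goes to a temporary file rather than a pipe, so a large
     * document cannot block the child on a full pipe buffer.
     */
    static String extract(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path txt = Files.createTempFile("pdftotext-", ".txt");
        try {
            Process p = new ProcessBuilder(buildCommand("pdftotext", pdf, txt))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // A file that sends the parser into an endless loop is killed here.
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("pdftotext timed out on " + pdf);
            }
            if (p.exitValue() != 0) {
                throw new IOException(
                        "pdftotext exited with " + p.exitValue() + " for " + pdf);
            }
            return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt);
        }
    }
}
```

A caller would use something like
PdfToTextRunner.extract(Paths.get("report.pdf"), 60), catch the
IOException, log the offending file, and move on to the next document
instead of stopping the whole indexing run.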

For other Java solutions:
PDFTextStream (commercial) - "Fastest PDF-to-Text Solution for Java"
http://snowtide.com/home/PDFTextStream/

Etymon PJ (GPL)
http://www.etymon.com/

Ben
http://www.pdfbox.org



On Fri, 22 Oct 2004 iouli.golovatyi@group.novartis.com wrote:

> Hello all,
>
> I need some advice/experience.
>
> What PDF parser (written in Java) would you recommend?
>
> I have been playing with PDFBox-0.6.7a and would not say I was too
> satisfied with it.
>
> On certain PDFs (not well formatted, but still readable with Acrobat) it
> ran into an infinite loop (this I could fix in code),
> and on one file it produced an "out of memory" error and killed the JVM :(
> (this problem I have not been able to identify yet).
>
> On top of that, the performance was not great either: it took c. 19 h to
> index 13000 files (c. 3.5 GB).
>
> Regards,
> J.
>
>
>
