lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From iouli.golova...@group.novartis.com
Subject Re: Need advice: what pdf lib to use?
Date Mon, 25 Oct 2004 08:01:57 GMT
Ben,
many thanks for your complrehensive answer. Unfourtunatly I can not send 
the problem pdfs cause they are the property of company and are of top 
secrecy:)

Regards,
J.




Ben Litchfield <ben@csh.rit.edu>
22.10.2004 14:40
Please respond to "Lucene Users List"

 
        To:     Lucene Users List <lucene-user@jakarta.apache.org>
        cc:     (bcc: Iouli Golovatyi/X/GP/Novartis)
        Subject:        Re: Need advice: what pdf lib to use?
        Category: 




Please post any PDFBox issues you notice on the PDFBox sourceforge bug
list, if possible attach/email any problem PDFs that you encounter.

There are some efforts underway to improve the speed of PDFBox, you can
monitor the progress at
http://sourceforge.net/tracker/index.php?func=detail&aid=1046300&group_id=78314&atid=552832

As for other suggestions, I know some people have utilized xpdf(open
source but non Java) to extract the text.

For other Java solutions
PDFTextStream(commercial) - "Fastest PDF-to-Text Solution for Java"
http://snowtide.com/home/PDFTextStream/

Etymon PJ (GPL)
http://www.etymon.com/

Ben
http://www.pdfbox.org



On Fri, 22 Oct 2004 iouli.golovatyi@group.novartis.com wrote:

> Hello all,
>
> I need a piece of advice/experience..
>
> What pdf parser (written in java) u'd recommend?
>
> I played now with PDFBox-0.6.7a and would not say I was satisfied too 
much
> with it
>
> On certain pdf's (not well formated but anyway readable with acrobate) 
it
> run into dead loop (this I could fix in code),
> and on one file it produced "out of memory error" and killed jvm:( (this
> problem I could not identify yet)
>
> After all the performance was not too great as well: it took c. 19 h. to
> index 13000 files (c. 3.5Gb)
>
> Regards,
> J.
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org




Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message