lucene-java-user mailing list archives

From iouli.golova...@group.novartis.com
Subject Re: Need advice: what pdf lib to use?
Date Mon, 25 Oct 2004 08:16:11 GMT
PDFBox also stumbles with "class java.io.IOException with message:  - You 
do not have permission to extract text" when the doc is copy/print 
protected.
I have now tested the commercial Snowtide product, and it looks like it can 
process these files as well. Performance was also not bad. Unfortunately, 
the test result cannot be considered 100% conclusive, because the free 
version processes just the first 8 pages. That said, this product costs a 
fortune (as long as the company is ready to pay, I don't really mind :))

J.





Robert Newson <rnewson@connected.com>
Sent by: news <news@sea.gmane.org>
24.10.2004 17:44
Please respond to "Lucene Users List"

 
        To:     lucene-user@jakarta.apache.org
        cc:     (bcc: Iouli Golovatyi/X/GP/Novartis)
        Subject:        Re: Need advice: what pdf lib to use?
        Category: 



iouli.golovatyi@group.novartis.com wrote:
> Hello all,
> 
> I need a piece of advice/experience..
> 
> What pdf parser (written in java) u'd recommend?
> 
> I have now played with PDFBox-0.6.7a and would not say I was too
> satisfied with it.
> 
> On certain PDFs (not well formatted, but still readable with Acrobat) it
> ran into a dead loop (this I could fix in code),
> and on one file it produced an "out of memory error" and killed the JVM :(
> (this problem I could not identify yet).
> 
> The performance was not too great either: it took c. 19 h to
> index 13000 files (c. 3.5Gb)
> 
> Regards,
> J.
> 
> 
> 

On the specific problem of the "dead loop", I reported an instance of 
this to Ben a week or so ago and he has fixed it in the latest 
nightlies.  I expect an official release will include this bugfix soon. 
The file in question was unreadable with any PDF software I have, but 
someone managed to create it somehow...

http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832

I've found pdfbox to be pretty good. The only time I get problems is 
with corrupted or egregiously bad PDF files.

B.
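
One way to keep a single bad file from stalling or killing a whole indexing run, as with the "dead loop" and "out of memory" cases described above, is to run each extraction under a time limit and skip the file on failure. The following is only a minimal sketch under stated assumptions: the class and method names (GuardedExtraction, extractWithTimeout) are hypothetical, the extraction itself is passed in as a plain Callable rather than calling any real PDFBox API, and returning null to mean "skip this file" is an illustrative choice.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of PDFBox or Lucene.
class GuardedExtraction {

    /**
     * Runs a text-extraction task with a time limit.
     * Returns the extracted text, or null if the task timed out or failed
     * (including with an Error such as OutOfMemoryError thrown inside the task),
     * so that the caller can log the file and move on.
     */
    static String extractWithTimeout(Callable<String> task, long timeoutMillis) {
        // Daemon thread: a task that never finishes cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-extract");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop; this only helps if the task reacts to interrupts.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The task threw an exception or error (e.g. IOException on a protected file).
            return null;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for the caller and give up on this file.
            Thread.currentThread().interrupt();
            return null;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap whatever extraction code it uses in the Callable and treat a null result as "could not index this file". Note that a parser spinning in a tight loop that ignores interrupts keeps consuming CPU on its daemon thread after the timeout; isolating extraction in a separate process is the more robust option in that case.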





