lucene-java-user mailing list archives

From Ben Litchfield <...@csh.rit.edu>
Subject RE: PDFBox PDFExtractor
Date Mon, 12 Sep 2005 16:33:47 GMT

Text extraction from PDF documents is a fairly complex problem that requires a
delicate balance among speed, memory use, accuracy, and other factors.

How are you measuring your memory usage?

In my opinion your two viable options are PDFBox (directly or via Slide's
PDFExtractor) and PDFTextStream.  They both integrate with Lucene fairly
easily.

I strongly suggest you run some tests against your own set of PDF documents.
A new version of PDFBox was released this weekend and includes some
improvements in speed and memory use.
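
As a rough illustration of such a test, here is a minimal Java sketch that
times one extraction call and reports the approximate change in used heap.
The names ExtractionProbe, measure and Result are hypothetical, and the
extractor is passed in as a Supplier<String> so that no particular PDFBox or
PDFTextStream API is assumed. The heap figure is the difference in used heap
before and after the call, not the peak during it, and System.gc() is only a
hint to the JVM, so treat the numbers as indicative.

```java
import java.util.function.Supplier;

public class ExtractionProbe {

    /** Outcome of one measured run (hypothetical helper type). */
    public static final class Result {
        public final long elapsedMillis;   // wall-clock time of the call
        public final long heapDeltaBytes;  // used heap after minus used heap before
        public final int textLength;       // characters returned by the extractor

        Result(long elapsedMillis, long heapDeltaBytes, int textLength) {
            this.elapsedMillis = elapsedMillis;
            this.heapDeltaBytes = heapDeltaBytes;
            this.textLength = textLength;
        }
    }

    /** Heap currently in use, as reported by the JVM. */
    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Runs the supplied extractor once and records time and heap change.
     * The extractor is an assumption: plug in whatever call produces text
     * from a PDF in your own setup.
     */
    public static Result measure(Supplier<String> extractor) {
        System.gc(); // a hint only; the JVM may ignore it
        long heapBefore = usedHeap();
        long start = System.nanoTime();

        String text = extractor.get();

        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        long heapAfter = usedHeap();
        int length = (text == null) ? 0 : text.length();
        return new Result(elapsedMillis, heapAfter - heapBefore, length);
    }

    public static void main(String[] args) {
        // Stand-in extractor; replace with a real extraction call.
        Result r = measure(() -> "placeholder text");
        System.out.println("elapsed ms: " + r.elapsedMillis);
        System.out.println("heap delta bytes: " + r.heapDeltaBytes);
        System.out.println("text length: " + r.textLength);
    }
}
```

Running the same probe over a representative sample of your own PDFs, once
per extraction library, gives a like-for-like comparison. For peak memory a
profiler or the JVM's GC logging is a better tool than this sketch.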

Ben Litchfield
PDFBox
http://www.pdfbox.org/


On Mon, 12 Sep 2005 Rod.Madden@ferguson.com wrote:

> Thanks for the reply, Jeroen. Does anyone have any
> experience or comments regarding the use of PDFTextStream
> versus PDFExtractor for working with PDF files? The
> issue for us is that there appears to be very high
> memory usage when we work with PDFs using PDFExtractor.
>
> I have heard that PDFTextStream may be a better solution.
>
> Rod
>
> -----Original Message-----
> From: Jeroen Reijn [mailto:j.reijn@hippo.nl]
> Sent: Monday, September 12, 2005 11:58 AM
> To: java-user@lucene.apache.org
> Subject: Re: PDFBox PDFExtractor
>
> Hi Rod,
>
> PDFBox is a separate project. The PDFExtractor in Jakarta Slide uses
> PDFBox's functionality to extract the information from the .pdf file.
>
> Hope this answers your question.
>
> Jeroen
>
>
> Rod.Madden@ferguson.com wrote:
> > Hi,
> >
> > I am new to Lucene and looking at some existing Lucene code....
> >
> > I am confused about the relationship (if any) between
> > org.apache.slide.extractor.PDFExtractor methods and org.PDFBox.cos
> > methods for the purposes of working with PDF files.
> >
> > I have found info on the web regarding PDFBox, however, I have found
> > little regarding .PDFExtractor.
> >
> > I am curious since we are having some issues with indexing PDF files
> > and I am wondering if PDFExtractor implements PDFBox or if it is a
> > separate utility set.
> >
> > Rod.
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


