lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Litchfield <...@csh.rit.edu>
Subject Re: PDF Text Stripper
Date Tue, 09 Jul 2002 15:07:23 GMT
Can you send me the PDF document that you are having problems with and I
will look into it.

There are still some issues that I am working out with the spacing of
characters.
-Ben



On Tue, 9 Jul 2002, Keith Gunn wrote:

> On Tue, 9 Jul 2002, Ben Litchfield wrote:
>
> > Hi,
> >
> > I have written a PDF library that can be used to strip text from PDF
> > documents.  It is released under LGPL so have fun.
> >
> > There is one class which can be used to easily index PDF documents.
> > pdfparser.searchengine.lucene.LucenePDFDocument  has a getDocument
> > method which will take a PDF file and return a Lucene Document which you
> > can add to an index.
> >
> > If you would like to see the quality of the text extraction you can run
> > pdfparser.Main from the command line which will take a PDF document and
> > write a txt file.
> >
> > I am looking for any input that you might have.  Please mail me if you
> > have any bugs or feature requests.
> >
> > The library can be retrieved from
> > http://www.csh.rit.edu/~ben/projects/pdfparser/
> >
> > -Ben Litchfield
>
> hi,
>
> I downloaded the zip and quickly ran the demo on a few files, it displays
> .notdef between words and there are spaces between every letter for words,
> is there code in your dist. to remove these so that just terms remain?
>
> Keith Gunn
> University Of Aberdeen
>
>
>
> --
> To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
>

-- 


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message