lucene-dev mailing list archives

From "Nestel, Frank" <frank.nes...@coi.de>
Subject AW: PDF parser for Lucene
Date Tue, 06 Nov 2001 09:24:53 GMT

Websearch does a very quick and very dirty job: it searches more
or less heuristically for text in a PDF.
It fails for UTF-8 encoded fields and for any kind of text
that is not where the heuristics expect it.
On the other hand, it is surprising how far you get with such
a simple method.
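
To make "searching heuristically for text" concrete, here is a minimal sketch of that kind of approach. It is not Websearch's actual code; the class name HeuristicPdfText and the choice of pattern are assumptions made for illustration. It treats the file as raw bytes and picks out literal strings that are directly followed by the PDF show-text operator Tj. It finds nothing in compressed streams, in TJ arrays, or in hex strings, which is exactly the sort of text that "is not where the heuristics expect it".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class HeuristicPdfText {
    // A literal string "(...)" followed by the Tj operator.
    // Backslash escapes are allowed inside the string;
    // nested unescaped parentheses are not handled.
    private static final Pattern SHOW_TEXT =
        Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    // Returns every string found in front of a Tj operator, in file order.
    static List<String> extract(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to one char, so no input is rejected.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<String> found = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(raw);
        while (m.find()) {
            found.add(unescape(m.group(1)));
        }
        return found;
    }

    // Resolves the simple backslash escapes; octal escapes are not decoded.
    private static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                switch (next) {
                    case 'n': sb.append('\n'); break;
                    case 'r': sb.append('\r'); break;
                    case 't': sb.append('\t'); break;
                    default:  sb.append(next); // covers \( \) and \\
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Calling HeuristicPdfText.extract(bytes) on the bytes of an uncompressed PDF returns the strings it found; the caller can join them and hand the result to the indexer as a single text field.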

There is a free library at http://www.etymon.com/pj/.
It seems to have minor problems, but is fairly robust. However,
it uses a lot of CPU and memory, since it builds a proper internal
representation of a PDF. It can also manipulate PDFs.

The problem is that the PDF format does not store the encoding of
certain fields at all. We even had test PDFs where
the tools provided by Adobe failed to extract the complete
textual content. For me this was rather disappointing about
the PDF format. But of course it is already out there, so one has
to handle it ...
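
When the encoding of a field is not recorded, the reader has to guess. The following is a minimal sketch of one common guess, not something taken from either library mentioned here: try a strict UTF-8 decode first and fall back to ISO-8859-1 if the bytes are not valid UTF-8. The class name EncodingGuess and the choice of fallback charset are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class EncodingGuess {
    // Decodes as UTF-8 if the bytes are well-formed UTF-8,
    // otherwise as ISO-8859-1 (an assumed fallback).
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: every byte maps to exactly one char here.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

This is still a heuristic: a short ISO-8859-1 string can happen to be valid UTF-8, and then it is decoded wrongly without any error.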

We implemented both of the above libraries but use the one from
Websearch right now. This makes some documents hard to
find.

Sigh,
Frank



> -----Original Message-----
> From: Paulo Gaspar [mailto:paulo.gaspar@krankikom.de]
> Sent: Friday, 2 November 2001 18:26
> To: Lucene Developers List
> Subject: RE: PDF parser for Lucene
> 
> You could try this one:
>   http://www.i2a.com/websearch/
> 
> ...and then tell me how it works for you.
> =:o)
> 
> 
> Anyway, it is simple and Open Source.
> 
> 
> Have fun,
> Paulo Gaspar
> 
> http://www.krankikom.de
> http://www.ruhronline.de
> 
> 
> 
> > -----Original Message-----
> > From: Benoît Doret [mailto:benoit.doret@infodesign.com]
> > Sent: Friday, November 02, 2001 5:00 PM
> > To: lucene-dev@jakarta.apache.org
> > Subject: PDF parser for Lucene
> >
> >
> > Hello,
> >
> > Does a fully integrated PDF parser already exist for Lucene?
> > (some PDFDocument and PDFParser classes would be great!)
> > Does someone sell one, or is it released as open source?
> > If not, has anyone already used an external Java library to
> > parse PDF files, and which one?
> >
> > Thanks in advance,
> > Benoît Doret.
> >
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> > <mailto:lucene-dev-help@jakarta.apache.org>
> >
> 
> 

