lucene-java-user mailing list archives

From "Kelvin Tan" <>
Subject Re: indexing PDF files
Date Sat, 04 May 2002 08:28:29 GMT
You might want to take a look at WebSearch. It
has an _ok_ system going with respect to PDFs. PDFGo supports viewing PDFs,
but a guy I contacted there says there's currently no support for text
extraction, though he's "planning to do it".

Definitely agreed on the PJ resources bit. It doesn't really scale well in
terms of PDF file size.

If you haven't already seen the post, I once did a cursory examination of
the options for extracting text from PDF files via Java and the limitations
of the approaches.

The Etymon lib is GPL'ed, so I guess that's a nice place to start. As for
the libs I've seen so far, most of them are really concerned with the
display and manipulation of PDF pages. Since we're looking for something
less complex (i.e. text extraction), maybe it's not so bad. I've spent a bit
of time in this area before, so feel free to email me offline about this. Not
sure how much help I can be, though.

----- Original Message -----
From: "petite_abeille" <>
To: "Lucene Users List" <>
Sent: Friday, May 03, 2002 10:57 PM
Subject: Re: indexing PDF files

> On Friday, May 3, 2002, at 03:16 PM, Moturu,Praveen wrote:
> > Can I assume none of the people on the lucene user group have
> > implemented indexing a pdf document using lucene.
> Who knows...?!? In any case, it's not public knowledge...
> >  If someone has, please help me by providing the solution.
> I used to believe in Santa Claus also... ;-)
> All that said, there seems to be a real demand to do something about pdf
> to text conversion (in java preferably). I'm willing to invest some time
> and brain cells to nail it down, but I'm not sure where to start...
> I'm aware of the PJ library, but it's really a pig as far as resources
> go. Anything else?
> Any (concrete) pointer appreciated.
> Thanks.
> PA.
> --
> To unsubscribe, e-mail:
> For additional commands, e-mail: