lucene-java-user mailing list archives

From "CNew" <c...@fuse.net>
Subject Re: indexing PDF files
Date Mon, 06 May 2002 00:59:46 GMT
I think most of the PDF creation knowledge using Java resides in the iText
and FOP projects.

Both are open source.

It would seem that Java PDF-writing code would be a good place to start on
Java PDF-reading code.

Just a thought.

----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Saturday, May 04, 2002 1:28 AM
Subject: Re: indexing PDF files


> You might want to take a look at WebSearch http://www.i2a.com/websearch/.
> It has an _ok_ system going with respect to PDFs. PDFGo supports viewing of
> PDFs, but a guy I contacted there says there's no current support for text
> extraction, though he's "planning to do it".
>
> Definitely agreed on the PJ resources bit. It doesn't really scale well in
> terms of PDF file size.
>
> If you haven't already seen the post, I once did a cursory examination of
> the options for extracting text from PDF files via Java and the limitations
> of the approaches.
> http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00280.html
>
> The Etymon lib is GPL'ed, so I guess that's a nice place to start. As far as
> the libs I've seen so far go, most of them are really concerned with the
> display and manipulation of PDF pages. Since we're looking for something
> less complex (i.e. text extraction), maybe it's not so bad. I've spent a bit
> of time in this area before, so feel free to email me offline about this.
> Not sure how much help I can be though.
>
> ----- Original Message -----
> From: "petite_abeille" <petite_abeille@mac.com>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Friday, May 03, 2002 10:57 PM
> Subject: Re: indexing PDF files
>
>
> > On Friday, May 3, 2002, at 03:16 PM, Moturu,Praveen wrote:
> >
> > > Can I assume none of the people on the lucene user group have
> > > implemented indexing a pdf document using lucene.
> >
> > Who knows...?!? In any case, it's not public knowledge...
> >
> > > If someone has, please help me by providing the solution.
> >
> > I used to believe in Santa Claus also... ;-)
> >
> > All that said, there seems to be a real demand to do something about pdf
> > to text conversion (in java preferably). I'm willing to invest some time
> > and brain cells to nail it down, but I'm not sure where to start...
> >
> > I'm aware of the PJ library, but it's really a pig as far as resources
> > go. Anything else?
> >
> > Any (concrete) pointer appreciated.
> >
> > Thanks.
> >
> > PA.
> >
> >
> > --
> > To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
> >



