lucene-java-user mailing list archives

From "W. Eliot Kimber" <>
Subject Re: indexing and searching different file formats
Date Thu, 14 Feb 2002 17:10:51 GMT
Andrew Libby wrote:

> and the text needs to be retrieved for indexing.  An extreme example is
> a PDF which has a considerably complicated document format.

The PJ library from provides a pretty complete and
easy-to-use API for getting info from PDF docs. It wouldn't be too hard
to write a PDF indexer for Lucene using this library. The main challenge
would be guessing word boundaries in strings where spaces have been
replaced with explicit shift values by the formatter.
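
To make the word-boundary problem concrete, here is a rough sketch of one
possible heuristic. It does not use the PJ API at all: it assumes the elements
of a PDF TJ array have already been extracted into a plain list of strings and
numbers, and the class name, method name, and threshold are all made up for
illustration. In a TJ array, a negative number moves the next string to the
right, so a sufficiently large negative shift probably stands in for a space.
The -200 cutoff (thousandths of a text space unit) is only a guess and would
need tuning per font and per formatter.

```java
import java.util.Arrays;
import java.util.List;

public class TjWordBoundaries {
    // Hypothetical cutoff: shifts at or below this value are treated as word gaps.
    // Units are thousandths of a text space unit; the value is an assumption.
    static final double GAP_THRESHOLD = -200.0;

    // Joins the string elements of a TJ-style array, inserting a space wherever
    // a numeric shift is large enough to look like a word gap.
    static String joinWithGuessedSpaces(List<Object> tjElements) {
        StringBuilder text = new StringBuilder();
        for (Object element : tjElements) {
            if (element instanceof String) {
                text.append((String) element);
            } else if (element instanceof Number) {
                double shift = ((Number) element).doubleValue();
                boolean endsWithSpace =
                    text.length() > 0 && text.charAt(text.length() - 1) == ' ';
                // Skip leading gaps and avoid doubling an existing space.
                if (shift <= GAP_THRESHOLD && text.length() > 0 && !endsWithSpace) {
                    text.append(' ');
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Small shifts are ordinary kerning; -280 is treated as a word gap.
        List<Object> tj = Arrays.<Object>asList("Lu", -15, "cene", -280, "in", 20, "dex");
        System.out.println(joinWithGuessedSpaces(tj)); // prints "Lucene index"
    }
}
```

The resulting string could then be handed to a Lucene analyzer like any other
text. A fixed cutoff will misjudge tightly or loosely set text, so comparing
the shift against the font's space width would likely be more robust.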


W. Eliot Kimber,
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139
