lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "W. Eliot Kimber" <el...@isogen.com>
Subject Re: indexing and searching different file formats
Date Thu, 14 Feb 2002 17:10:51 GMT
Andrew Libby wrote:

> and the text needs to be retrieved for indexing.  An extreeme example is
> a PDF which has a considerably complicated document format.

The PJ library from www.etymon.com provides a pretty complete and
easy-to-use API for getting info from PDF docs. It wouldn't be too hard
to write a PDF indexer for Lucene using this library. The main challenge
would be guessing word boundaries in strings where spaces have been
replaced with explicit shift values by the formatter.

Cheers,

Eliot
-- 
W. Eliot Kimber, eliot@isogen.com
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message