lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: indexing PDF files
Date Wed, 01 May 2002 15:41:56 GMT
> > Hm, this should be a FAQ.
> 
> Maybe it should... ;-)

It is now.

> > Check Lucene contributions page, there are some starting points
> there,
> 
> Well, this seems to be a very popular request... In fact I need 
> something like that also. Unfortunately, there seems to be no 
> authoritative answer as far as converting pdf files to text in a pure
> 
> Java environment... Maybe I'm missing something here as usual?
> 
> Also, on a related note, what would be a good approach to convert any
> 
> random document into pdf? I was thinking to have a two steps process
> for 
> document indexing in Lucene:
> 
> - First, convert everything to pdf (with Acrobat or something)
> - Second, convert pdf to text and index it.
> 
> Any practical suggestions about how to do that in a pure Java 
> environment very welcome.

Wouldn't you want to convert to XML instead and use XSLT to transform
the XML representation to any desired format by just applying a style
sheet?
Sounds like less work with bigger document type coverage.

Otis


__________________________________________________
Do You Yahoo!?
Yahoo! Health - your guide to health and wellness
http://health.yahoo.com

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message