lucene-java-user mailing list archives

From "Krovi, DVSR_Sarma" <>
Subject RE: Lucene Help
Date Thu, 13 Apr 2006 08:00:05 GMT
You can use text extractors for the document formats you mentioned.
Lucene itself does not handle text extraction.
These are the extractors we generally use:
PDF 		-> PDFBox: Java API to read PDF documents
WORD		-> Antiword:
TXT		-> You can read the content using the Java IO classes and
index it.
MSG		-> We currently use the strings utility on Solaris, which
reads printable characters from files.
XLS		-> Apache POI has classes for reading Excel files, so
you can use that.
PPT/PPS	-> Apache POI's PowerPointExtractor
RTF		-> Java Swing has RTFEditorKit, which we use to read RTF
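
As a rough illustration of the TXT and RTF entries above, here is a minimal
sketch that uses only JDK classes. The class name TextExtractors and its
method names are hypothetical, chosen for this example; UTF-8 is assumed for
plain-text files. The returned strings are what you would then hand to
Lucene for indexing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class: names are illustrative, not part of Lucene.
class TextExtractors {

    // TXT: read the whole file with plain Java IO (UTF-8 is an assumption).
    static String readPlainText(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // RTF: let Swing's RTFEditorKit parse the stream into a Document,
    // then pull the unformatted text back out of that Document.
    static String readRtf(InputStream in) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(in, doc, 0);
        return doc.getText(0, doc.getLength());
    }
}
```

For example, TextExtractors.readRtf(new FileInputStream("notes.rtf")) would
return the text of a file named notes.rtf (a made-up file name) with the
RTF markup stripped.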


-----Original Message-----
From: Shajahan [] 
Sent: Thursday, April 13, 2006 1:19 PM
Subject: Lucene Help

Hi all,

I am new to Lucene. I want to index PDF, Word, and TXT files. Can
anyone tell me how to do indexing with Lucene? Please give some information.

Thanking you

To unsubscribe, e-mail:
For additional commands, e-mail:
