lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Krovi, DVSR_Sarma" <DVSR.Sarma.Kr...@deshaw.com>
Subject RE: Lucene Help
Date Thu, 13 Apr 2006 08:00:05 GMT
You can use text extractors for the document formats you mentioned.
Lucene as such does not deal with this text extraction process.
Following are the extractors we generally use:
PDF 		-> PDFBox: Java API to read PDF documents
http://www.pdfbox.org.
WORD		-> Antiword: http://www.winfield.demon.nl/
TXT		-> You can read the content using Java IO classes and
index them.
MSG		-> We currently using strings utility in Solaris that
reads printable characters from files.
XLS		-> Apache POI utils has classes to read Excel files. so
you can use that.
PPT/PPS	-> Apache POI's PowerPointExtractor
RTF		-> Java Swing has RTFEditorKit which we use to read RTF
documents.

Krovi.

-----Original Message-----
From: Shajahan [mailto:shaikshajahansha@yahoo.co.in] 
Sent: Thursday, April 13, 2006 1:19 PM
To: java-user@lucene.apache.org
Subject: Lucene Help



Hi all,

i am new to Lucene. i want to work indexing for PDF,word,txt files. can
any
one tell me how to dun indexing by Lucene. please give some informetion.

Thanking you
shaik
--
View this message in context:
http://www.nabble.com/Lucene-Help-t1442764.html#a3896122
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message