lucene-java-user mailing list archives

From Sergiu Gordea <gser...@ifit.uni-klu.ac.at>
Subject Index MSOffice Documents
Date Fri, 25 Jun 2004 12:42:12 GMT
Hi all,

 I'm working on a project in which we are building a knowledge
management platform. We use Turbine/Velocity as the framework and
Lucene for search.

 We want the search to be able to index MS Office documents, so I
looked for ways to extract the text from these documents. I found some
examples based on the POI library (http://jakarta.apache.org/poi) and
adapted them to our needs. I think the text extraction from XLS files
is reliable (the POI development community did a great job with the
package that works with XLS files). The examples that extract the text
from DOC and PPT files are not very general: I think they have problems
with documents written in special charsets, but they work well on the
documents I use. I hope someone with more experience than I have will
improve this and provide better source code.

 Congratulations to everyone involved in the development of the Jakarta
project and its subprojects,

 Sergiu Gordea

PS: ExeConverteImpl uses an external standalone application (like
antiword or pdf2txt) to extract the text.
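
For readers who want to see the external-application idea from the PS in
code, here is a minimal sketch. It is not the ExeConverteImpl source; the
class name ExternalTextExtractor and the method extract are made up for
illustration. It assumes the external tool takes the document path as its
last argument and writes the extracted text to standard output, and it
needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs an external converter and returns what it
// prints on standard output, so the text can be handed to the indexer.
public class ExternalTextExtractor {

    public static String extract(List<String> commandPrefix,
                                 String documentPath,
                                 Charset charset)
            throws IOException, InterruptedException {
        // Build the full command line: tool name, its options, then the file.
        List<String> command = new ArrayList<>(commandPrefix);
        command.add(documentPath);

        ProcessBuilder builder = new ProcessBuilder(command);
        // Drop diagnostics so warnings from the tool do not end up in the index.
        builder.redirectError(ProcessBuilder.Redirect.DISCARD);

        Process process = builder.start();
        // The tool gets no input on stdin.
        process.getOutputStream().close();

        byte[] raw;
        try (InputStream in = process.getInputStream()) {
            // Read everything the tool prints until it closes its output.
            raw = in.readAllBytes();
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Extractor exited with code " + exitCode
                    + " for " + documentPath);
        }
        // Decode with the charset the tool is assumed to emit.
        return new String(raw, charset);
    }
}
```

A call might look like
ExternalTextExtractor.extract(List.of("antiword"), "report.doc", StandardCharsets.UTF_8),
provided antiword is installed and on the PATH. The returned string can
then be added to a Lucene document as the content field. The sketch has no
timeout, so a tool that hangs will block the caller.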
