lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: RE : Parsers
Date Thu, 29 May 2003 08:51:33 GMT
David Warnock wrote:
> Andrzej,
> 
> Another solution for all MS Office formats is to use OpenOffice.org; the 
> latest betas have a powerful Java SDK. So, for example, you could script a 
> central copy to open MS docs and save them as HTML for parsing in Lucene. Or 
> you could save in OpenOffice.org formats (which are zipped XML) and 
> throw those at Lucene.
> 
> Dave
> 
>>> Another solution is to use Microsoft Office itself. You can set up a 
>>> server that serves requests to convert Microsoft Office docs. There are 
>>> many ways of doing this, for example using Python to call Office 
>>> directly and then putting your Python script behind a web server.
> 
> 
> 

Yes, I checked this solution in the past, but (unless something has changed 
drastically) the OpenOffice converters and Java integration are tightly 
coupled to the whole suite, so you basically have to install the whole 
suite (50MB?) just to be able to use the converters. In my case (a 
desktop utility) that would be overkill... However, for server-based 
converters this could make a lot of sense - but then I believe you can 
work directly with the internal OO object model instead of XML files.

And I agree that their Java SDK has almost everything you may want, even 
a nice document bean that allows you to work with a document editor in a 
JComponent.
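
For the "zipped xml" route David mentions, here is a rough sketch of pulling 
plain text out of such a file without installing the suite. It assumes the 
document body lives in a zip entry named content.xml, and it only collects 
character data, so word boundaries around inline markup will be crude. The 
class and method names (OooTextExtractor, extractText, sampleZip) are made up 
for illustration and are not part of any OpenOffice.org or Lucene API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: extracts plain text from a zipped XML document.
final class OooTextExtractor {

    /** Returns the character data of content.xml, or "" if there is no such entry. */
    static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!"content.xml".equals(e.getName())) {
                    continue; // assumption: the body text is stored in content.xml
                }
                // Buffer the entry first so the XML parser never touches the zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                for (int n = zip.read(chunk); n != -1; n = zip.read(chunk)) {
                    buf.write(chunk, 0, n);
                }
                return parseText(buf.toByteArray());
            }
        }
        return "";
    }

    /** Collects all character data via SAX and collapses runs of whitespace. */
    private static String parseText(byte[] xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' '); // keep adjacent paragraphs from running together
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Substitute an empty document for any external DTD reference.
                return new InputSource(new StringReader(""));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), handler);
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    /** Demo helper: wraps an XML string as content.xml inside an in-memory zip. */
    static byte[] sampleZip(String xml) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup; real files use their own element names.
        String xml = "<doc><p>Hello</p><p>Lucene &amp; OOo</p></doc>";
        System.out.println(extractText(new ByteArrayInputStream(sampleZip(xml))));
    }
}
```

The returned string could then be handed to whatever code builds your Lucene 
document. On a server, where the full suite is installed anyway, going through 
the OO object model is probably the cleaner option.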

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

