forrest-user mailing list archives

From Ross Gardler <rgard...@apache.org>
Subject Re: Extracting content from Microsoft word into Forrest
Date Thu, 14 Oct 2004 10:16:54 GMT
Juan Jose Pablos wrote:
> Kola Oyedeji wrote:
> 
>> I'm aware the latest version of office allows content to be stored in 
>> XML.
>> Does anyone know if any work has been done in this area? Is it possible
>> perhaps to extract content from word and transform it into a format for
>> Forrest?
>>
> Hi Kola
> 
> 
> I think that there is a lot of software that provides that functionality; 
> check this note on the xml.com site:
> 
> http://www.xml.com/pub/a/2003/12/31/qa.html
> 
> It would be nice if someone could report back a solution to this list 
> for the benefit of other users.

From the above page:

"Q: How can I convert a Microsoft Word document into XML?

A: Recent versions of Word claim "save as XML" features of one kind or 
another. Maybe that "claim" is too harsh; they do create well-formed XML 
documents, after all. But it's XML of a spectacularly hideous form, even 
for simple documents -- nearly as gnarly and impenetrable to the human 
eye as XSL-FO."

This has been my experience too.

On our project we have taken a different route. We run a version of 
Open Office as a server to convert files from MS Office formats to Open 
Office format. Forrest supports Open Office, so with this conversion 
step we can support MS Office too. The benefit of this approach is that 
it also supports earlier versions of MS Office, and we have the whole 
Open Office community working on decent converters.
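
To give a rough idea of what such a conversion step looks like, here is a 
minimal Python sketch. It is not our project's code and it does not use a 
long-running server: it assumes a LibreOffice-style `soffice` binary on 
the PATH that accepts the `--headless`, `--convert-to` and `--outdir` 
options (older Open Office builds may not). The function names and the 
default `odt` target format are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, out_dir, target_format="odt", soffice="soffice"):
    """Build the argument list for one headless conversion.

    Assumes a LibreOffice-style command line; adjust for other builds.
    """
    return [
        soffice,
        "--headless",                    # run without a GUI
        "--convert-to", target_format,   # e.g. "odt" for an OpenDocument text file
        "--outdir", str(out_dir),        # directory that receives the result
        str(source),                     # the MS Office file to convert
    ]


def convert(source, out_dir, target_format="odt", soffice="soffice"):
    """Run the conversion and return the path where the output is expected."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    command = build_convert_command(source, out_dir, target_format, soffice)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return out_dir / (Path(source).stem + "." + target_format)
```

Calling `convert("report.doc", "converted")` would be expected to leave 
`converted/report.odt`, which could then be handed to Forrest like any 
other Open Office source file.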

Current status of this work is that we have an Eclipse plugin that 
allows MS Office files to be converted manually to Open Office files. It 
is my intention to move this code into a Cocoon generator so that 
Forrest/Cocoon can do the conversion on the fly. However, it is not a 
particularly high priority right now.

If anyone would like to help, you can have our code as a starter.

Ross

