forrest-user mailing list archives

From: peter.dyks...@donovandata.com
Subject: Re: Extracting content from Microsoft word into Forrest
Date: Thu, 14 Oct 2004 15:47:45 GMT

Another approach that looks promising is to save the Word doc as html and 
then process it in Forrest using Forrest's html capabilities. This uses 
JTidy (which is built into Cocoon) to convert the html to xhtml, which is 
a form of xml.

To do this I added a match for 'htm' to the resolver section of the 
forrest.xmap file. It mimics the processing for ihtml files, with an added 
transform step that applies a 'fixword.xsl' stylesheet. That stylesheet 
does a few things, such as extracting the comments in the Word htm that 
contain metadata (author, date saved, etc.) and passing them along as xml. 
(JTidy can also be set to drop all the Word comments automatically, but we 
decided some are useful.)
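
To illustrate the kind of metadata involved: Word's htm output typically 
carries document properties in a conditional comment that wraps an 
o:DocumentProperties element. The following is a rough sketch in Python of 
pulling those values out. It is not the fixword.xsl stylesheet, and the 
function name word_properties and the exact element names are assumptions 
made for illustration.

```python
import re

# Assumption: Word stores properties inside <o:DocumentProperties>, itself
# wrapped in a conditional comment in the <head> of the saved htm file.
PROPS_BLOCK = re.compile(
    r"<o:DocumentProperties>(.*?)</o:DocumentProperties>", re.DOTALL
)
# Each property is assumed to be a simple element, e.g. <o:Author>...</o:Author>.
PROP = re.compile(r"<o:(\w+)>(.*?)</o:\1>", re.DOTALL)


def word_properties(html_text):
    """Return a dict of property name -> value, or {} if no block is found."""
    block = PROPS_BLOCK.search(html_text)
    if block is None:
        return {}
    return {name: value.strip() for name, value in PROP.findall(block.group(1))}
```

A call such as word_properties(open("report.htm").read()) would return a 
dictionary keyed by property name, where report.htm is a hypothetical file 
saved from Word.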

I am still having a problem that seems related to namespaces, though, 
which I do not fully understand. To make the htm document work I have to 
go in and remove the xmlns attribute from the Word htm document. After I 
do that, the document comes through fine in Forrest. If I don't, the tags 
aren't recognized. (I can't find a way to drop the attribute in JTidy, 
which helpfully inserts the 'proper' namespace information. Does anyone 
have a tip about that?)
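
The manual step described above could be automated before the file reaches 
Forrest. A likely cause of the symptom is that stylesheets matching 
unprefixed element names do not match elements that sit in a default 
namespace. Below is a minimal sketch in Python, run outside the Cocoon 
pipeline, assuming the default namespace is declared on the opening html 
tag. The function name strip_default_xmlns is hypothetical, and this is a 
pre-processing workaround, not a JTidy setting.

```python
import re

# Matches a default-namespace declaration (xmlns="...") but not prefixed
# declarations such as xmlns:o="...", which are left alone.
DEFAULT_XMLNS = re.compile(r"""\s+xmlns\s*=\s*(?:"[^"]*"|'[^']*')""")
# Matches the opening <html ...> tag, including attributes spread over lines.
HTML_START_TAG = re.compile(r"<html\b[^>]*>", re.IGNORECASE)


def strip_default_xmlns(html_text):
    """Remove the default xmlns attribute from the first <html> start tag."""
    def clean(match):
        return DEFAULT_XMLNS.sub("", match.group(0))

    # count=1: only the first <html ...> tag is rewritten.
    return HTML_START_TAG.sub(clean, html_text, count=1)
```

Only the opening html tag is rewritten, so any xmlns attributes further 
down the document are left untouched.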




Ross Gardler <rgardler@apache.org>
10/14/2004 06:16 AM
Please respond to user@forrest.apache.org

To: user@forrest.apache.org
Subject: Re: Extracting content from Microsoft word into Forrest

Juan Jose Pablos wrote:
> Kola Oyedeji wrote:
> 
>> I'm aware the latest version of office allows content to be stored in 
>> XML.
>> Does anyone know if any work has been done in this area? Is it possible
>> perhaps to extract content from word and transform it into a format for
>> Forrest?
>>
> Hi Kola
> 
> 
> I think there is a lot of software that provides that functionality;
> check this note on the xml.com site:
> 
> http://www.xml.com/pub/a/2003/12/31/qa.html
> 
> It would be nice if someone could report back a solution to this list 
> for the benefit of other users.

From the above page:

"Q: How can I convert a Microsoft Word document into XML?

A: Recent versions of Word claim "save as XML" features of one kind or 
another. Maybe that "claim" is too harsh; they do create well-formed XML 
documents, after all. But it's XML of a spectacularly hideous form, even 
for simple documents -- nearly as gnarly and impenetrable to the human 
eye as XSL-FO."

This has been my experience too.

Over on our project we have gone a different route. We use a version of 
Open Office running as a server to convert from MS format to Open 
Office. Forrest supports Open Office, so with this conversion process we 
can support MS Office too. The benefit of this approach is that it also 
supports earlier versions of MS Office, and we have the whole Open Office 
community working on decent converters.

The current status of this work is that we have an Eclipse plugin that 
allows MS Office files to be converted manually to Open Office files. It 
is my intention to move this code into a Cocoon generator so that 
Forrest/Cocoon can do the conversion on the fly. However, it is not a 
particularly high priority right now.

If anyone would like to help, you can have our code as a starter.

Ross


