Another approach which looks promising
is to save the Word doc as html and then process it in Forrest using Forrest's
html capabilities. This uses JTidy (which is built into Cocoon) to
convert the html to xhtml, which is a form of xml.
To do this I added a match in the forrest.xmap
file resolver section to process 'htm' that mimics the processing for ihtml
files, with an added transform step that applies a 'fixword.xsl' stylesheet.
The stylesheet does a few things, such as extracting the comments in the Word
htm that contain metadata (author, date saved, etc.) and passing them along as xml. (JTidy
can also be set to drop all the Word comments automatically, but we decided
some are useful.)
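
To illustrate the kind of metadata extraction described above, here is a minimal Python sketch. It is not the 'fixword.xsl' stylesheet itself (that is XSLT and is not shown here). It assumes the Word htm keeps its properties in an <o:DocumentProperties> block inside a conditional comment, which is how Word's "save as html" usually writes them; the function name extract_word_metadata is made up for this example.

```python
import re

# Word usually writes document properties inside a conditional comment:
# <!--[if gte mso 9]><xml><o:DocumentProperties>...</o:DocumentProperties></xml><![endif]-->
PROPS_BLOCK = re.compile(
    r"<o:DocumentProperties>(.*?)</o:DocumentProperties>", re.DOTALL
)
# Matches simple leaf elements such as <o:Author>Jane Doe</o:Author>
PROP = re.compile(r"<o:(\w+)>([^<]*)</o:\1>")


def extract_word_metadata(html: str) -> dict:
    """Return the Word document properties found in the htm as a dict.

    Hypothetical helper for illustration; returns {} if no block is found.
    """
    block = PROPS_BLOCK.search(html)
    if block is None:
        return {}
    return {name: value.strip() for name, value in PROP.findall(block.group(1))}


SAMPLE = """<html><head>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Jane Doe</o:Author>
  <o:LastSaved>2004-10-14T06:16:00Z</o:LastSaved>
 </o:DocumentProperties>
</xml><![endif]-->
</head><body><p>Hello</p></body></html>"""

if __name__ == "__main__":
    print(extract_word_metadata(SAMPLE))
```

A regular expression is enough here because the property block is flat; a stylesheet in the pipeline would do the same job on the xhtml that JTidy produces.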
I am still having a problem that seems
related to namespaces, though, which I do not fully understand. To
make the htm document work I have to go in and remove the xmlns attribute
from the Word htm document. After I do that the document comes through
fine in Forrest. If I don't do it, the tags aren't recognized. (I can't
find a way to drop this out in JTidy, which helpfully inserts the 'proper'
namespace information. Anyone have a tip about that?)
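
Until there is a proper fix, the manual step of removing the xmlns attribute could be scripted before the file goes into Forrest. The following is only a sketch of that workaround in Python, not a change to JTidy or to the sitemap; it assumes the default namespace declaration sits on the opening <html> tag, and the function name strip_default_xmlns is invented for this example. Prefixed declarations such as xmlns:o are left alone.

```python
import re

# The opening <html ...> tag, including all of its attributes.
HTML_TAG = re.compile(r"<html\b[^>]*>", re.IGNORECASE)
# The default namespace declaration only: xmlns="..." (not xmlns:o="...").
DEFAULT_XMLNS = re.compile(r"""\s+xmlns\s*=\s*("[^"]*"|'[^']*')""")


def strip_default_xmlns(html: str) -> str:
    """Remove the default xmlns attribute from the first <html> tag.

    Hypothetical helper; the rest of the document is returned unchanged.
    """
    def clean(match):
        return DEFAULT_XMLNS.sub("", match.group(0))

    return HTML_TAG.sub(clean, html, count=1)


if __name__ == "__main__":
    word_htm = (
        '<html xmlns:o="urn:schemas-microsoft-com:office:office" '
        'xmlns="http://www.w3.org/TR/REC-html40"><body><p>Hi</p></body></html>'
    )
    print(strip_default_xmlns(word_htm))
```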
Ross Gardler <firstname.lastname@example.org> wrote on 10/14/2004 06:16 AM:
Subject: Re: Extracting content from Microsoft word into Forrest
Juan Jose Pablos wrote:
> Kola Oyedeji wrote:
>> I'm aware the latest version of office allows content to be stored
>> Does anyone know if any work has been done in this area? Is it
>> perhaps to extract content from word and transform it into a format
> Hi Kola
> I think that there is a lot of software that provides that functionality,
> check this note on the xml.com site:
> It would be nice if someone could report back a solution to this list
> for the benefit of other users.
From the above page:
"Q: How can I convert a Microsoft Word document into XML?
A: Recent versions of Word claim "save as XML" features of one sort or
another. Maybe that "claim" is too harsh; they do create well-formed
documents, after all. But it's XML of a spectacularly hideous form, even
for simple documents -- nearly as gnarly and impenetrable to the human
eye as XSL-FO."
This has been my experience too.
Over on our project we have gone a different route. We use a version of
Open Office running as a server to convert from MS format to Open
Office. Forrest supports Open Office, so with this conversion process we
can support MS Office too. The benefit of this approach is that it also
supports earlier versions of MS Office, and we have the whole Open Office
community working on decent converters.
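
As a rough illustration of the conversion step only, here is a Python sketch that hands a Word file to the office suite for conversion. This is not our plugin code and it does not use the server mode described above. It assumes a recent OpenOffice/LibreOffice install with the soffice executable on the PATH and its --headless and --convert-to command-line options; the function names are invented for this example.

```python
import subprocess
from pathlib import Path


def build_convert_command(source: str, outdir: str, target_format: str = "odt") -> list:
    """Build the soffice command line for a one-off headless conversion.

    Assumes a soffice binary on the PATH that supports --convert-to.
    """
    return [
        "soffice",
        "--headless",
        "--convert-to", target_format,
        "--outdir", outdir,
        source,
    ]


def convert_to_open_office(source: str, outdir: str) -> Path:
    """Run the conversion and return the expected path of the converted file."""
    subprocess.run(build_convert_command(source, outdir), check=True)
    return Path(outdir) / (Path(source).stem + ".odt")


if __name__ == "__main__":
    # Only prints the command; call convert_to_open_office() to actually run it.
    print(build_convert_command("report.doc", "converted"))
```

The converted file could then be dropped into the Forrest content tree like any other Open Office document.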
Current status of this work is that we have an Eclipse plugin that
allows MS Office files to be converted manually to Open Office files. It
is my intention to move this code into a Cocoon generator so that
Forrest/Cocoon can do the conversion on the fly. However, it is not a
particularly high priority right now.
If anyone would like to help, you can have our code as a starter.