incubator-ooo-dev mailing list archives

From "Dennis E. Hamilton" <dennis.hamil...@acm.org>
Subject RE: Resolving MS Word Binary File Format
Date Wed, 18 Jan 2012 01:55:09 GMT
I just downloaded the 70MB Zip of the complete set from the first link on this page:
<http://msdn.microsoft.com/en-us/library/cc313118.aspx>.

The [MS-DOC].pdf is over 600 pages.  So it is not a treat to implement from scratch, especially
when figuring out how to map to and from the OpenOffice.org model.  Having the code of an existing
converter would provide something to gut for structure, and maybe even to morph, rather than
starting from scratch.  (There are also related documents that need to be consulted for specialized
aspects that are common across the Microsoft Office programs.)

Based on the quality that I found in the RTF specification, I suspect there is more than enough
to base an implementation on, but that is a superficial appraisal of this document.

There is also code that works with these formats (e.g., Apache POI) and there may be other
converters that can be consulted.  I thought there was a relevant SourceForge project, but
POI may be more current and active.

Consultation of the specifications for OOXML might also be helpful, since there is considerable
semantic harmony between those and the binaries, at least to a point.

No implementation should be attempted that is not test-driven and, in particular, heavily tested
with documents going in and out of the Microsoft Office products.
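
As a rough illustration of what test-driven could mean here, the following is a minimal
Python sketch of a corpus round-trip check.  The importer and exporter arguments are
hypothetical stand-ins for whatever filter code is under test; they are assumptions made
for the example and do not reflect any actual OpenOffice.org or POI interface.

```python
from typing import Any, Callable, Dict, List


def round_trip_failures(
    corpus: Dict[str, bytes],
    importer: Callable[[bytes], Any],   # hypothetical: binary file -> document model
    exporter: Callable[[Any], bytes],   # hypothetical: document model -> binary file
) -> List[str]:
    """Return the names of corpus files that do not survive import -> export -> import."""
    failures: List[str] = []
    for name, data in sorted(corpus.items()):
        try:
            model = importer(data)
            # Compare models rather than bytes: two different byte streams
            # may legitimately encode the same document.
            if importer(exporter(model)) != model:
                failures.append(name)
        except Exception:
            # A filter that raises on a real-world file counts as a failure too.
            failures.append(name)
    return failures
```

In practice the corpus would be files saved by the Microsoft Office products themselves,
and the exported files would also need to be opened in those products, which a check
like this cannot cover.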

 - Dennis

-----Original Message-----
From: Andrea Pescetti [mailto:pescetti@apache.org] 
Sent: Tuesday, January 17, 2012 14:50
To: ooo-dev@incubator.apache.org
Subject: Re: Resolving MS Word Binary File Format

On 09/01/2012 Liang Weike wrote:
> I'm making an investigation of OpenOffice processing the documents of MS
> Office's binary file formats. ...
> So, has OpenOffice improved the flow and construction of resolving MS
> Office's binary file formats after MS offered the specification?

As far as I know, no substantial rewriting of the filters for
doc/xls/ppt files happened after Microsoft released the specification:
I've seen several incremental improvements over the years, but never a
complete rewrite.

I don't actually know if Microsoft released the specification in a form
that would make it easy to write an import filter from scratch.

Regards,
  Andrea.

