corinthia-dev mailing list archives

From Peter Kelly <>
Subject ODF filter
Date Wed, 07 Jan 2015 13:57:55 GMT
I mentioned in my last mail the topic of writing an ODF filter. I realise the codebase is pretty
difficult to navigate right now due to lack of documentation, so I thought I’d get the discussion
started by outlining how I would suggest we proceed with this, based on my experience writing
the Word filter (I tend to use the term “Word” rather than OOXML, since the current
implementation deals only with the word processing subset of the spec; similarly for ODF
for now).

At a high level, each filter needs to provide three operations: get, put, and create. These
operate on “abstract” and “concrete” documents - an abstract document is in HTML format
(our common intermediate representation) and the concrete document is in the format which the
filter is implementing (in this case, .odt).

The get operation will need to convert from ODT to HTML, and include id attributes in the
HTML file that allow elements in the latter to be correlated with elements in the former.
In the Word filter, the ids are based on the index of the node in a pre-order traversal of
the tree. These are used to look up elements during the put operation, so we know which element
to update.
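
As a rough illustration of the id scheme described above (a sketch only, not code from the
Word filter): the SketchNode type and both function names are hypothetical stand-ins for the
real DOM types, and the sketch assumes the tree is not modified between numbering and lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal tree node; NOT the DOM type used in the codebase. */
typedef struct SketchNode {
    struct SketchNode *first_child;
    struct SketchNode *next_sibling;
    int seq; /* pre-order index, filled in by sketch_assign_seq */
} SketchNode;

/* Number every node in pre-order: a parent first, then its children from
   left to right. 'next' is the first index to hand out; the return value
   is the next unused index. */
int sketch_assign_seq(SketchNode *node, int next)
{
    assert(next >= 0);
    if (node == NULL)
        return next;
    node->seq = next++;
    for (SketchNode *c = node->first_child; c != NULL; c = c->next_sibling)
        next = sketch_assign_seq(c, next);
    return next;
}

/* Look up the node carrying a given index, as a put operation would do
   with the number recovered from an HTML id attribute. Returns NULL if
   no node in the tree has that index. */
SketchNode *sketch_find_seq(SketchNode *node, int seq)
{
    if (node == NULL)
        return NULL;
    if (node->seq == seq)
        return node;
    for (SketchNode *c = node->first_child; c != NULL; c = c->next_sibling) {
        SketchNode *found = sketch_find_seq(c, seq);
        if (found != NULL)
            return found;
    }
    return NULL;
}
```

Calling sketch_assign_seq(root, 0) numbers the whole concrete tree; the get side would then
write each number into the id attribute of the HTML element generated from that node. A linear
search is used for lookup purely to keep the sketch short.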

The put operation will need to accept an existing ODT document, and update it based on a modified
version of the HTML file that was previously obtained from the get operation. The way I did
this in the Word filter was to traverse both trees in “parallel”, determining what had
changed (and using the element mappings based on id attributes), making changes to the original
document as appropriate. In the case of formatting attributes, this involved re-generating
the CSS from the concrete document, comparing which attributes had changed, and then applying
the necessary changes to the formatting elements in the concrete document. Content was
handled differently, generally by simply overwriting it.
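
The attribute comparison step could look something like the following. This is a minimal
sketch under stated assumptions, not the Word filter's code: CSS properties are modelled as
flat name/value arrays, and SketchProp, SketchChangeFn, sketch_prop_get, and sketch_diff_props
are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flat representation of one CSS property. */
typedef struct {
    const char *name;
    const char *value;
} SketchProp;

/* Called once per difference. old_value is NULL for an added property,
   new_value is NULL for a removed one. */
typedef void (*SketchChangeFn)(const char *name, const char *old_value,
                               const char *new_value, void *ctx);

/* Return the value of the named property, or NULL if it is absent. */
const char *sketch_prop_get(const SketchProp *props, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(props[i].name, name) == 0)
            return props[i].value;
    return NULL;
}

/* Compare the properties regenerated from the concrete document (old)
   with those found in the edited HTML (new). Reports each changed,
   removed, or added property through on_change (which may be NULL) and
   returns the number of differences. */
size_t sketch_diff_props(const SketchProp *old_props, size_t old_count,
                         const SketchProp *new_props, size_t new_count,
                         SketchChangeFn on_change, void *ctx)
{
    assert(old_props != NULL || old_count == 0);
    assert(new_props != NULL || new_count == 0);
    size_t changes = 0;

    /* Properties that were changed or removed. */
    for (size_t i = 0; i < old_count; i++) {
        const char *nv = sketch_prop_get(new_props, new_count, old_props[i].name);
        if (nv == NULL || strcmp(nv, old_props[i].value) != 0) {
            if (on_change != NULL)
                on_change(old_props[i].name, old_props[i].value, nv, ctx);
            changes++;
        }
    }

    /* Properties that were added. */
    for (size_t i = 0; i < new_count; i++) {
        if (sketch_prop_get(old_props, old_count, new_props[i].name) == NULL) {
            if (on_change != NULL)
                on_change(new_props[i].name, NULL, new_props[i].value, ctx);
            changes++;
        }
    }
    return changes;
}
```

Only the properties reported as different would then need to be written back to the formatting
elements of the concrete document; everything else is left alone.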

During traversal, the functions in DFBDT.c can be used to handle the case where the children of
a given element have been re-ordered (e.g. someone moved a paragraph to a different position
in the document). This uses the id mappings in the HTML to figure out what elements in the
concrete document they correspond to, and when it sees them in a different order, it moves
some of them so that they come to match the order in which the corresponding HTML elements
appear. Unsupported elements are left untouched by this process.
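
To make the reordering idea concrete, here is a deliberately simplified sketch. It is not the
algorithm in DFBDT.c: it swaps in a plainer technique in which children are held in an array,
unsupported children (modelled with a negative id) stay in their slots, and the remaining
slots are refilled in the order given by the HTML. SketchChild and sketch_reorder are
hypothetical names, and the ids in 'order' are assumed to be unique.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical child record. id < 0 marks an unsupported element that
   has no counterpart in the HTML. */
typedef struct {
    int id;
    const char *label;
} SketchChild;

/* Rearrange the supported children so their ids appear in the sequence
   given by 'order', leaving unsupported children where they are.
   Returns 0 on success, or -1 (with 'children' unchanged) if 'order'
   does not list exactly the supported children or memory runs out. */
int sketch_reorder(SketchChild *children, size_t count,
                   const int *order, size_t order_count)
{
    assert(children != NULL || count == 0);
    if (count == 0)
        return order_count == 0 ? 0 : -1;

    size_t supported = 0;
    for (size_t i = 0; i < count; i++)
        if (children[i].id >= 0)
            supported++;
    if (supported != order_count)
        return -1;

    /* Work from a snapshot so the original can be restored on failure. */
    SketchChild *copy = malloc(count * sizeof *copy);
    if (copy == NULL)
        return -1;
    memcpy(copy, children, count * sizeof *copy);

    size_t next = 0;
    for (size_t i = 0; i < count; i++) {
        if (copy[i].id < 0)
            continue; /* unsupported: keep its slot */
        const SketchChild *src = NULL;
        for (size_t j = 0; j < count; j++) {
            if (copy[j].id == order[next]) {
                src = &copy[j];
                break;
            }
        }
        if (src == NULL) {
            memcpy(children, copy, count * sizeof *copy);
            free(copy);
            return -1;
        }
        children[i] = *src;
        next++;
    }
    free(copy);
    return 0;
}
```

In a real filter the children would be nodes in the concrete XML tree rather than array
entries, and only the nodes that are actually out of place would be moved.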

The create operation will need to produce a brand new ODT file based on an HTML file. This
can simply be implemented by creating an empty ODT file, and then doing a put operation -
it’s essentially “updating” an empty document to which new content has been added.

The entry points for these three functions are DFGet, DFPut, and DFCreate in api/src/Operations.c.
These each have a switch statement which looks at the file type and calls through to a function
in the appropriate filter to do the conversion. In the future we may need a more generic/pluggable
way of doing this, but for the time being, defining three functions ODTGet, ODTPut, and ODTCreate
(corresponding to the existing WordGet, WordPut, and WordCreate functions) and adding cases
to the switch statements for these will be sufficient.
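
The dispatch pattern described above might be sketched as follows. None of this is copied
from Operations.c: the enum, the extension-based type detection, the function signatures, and
the stub return values are all assumptions made for illustration, and the sketch_* functions
merely stand in for DFGet, WordGet, and the proposed ODTGet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical file type tags. */
typedef enum {
    SKETCH_TYPE_UNKNOWN,
    SKETCH_TYPE_DOCX,
    SKETCH_TYPE_ODT
} SketchFileType;

/* Guess the file type from the extension (case-sensitive, for brevity). */
SketchFileType sketch_type_from_path(const char *path)
{
    assert(path != NULL);
    const char *dot = strrchr(path, '.');
    if (dot == NULL)
        return SKETCH_TYPE_UNKNOWN;
    if (strcmp(dot, ".docx") == 0)
        return SKETCH_TYPE_DOCX;
    if (strcmp(dot, ".odt") == 0)
        return SKETCH_TYPE_ODT;
    return SKETCH_TYPE_UNKNOWN;
}

/* Stubs standing in for the per-filter get functions. Each returns a
   distinct code so the dispatch can be observed. */
int sketch_word_get(const char *concrete_path, const char *abstract_path)
{
    (void)concrete_path;
    (void)abstract_path;
    return 1;
}

int sketch_odt_get(const char *concrete_path, const char *abstract_path)
{
    (void)concrete_path;
    (void)abstract_path;
    return 2;
}

/* Stand-in for the top-level get entry point: switch on the concrete
   file's type and call through to the matching filter. Returns 0 when
   no filter handles the file. */
int sketch_get(const char *concrete_path, const char *abstract_path)
{
    switch (sketch_type_from_path(concrete_path)) {
        case SKETCH_TYPE_DOCX:
            return sketch_word_get(concrete_path, abstract_path);
        case SKETCH_TYPE_ODT: /* the new case an ODF filter would add */
            return sketch_odt_get(concrete_path, abstract_path);
        default:
            return 0;
    }
}
```

The put and create entry points would follow the same shape, each gaining one extra case for
ODT.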

It’s probably best to start off by having a look at these functions in filters/ooxml/src/word/Word.c
and following the code through there. If you’re using Xcode, you can easily jump through
the function call graph to go to the implementation of a called function; I expect Visual
Studio probably has something similar. At any rate, I’ve mostly chosen function names that
are not prefixes of other function names, so it should be fairly easy to find the function
you’re looking for with grep if you don’t know what file it’s in (this is something
I love about C, which you can’t do so easily using object-oriented languages).

The Word filter has two core classes used during conversion - WordPackage and WordConverter
(defined in their respective .h and .c files). A word package encapsulates a .docx file, and
contains data structures loaded from the XML files stored within the .docx package (which
is actually a zip file). There are classes for things like the stylesheet, numbering information,
the set of footnotes/endnotes, and so forth. For ODF, I already did a little bit of work
a while back defining skeleton versions of the corresponding classes (ODFPackage, ODFManifest,
and ODFSheet). The file ODF.c is empty but would be a suitable place to put the get/put/create

Data structures used in ODF differ somewhat from those of Word documents, though there is
a lot of conceptual similarity. The most significant difference I can think of is the way
that direct formatting is handled - ODF treats *everything* as a style; if you apply direct
formatting to a run of text, then it creates what’s called an “automatic style” and
references that from the content. So styles, formatting, numbering, and numerous other things
will have to be represented differently, but many of the strategies used in the Word filter
should carry across fairly easily. I need to document these better, but perhaps it’s easiest
if you get stuck to ask me questions, and then we can put these on the wiki or in the source

Anyway, this is just a braindump of what I think are the most relevant things someone implementing
an ODF filter will need to know. I’d love to be pestered with more questions about this,
as I think getting started on this important task would be a good step forward for the project,
and demonstrate our commitment to making interoperability easier for people.

Dr Peter M. Kelly

PGP key: <>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
