corinthia-dev mailing list archives

From Dave Fisher <dave2w...@comcast.net>
Subject Re: ODF filter
Date Thu, 08 Jan 2015 03:16:31 GMT
Hi Peter,

This is a helpful email. From your concrete discussion I can better understand the mapping
between the abstract (HTML) model and the concrete formats (DOCX, ODT).

You mention differences in the style runs for Word and ODT, which I am familiar with from the
OOXML side. Does the abstract model / HTML take a particular approach towards style runs?
Is there a concrete version of the HTML model? Is there a specification or plan for the abstract
model?

I also think that one approach towards other file format filters that could be interesting
would be to focus on PUT functionality before GET. Understanding how to write a proper document
is the first step towards reading documents in all of the historical variations. PDF is a
classic example of this. Adobe has always done well defining what a valid PDF document looks
like, but after 24 years there are myriad variants that are valid.

Regards,
Dave

On Jan 7, 2015, at 5:57 AM, Peter Kelly wrote:

> I mentioned in my last mail the topic of writing an ODF filter. I realise the codebase
is pretty difficult to navigate right now due to lack of documentation, so I thought I’d
get the discussion started by outlining how I would suggest we proceed with this, based on
my experience writing the Word filter (I tend to use the term “Word” rather than OOXML,
since the current implementation deals only with the word processing subset of the spec;
similarly for ODF for now).
> 
> At a high-level, each filter needs to provide three operations: get, put, and create.
These operate on “abstract” and “concrete” documents - an abstract document is in
HTML format (our common intermediate representation) and the concrete document is in the format
which the filter is implementing (in this case, .odt).
> 
> The get operation will need to convert from ODT to HTML, and include id attributes in
the HTML file that allow elements in the latter to be correlated with elements in the former.
In the Word filter, the ids are based on the index of the node in a pre-order traversal of
the tree. These are used to look up elements during the put operation, so we know which element
to update.
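> 
> A minimal sketch of the id scheme described above, assuming a simplified tree. The Node
type, the "odt" id prefix, and the function names are hypothetical and are not taken from
the Corinthia sources; the sketch only shows pre-order numbering and the lookup that a put
operation would rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical minimal tree node; NOT the node type used in Corinthia. */
typedef struct Node {
    const char *name;
    struct Node *firstChild;
    struct Node *nextSibling;
    int seqNo;                 /* index of this node in a pre-order traversal */
} Node;

/* Number nodes in pre-order: the parent first, then its children left to
   right. Returns the next unused index. */
int assignSeqNos(Node *node, int next)
{
    assert(node != NULL);
    node->seqNo = next++;
    for (Node *child = node->firstChild; child != NULL; child = child->nextSibling)
        next = assignSeqNos(child, next);
    return next;
}

/* Format the id attribute value for a node, e.g. "odt2".
   The "odt" prefix is an assumption made for this example. */
void formatId(const Node *node, char *buf, size_t size)
{
    assert(node != NULL && buf != NULL);
    snprintf(buf, size, "odt%d", node->seqNo);
}

/* Find the node carrying a given sequence number, or NULL if there is none. */
Node *findBySeqNo(Node *node, int seqNo)
{
    if (node == NULL)
        return NULL;
    if (node->seqNo == seqNo)
        return node;
    for (Node *child = node->firstChild; child != NULL; child = child->nextSibling) {
        Node *found = findBySeqNo(child, seqNo);
        if (found != NULL)
            return found;
    }
    return NULL;
}
```

Calling assignSeqNos(root, 0) once after parsing numbers the whole tree. The linear
findBySeqNo is only for illustration; a real filter would more likely keep a table from
sequence number to node.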
> 
> The put operation will need to accept an existing ODT document, and update it based on
a modified version of the HTML file that was previously obtained from the get operation. The
way I did this in the Word filter was to traverse both trees in “parallel”, determining
what had changed (and using the element mappings based on id attributes), making changes to
the original document as appropriate. In the case of formatting attributes, this involved
re-generating the CSS from the concrete document, comparing which attributes had changed,
and then applying the necessary changes to the formatting elements in the concrete document.
Content was handled differently, generally by simply overwriting it.
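> 
> The attribute comparison step could look roughly like the following. This is a sketch
under assumptions: the CSSProp type, the ChangeFn callback, and diffProps are hypothetical
names, and properties are modelled as plain name/value string pairs rather than whatever
the Word filter actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flat representation of one CSS property. */
typedef struct {
    const char *name;
    const char *value;
} CSSProp;

/* Called once per difference. oldValue is NULL for an added property,
   newValue is NULL for a removed one. */
typedef void (*ChangeFn)(const char *name, const char *oldValue,
                         const char *newValue, void *ctx);

static const char *lookup(const CSSProp *props, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(props[i].name, name) == 0)
            return props[i].value;
    return NULL;
}

/* Compare the properties regenerated from the concrete document (oldProps)
   with those found in the modified HTML (newProps). Reports each changed,
   added, or removed property through onChange (which may be NULL) and
   returns the number of differences. */
int diffProps(const CSSProp *oldProps, size_t oldCount,
              const CSSProp *newProps, size_t newCount,
              ChangeFn onChange, void *ctx)
{
    assert(oldProps != NULL || oldCount == 0);
    assert(newProps != NULL || newCount == 0);
    int changes = 0;

    /* Changed or added properties. */
    for (size_t i = 0; i < newCount; i++) {
        const char *oldValue = lookup(oldProps, oldCount, newProps[i].name);
        if (oldValue == NULL || strcmp(oldValue, newProps[i].value) != 0) {
            if (onChange != NULL)
                onChange(newProps[i].name, oldValue, newProps[i].value, ctx);
            changes++;
        }
    }

    /* Removed properties. */
    for (size_t i = 0; i < oldCount; i++) {
        if (lookup(newProps, newCount, oldProps[i].name) == NULL) {
            if (onChange != NULL)
                onChange(oldProps[i].name, oldProps[i].value, NULL, ctx);
            changes++;
        }
    }
    return changes;
}
```

The callback is where a filter would translate each difference into an edit of the
formatting elements in the concrete document; properties that did not change are never
touched.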
> 
> During traversal, the functions in DFBDT.c can be used to handle the case where the children
of a given element have been re-ordered (e.g. someone moved a paragraph to a different position
in the document). This uses the id mappings in the HTML to figure out what elements in the
concrete document they correspond to, and when it sees them in a different order, it moves
some of them so that they come to match the order in which the corresponding HTML elements
appear. Unsupported elements are left untouched by this process.
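> 
> To illustrate the idea only, here is a simplified reordering sketch. It is not the
algorithm in DFBDT.c: the Child type, the UNSUPPORTED marker, and reorderChildren are
hypothetical, and the sketch assumes insertions and deletions have already been dealt with,
so the HTML order lists every mapped child exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define UNSUPPORTED (-1)   /* child with no counterpart in the HTML */

/* Hypothetical child of a concrete element, tagged with its mapping id. */
typedef struct {
    int id;
    const char *label;
} Child;

/* Rearrange children so that mapped children appear in the order given by
   htmlOrder, while unsupported children keep their original positions.
   Returns the number of mapped children that ended up in a different slot,
   or -1 (leaving children unmodified) if htmlOrder is not exactly a
   permutation of the mapped ids or memory runs out. */
int reorderChildren(Child *children, size_t count,
                    const int *htmlOrder, size_t htmlCount)
{
    assert(children != NULL || count == 0);
    assert(htmlOrder != NULL || htmlCount == 0);
    if (count == 0)
        return htmlCount == 0 ? 0 : -1;

    size_t slot = 0;
    int changed = 0;
    int moved = -1;
    Child *result = malloc(count * sizeof *result);
    unsigned char *used = calloc(count, 1);
    if (result == NULL || used == NULL)
        goto done;
    memcpy(result, children, count * sizeof *result);

    for (size_t h = 0; h < htmlCount; h++) {
        /* Find the not-yet-placed child that this HTML element maps to. */
        size_t src = count;
        for (size_t i = 0; i < count; i++) {
            if (!used[i] && children[i].id != UNSUPPORTED &&
                children[i].id == htmlOrder[h]) {
                src = i;
                break;
            }
        }
        if (src == count)
            goto done;              /* unknown or duplicated id */
        used[src] = 1;

        /* Skip slots held by unsupported children; they stay where they are. */
        while (children[slot].id == UNSUPPORTED)
            slot++;
        if (src != slot)
            changed++;
        result[slot++] = children[src];
    }

    /* Every mapped child must have been placed. */
    for (size_t i = 0; i < count; i++)
        if (children[i].id != UNSUPPORTED && !used[i])
            goto done;

    memcpy(children, result, count * sizeof *result);
    moved = changed;
done:
    free(result);
    free(used);
    return moved;
}
```

For example, children A(1), X(unsupported), B(2), C(3) with an HTML order of 3, 1, 2
become C, X, A, B: the unsupported element X stays in the second slot.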
> 
> The create operation will need to produce a brand new ODT file based on an HTML file.
This can simply be implemented by creating an empty ODT file, and then doing a put operation
- it’s essentially “updating” an empty document to which new content has been added.
> 
> The entry points for these three functions are DFGet, DFPut, and DFCreate in api/src/Operations.c.
These each have a switch statement which looks at the file type and calls through to a function
in the appropriate filter to do the conversion. In the future we may need a more generic/pluggable
way of doing this, but for the time being, defining three functions ODTGet, ODTPut, and ODTCreate
(corresponding to the existing WordGet, WordPut, and WordCreate functions) and adding cases
to the switch statements for these will be sufficient.
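> 
> The shape of such a switch-based dispatch might be sketched as follows. Everything here
is an assumption for illustration: the FileType enum, fileTypeFromName, and the stub
functions are hypothetical, and the real DFGet, WordGet, and ODTGet signatures will differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical file type tag; not the type used in Operations.c. */
typedef enum { FileTypeUnknown, FileTypeDocx, FileTypeOdt } FileType;

/* Guess the file type from the filename extension. */
FileType fileTypeFromName(const char *filename)
{
    assert(filename != NULL);
    const char *dot = strrchr(filename, '.');
    if (dot == NULL)
        return FileTypeUnknown;
    if (strcmp(dot, ".docx") == 0)
        return FileTypeDocx;
    if (strcmp(dot, ".odt") == 0)
        return FileTypeOdt;
    return FileTypeUnknown;
}

/* Stubs standing in for the per-filter entry points. They return a tag
   instead of doing a conversion so that the dispatch can be observed. */
const char *stubWordGet(const char *path) { (void)path; return "word"; }
const char *stubODTGet(const char *path)  { (void)path; return "odt"; }

/* Dispatch on file type, in the style of the switch statements described
   above. Returns NULL for unsupported file types. */
const char *sketchGet(const char *concretePath)
{
    switch (fileTypeFromName(concretePath)) {
        case FileTypeDocx:
            return stubWordGet(concretePath);
        case FileTypeOdt:               /* the new case an ODF filter adds */
            return stubODTGet(concretePath);
        default:
            return NULL;
    }
}
```

Adding a format then amounts to one new enum value and one new case in each of the get,
put, and create switches.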
> 
> It’s probably best to start off by having a look at these functions in filters/ooxml/src/word/Word.c
and following the code through there. If you’re using Xcode, you can easily jump through
the function call graph to go to the implementation of a called function; I expect Visual
Studio probably has something similar. At any rate, I’ve mostly chosen function names that
are not prefixes of other function names, so it should be fairly easy to find the function
you’re looking for with grep if you don’t know what file it’s in (this is something
I love about C, which you can’t do so easily using object-oriented languages).
> 
> The Word filter has two core classes used during conversion - WordPackage and WordConverter
(defined in their respective .h and .c files). A word package encapsulates a .docx file, and
contains data structures loaded from the XML files stored within the .docx package (which
is actually a zip file). There are classes for things like the stylesheet, numbering information,
the set of footnotes/endnotes, and so forth. For ODF, I already did a little bit of work
a while back defining skeleton versions of the corresponding classes (ODFPackage, ODFManifest,
and ODFSheet). The file ODF.c is empty but would be a suitable place to put the get/put/create
functions.
> 
> Data structures used in ODF differ somewhat from those of Word documents, though there
is a lot of conceptual similarity. The most significant difference I can think of is the way
that direct formatting is handled - ODF treats *everything* as a style; if you apply direct
formatting to a run of text, then it creates what’s called an “automatic style” and
references that from the content. So styles, formatting, numbering, and numerous other things
will have to be represented differently, but many of the strategies used in the Word filter
should carry across fairly easily. I need to document these better, but perhaps it’s easiest
if you get stuck to ask me questions, and then we can put these on the wiki or in the source
documentation.
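> 
> As a rough sketch of what the automatic-style approach implies for a put operation: each
distinct set of direct formatting needs a style of its own, which the content then
references by name (in ODF via the text:style-name attribute). The AutoStylePool type,
autoStyleFor, the "T1", "T2" naming, and the flat property string are all assumptions made
for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_STYLES 64
#define MAX_PROPS  128

/* Hypothetical automatic style: a generated name plus its formatting,
   kept as a flat string such as "font-weight:bold" for simplicity. */
typedef struct {
    char name[16];
    char props[MAX_PROPS];
} AutoStyle;

typedef struct {
    AutoStyle styles[MAX_STYLES];
    int count;
} AutoStylePool;

/* Return the name of the automatic style holding exactly these properties,
   creating a new one if none exists yet. Returns NULL if the pool is full
   or the property string is too long. */
const char *autoStyleFor(AutoStylePool *pool, const char *props)
{
    assert(pool != NULL && props != NULL);

    /* Reuse an existing style so identical formatting shares one entry. */
    for (int i = 0; i < pool->count; i++)
        if (strcmp(pool->styles[i].props, props) == 0)
            return pool->styles[i].name;

    if (pool->count == MAX_STYLES || strlen(props) >= MAX_PROPS)
        return NULL;

    AutoStyle *style = &pool->styles[pool->count++];
    snprintf(style->name, sizeof style->name, "T%d", pool->count);
    strcpy(style->props, props);
    return style->name;
}
```

A run made bold in the HTML would then be written as a span referencing the returned name,
rather than carrying formatting properties inline as a Word run does.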
> 
> Anyway, this is just a braindump of what I think are the most relevant things someone implementing
an ODF filter will need to know. I’d love to be pestered with more questions about this,
as I think getting started on this important task would be a good step forward for the project,
and demonstrate our commitment to making interoperability easier for people.
> 
> —
> Dr Peter M. Kelly
> pmkelly@apache.org
> 
> PGP key: http://www.kellypmk.net/pgp-key
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
> 

