corinthia-dev mailing list archives

From: Peter Kelly <>
Subject: Re: Corinthia Document Model (was RE: ODF filter)
Date: Thu, 08 Jan 2015 17:57:36 GMT
> On 9 Jan 2015, at 12:02 am, jan i <> wrote:
> Without polluting with all the function calls, let me try to explain how I
> see the current source (peter@ please correct me if I am wrong).
> A filter can in principle inject any HTML5 string into the data model. Core
> delivers functions to manipulate the HTML5 model, but does not control what
> happens.
> Meaning if a filter wants to write "<p style=janPrivate,
> idJan=nogo>foo</p>" to the data, it can do that. The problem with that is
> that all the other filters need to understand this when reading data and
> generating their format.

Just to clarify the representation: it's a DOM-like model, in that we have a tree data
structure with nodes (elements and text nodes), where elements can have attributes. It's very
similar to the W3C DOM, but some of the function names and field names are different, and it
doesn't use inheritance (due to C being the implementation language). There is no string concatenation
going on during conversion - the DOM tree is parsed and serialised to XML or HTML in the standard way.
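
To make the shape concrete, here's a rough sketch of the model - written in Haskell purely for brevity (the real implementation is C structs and functions, and the names below are illustrative, not the actual API):

    -- Illustrative only: the real Corinthia model is C structs, but its shape
    -- is roughly this - a tree of elements and text nodes, where elements
    -- carry a tag name, a list of attributes, and a list of children.
    data Node
      = Element { tagName    :: String
                , attributes :: [(String, String)]  -- attribute name/value pairs
                , children   :: [Node] }
      | Text String
      deriving (Show)

    -- A tiny example tree corresponding to <p class="x">foo</p>
    example :: Node
    example = Element "p" [("class", "x")] [Text "foo"]

    main :: IO ()
    main = print example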

> My idea is that core should provide functions like (just an example)
>   addParagraph(*style, *id, *text)
> Doing that means a filter cannot write arbitrary HTML5 but only what is
> "allowed". If a filter needs a new capability, core would be extended in a
> controlled fashion and all filters updated.

One approach - admittedly radical (but don't let that stop us) - is to enforce this at the
level of the type system, based on the HTML DTD, as well as possibly the XML schema definitions
for the individual file formats. Unfortunately, C's type system isn't really powerful
enough to express the sort of constraints we'd want to enforce; Haskell is the only language
I know of whose type system is.
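
As a rough illustration of what "enforcing the DTD in the type system" could mean - the element subset and names here are invented for the example, not a proposal for the actual model - block-level and inline content can be given distinct types, so that a paragraph containing another paragraph simply fails to type-check:

    -- Hypothetical sketch: a tiny fragment of HTML where the DTD's nesting
    -- rules are encoded in the types.
    data Inline
      = PlainText String
      | Emph   [Inline]   -- <em> may contain further inline content
      | Strong [Inline]   -- <strong> likewise
      deriving (Show)

    data Block
      = Para    [Inline]  -- <p> may contain only inline content
      | Section [Block]   -- a container of block-level content
      deriving (Show)

    -- This type-checks:
    ok :: Block
    ok = Para [PlainText "foo", Emph [PlainText "bar"]]

    -- This would be rejected at compile time, which is exactly the point:
    -- bad = Para [Para [PlainText "nested paragraph"]]

    main :: IO ()
    main = print ok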

The parsing toolkit I'm working on (based on PEG - see ) takes a
grammar as input and produces a syntax tree (currently in a custom data structure, but it could
easily produce the syntax tree in XML or similar). I'm interested in taking this idea further,
making the grammar and type system one and the same, and using this to define a high-level
functional language in which transformations could be expressed. Union types are
really important here; Haskell supports them well but few other languages do. Yet the concept of
union types has been alive and well in formal grammars since the beginning - that is, a production
with multiple different possible ways of matching.
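
The correspondence is easy to see in a small, made-up example: each alternative of a grammar production becomes one constructor of a union (sum) type, and the compiler then checks that every consumer handles every alternative:

    import Data.List (intercalate)

    -- For a hypothetical PEG production
    --   value <- string / number / list
    -- the corresponding union type is:
    data Value
      = VString String
      | VNumber Double
      | VList   [Value]
      deriving (Show)

    -- Any function consuming a Value must cover every alternative;
    -- the compiler warns if the match is not exhaustive.
    render :: Value -> String
    render (VString s) = show s
    render (VNumber n) = show n
    render (VList vs)  = "[" ++ intercalate ", " (map render vs) ++ "]"

    main :: IO ()
    main = putStrLn (render (VList [VString "a", VNumber 1.5]))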

I've worked a lot with Stratego/XT ( ) in the past and have been inspired
by its unique approach to expressing language transformations. I think something like
this would be very well suited to what we want to do. My main problem with Stratego, however,
is that it's untyped; you can't enforce the restriction that a particular transformation results
in a particular type/structure, nor can you specify the types of structure it accepts. I think
a language that merges the concepts of Stratego's transformation strategies, Haskell's type system,
and PEG-based formal grammars would be a very powerful and elegant way to achieve our goals.
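
To show the kind of guarantee I mean (everything here is invented for illustration, not a design), a typed transformation states in its type both what structure it accepts and what it produces, so a rule that yields the wrong shape is rejected by the compiler:

    -- A made-up fragment of some source format (a wiki-like markup) ...
    data WikiNode = Heading Int String | WikiPara String

    -- ... and a made-up fragment of the HTML-side model.
    data HtmlNode = H Int String | P String
      deriving (Show)

    -- The transformation is total over WikiNode and guaranteed to produce
    -- HtmlNode - exactly the property Stratego cannot enforce.
    toHtml :: WikiNode -> HtmlNode
    toHtml (Heading n t) = H n t
    toHtml (WikiPara s)  = P s

    main :: IO ()
    main = mapM_ (print . toHtml) [Heading 1 "Title", WikiPara "body text"]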

My primary motivation for using formal grammars is to give us the ability to handle non-XML-based
formats, such as Markdown, RTF, LaTeX etc. With a suitable parser implementation,
we can deal with these just as easily as we can with any XML-based structure - and in
fact we could even move to a higher level of abstraction where XML is just a special case
of the more general type system. XML Schema and Relax NG (used for the OOXML and ODF specs
respectively, if I remember correctly) could also be used as inputs to the type system and
for static typing.

A programming language of this nature would allow us to formally specify the exact nature
of the intermediate form (be it a dialect of HTML or otherwise), and get static type checking
of the transformation code to a degree that can't be achieved with C/C++ or similar
languages. More static type checking also has the potential to reduce the number of required
test cases, as we can eliminate whole classes of errors through the type system.

>>  This relates to how inter-conversion is to be tested.  Is there some
>>  abstraction against which document features are assessed and mapped
>>  through, or are we working concrete level to/from concrete level and
>>  that is essentially it?
> I don't think we should test inter-conversion as such. It is much more
> efficient to test format xyz <-> HTML5. And if our usage of HTML5 is defined
> (and restricted) it should work.

Agreed. Think of it like the frontend and backend parts of a compiler. If you want to support
N languages on M CPU architectures, then you would generally have a CPU-independent intermediate
representation (essentially a high-level assembly language). You write a frontend for each
of the N languages which targets this intermediate, abstract machine (including language-specific
optimisations). You also write a backend for each of the M target CPU architectures (including
architecture-specific optimisations). You then need N+M tests, instead of N*M.

In our case, HTML is the "intermediate architecture", or more appropriately, "intermediate
format". Each filter knows about its own format (e.g. .docx) and HTML. It deals solely with
the conversion between these two formats.

If you want to convert from, say, .docx to .odt, then you first go through HTML as an intermediate
step. So the file gets converted from .docx to HTML, and then from HTML to .odt.
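
Schematically - and this is only a sketch with placeholder types, not the real filter API - each filter is a pair of functions to and from the intermediate form, and cross-format conversion is just composition through it:

    -- Placeholder types standing in for the real representations.
    newtype Html = Html String
    newtype Docx = Docx String
    newtype Odt  = Odt  String

    -- The .docx filter only knows about .docx and HTML ...
    docxToHtml :: Docx -> Html
    docxToHtml (Docx s) = Html s    -- real conversion elided

    -- ... and the .odt filter only knows about .odt and HTML.
    htmlToOdt :: Html -> Odt
    htmlToOdt (Html s) = Odt s      -- real conversion elided

    -- .docx -> .odt falls out by composition; no filter needs to know about
    -- any format other than its own and HTML.
    docxToOdt :: Docx -> Odt
    docxToOdt = htmlToOdt . docxToHtml

    main :: IO ()
    main = let Odt s = docxToOdt (Docx "document contents") in putStrLn s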

Dr Peter M. Kelly

PGP key: <>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
