corinthia-dev mailing list archives

From Peter Kelly <>
Subject Re: ODF filter
Date Thu, 08 Jan 2015 16:18:30 GMT
> On 8 Jan 2015, at 10:59 pm, Peter Kelly <> wrote:
>> On 8 Jan 2015, at 10:16 am, Dave Fisher <> wrote:
>> Hi Peter,
>> This is a helpful email from your concrete discussion I can better understand the
>> mapping between the abstract / HTML model and the concrete / DOCX, ODT.
>> You mention differences in the style runs for Word and ODT of which I am familiar
>> from the OOXML side. Does the abstract model / HTML take a particular approach towards style
>> runs? Is there a concrete version of the HTML model? Is there a specification or plan for
>> the abstract model?
> As a general principle, no - a given filter is expected to handle arbitrary HTML.
> However, there is a function for “normalising” a HTML document to change nested sets
> of inline elements (span, b, i, etc.) into a flat sequence of runs (each represented as a
> span element). The Word filter uses this, due to Word’s flat model of inline runs.

Just thought I’d add a bit more detail on this, for anyone interested in exploring the implementation:

For .docx files, DFPut (api/src/Operations.c) calls WordPut (filters/ooxml/src/word/Word.c),
which in turn creates a WordPackage object and then calls WordPackageUpdateFromHTML (filters/ooxml/src/word/WordPackage.c).
The very first thing this does is to call HTML_normalizeDocument and HTML_pushDownInlineProperties
(both in core/src/html/DFHTMLNormalization.c).

HTML_normalizeDocument merges adjacent text nodes (which in theory shouldn’t be necessary,
but I found that sometimes libxml’s parser produces two or more in a row), and then goes
through all the block-level elements, flattening any inline elements such that the resulting
block node contains a series of spans, each with a style attribute set with the appropriate
css formatting properties. For example, if you start with this:


then you’ll end up with this:

    <span style="font-weight: bold">
    <span style="font-weight: bold; font-style: italic">
    <span style="font-weight: bold">
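To make the flattening step concrete, here is a minimal Python sketch of the idea. It is not the C code in DFHTMLNormalization.c. The element-to-CSS table, the function names, and the sample input are all assumptions made up for illustration, and the sketch ignores whitespace handling and non-text inline content such as images.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from inline elements to CSS properties (illustrative only).
INLINE_CSS = {
    "b": {"font-weight": "bold"},
    "i": {"font-style": "italic"},
    "u": {"text-decoration": "underline"},
}


def parse_style(value):
    """Turn 'name: value; name: value' into a dict, preserving order."""
    props = {}
    for declaration in (value or "").split(";"):
        if ":" in declaration:
            name, _, val = declaration.partition(":")
            props[name.strip()] = val.strip()
    return props


def format_style(props):
    return "; ".join(f"{name}: {val}" for name, val in props.items())


def add_run(runs, props, text):
    """Record a run, merging it into the previous one if the formatting is equal."""
    if not text:
        return
    if runs and runs[-1][0] == props:
        runs[-1] = (props, runs[-1][1] + text)
    else:
        runs.append((props, text))


def collect_runs(node, inherited, runs):
    """Depth-first walk over an inline node, accumulating formatting on the way down."""
    props = dict(inherited)
    props.update(INLINE_CSS.get(node.tag, {}))
    props.update(parse_style(node.get("style")))
    add_run(runs, props, node.text)
    for child in node:
        collect_runs(child, props, runs)
        add_run(runs, props, child.tail)  # text after a child is formatted like its parent


def flatten_block(block):
    """Replace the block's content with a flat sequence of span elements."""
    runs = []
    add_run(runs, {}, block.text)
    for child in block:
        collect_runs(child, {}, runs)
        add_run(runs, {}, child.tail)
    block.text = None
    for child in list(block):
        block.remove(child)
    for props, text in runs:
        span = ET.SubElement(block, "span")
        if props:
            span.set("style", format_style(props))
        span.text = text
    return block


# Hypothetical input: bold text with an italic word nested inside it.
paragraph = ET.fromstring("<p><b>one <i>two</i> three</b></p>")
print(ET.tostring(flatten_block(paragraph), encoding="unicode"))
```

With this made-up input the paragraph ends up with three spans whose style attributes match the three shown above. Merging runs with identical formatting in add_run plays roughly the same role as merging adjacent text nodes.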

HTML_pushDownInlineProperties checks block elements for any CSS properties that can be applied
to inline formatting (such as font family, font size, text color) and moves them to the style
attributes of the span elements within the block element. For example, the following:

<p style="border: 1px solid black; font-size: 18">
    <span>Some text</span>
</p>

would become this:

<p style="border: 1px solid black">
    <span style="font-size: 18">Some text</span>
</p>
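The same transformation can be sketched in a few lines of Python. Again, this is only an illustration under stated assumptions and not the C implementation: the set of "inline" properties and the function names are hypothetical, and the sketch assumes the block has already been flattened so that all of its text sits inside direct child span elements.

```python
import xml.etree.ElementTree as ET

# Assumed set of properties that count as inline formatting (illustrative only).
INLINE_PROPERTIES = {
    "font-family", "font-size", "font-weight", "font-style",
    "color", "text-decoration",
}


def parse_style(value):
    """Turn 'name: value; name: value' into a dict, preserving order."""
    props = {}
    for declaration in (value or "").split(";"):
        if ":" in declaration:
            name, _, val = declaration.partition(":")
            props[name.strip()] = val.strip()
    return props


def format_style(props):
    return "; ".join(f"{name}: {val}" for name, val in props.items())


def push_down_inline_properties(block):
    """Move inline-capable properties from a block's style onto its child spans."""
    block_props = parse_style(block.get("style"))
    inline = {k: v for k, v in block_props.items() if k in INLINE_PROPERTIES}
    if not inline:
        return block
    remaining = {k: v for k, v in block_props.items() if k not in INLINE_PROPERTIES}
    for span in block.findall("span"):
        merged = dict(inline)
        merged.update(parse_style(span.get("style")))  # the span's own settings win
        span.set("style", format_style(merged))
    if remaining:
        block.set("style", format_style(remaining))
    else:
        block.attrib.pop("style", None)
    return block


paragraph = ET.fromstring(
    '<p style="border: 1px solid black; font-size: 18"><span>Some text</span></p>'
)
print(ET.tostring(push_down_inline_properties(paragraph), encoding="unicode"))
```

Letting a span's existing properties override the ones pushed down from the block mirrors normal CSS inheritance, where a value set on the inner element takes precedence.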

Both of these are pre-processing stages that happen before the primary traversal of the document
tree begins, and the later code in the Word filter expects the HTML documents to conform
to this more restrictive “dialect”. In the case of the inline properties, it’s because
these settings have to go on the rPr elements in a Word document, and are not allowed on the
pPr elements (that is, Word is more strict in terms of which formatting properties can be
set where; HTML allows you to set “inline” formatting properties on any element using
a style attribute). So this pre-processing is largely to match the needs of the Word filter,
but it’s likely that an ODF text document filter will need some pre-processing as well.
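One way to picture the restricted “dialect” is as a set of checks that a document has to pass after pre-processing. The Python sketch below is a deliberately simplified assumption of what those checks might look like; the tag and property lists and the function name are hypothetical, and a real filter would also have to allow other inline content such as links and images.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified lists used only for this illustration.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}
INLINE_PROPERTIES = {"font-family", "font-size", "font-weight", "font-style", "color"}


def style_names(element):
    """Return the set of CSS property names in an element's style attribute."""
    style = element.get("style") or ""
    return {decl.split(":", 1)[0].strip() for decl in style.split(";") if ":" in decl}


def dialect_violations(root):
    """List the ways a document departs from the flat, pushed-down form."""
    problems = []
    for block in root.iter():
        if block.tag not in BLOCK_TAGS:
            continue
        leaked = style_names(block) & INLINE_PROPERTIES
        if leaked:
            problems.append(f"<{block.tag}> carries inline properties: {sorted(leaked)}")
        if block.text and block.text.strip():
            problems.append(f"<{block.tag}> has text outside a span")
        for child in block:
            if child.tag != "span":
                problems.append(f"<{block.tag}> contains <{child.tag}> instead of a span")
            elif len(child):
                problems.append(f"<{block.tag}> contains a span with nested elements")
            if child.tail and child.tail.strip():
                problems.append(f"<{block.tag}> has text outside a span")
    return problems


raw = ET.fromstring('<body><p style="font-size: 18"><b>Some text</b></p></body>')
clean = ET.fromstring('<body><p><span style="font-size: 18">Some text</span></p></body>')
print(dialect_violations(raw))
print(dialect_violations(clean))
```

A check like this is mainly useful as a debugging aid while writing a filter, since it flags input that skipped one of the pre-processing stages.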

As we add more formats, I expect we’ll discover some common places where the HTML
input needs to be normalised to a certain form, and also places where it is better to leave
it as-is. The ability to have nested inline elements in ODF is an example of the latter; we
can probably avoid HTML_normalizeDocument in that case by having a direct relationship between
HTML inline elements and ODF text-span elements. Depending on the situation, retaining such
structure may be important - but that’s something I expect we’ll discover as we proceed
with implementation.

Dr Peter M. Kelly

PGP key: <>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
