corinthia-dev mailing list archives

From jan i <j...@apache.org>
Subject Re: ODF filter
Date Thu, 08 Jan 2015 16:11:48 GMT
On 8 January 2015 at 16:59, Peter Kelly <pmkelly@apache.org> wrote:

> > On 8 Jan 2015, at 10:16 am, Dave Fisher <dave2wave@comcast.net> wrote:
> >
> > Hi Peter,
> >
> > This is a helpful email. From your concrete discussion I can better
> > understand the mapping between the abstract / HTML model and the
> > concrete / DOCX, ODT.
> >
> > You mention differences in the style runs for Word and ODT, with which I
> > am familiar from the OOXML side. Does the abstract model / HTML take a
> > particular approach towards style runs? Is there a concrete version of the
> > HTML model? Is there a specification or plan for the abstract model?
>
> As a general principle, no - a given filter is expected to handle
> arbitrary HTML.
>
> However, there is a function for “normalising” an HTML document to change
> nested sets of inline elements (span, b, i, etc.) into a flat sequence of
> runs (each represented as a span element). The Word filter uses this, due
> to Word’s flat model of inline runs.
>
> ODF text documents, on the other hand, *do* support nested formatting
> runs, so when writing this filter it may make sense not to apply the
> normalisation process used in the Word filter. This should be done if there
> is information that could not be represented in HTML and would be lost by
> flattening the structure like we do for Word.
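
To make the flattening described above concrete, here is a minimal sketch in Python. It is not the Corinthia normalisation function: the names INLINE, flatten_runs and to_flat_html are hypothetical, and recording the collapsed formatting in a class attribute is an assumption made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the set of elements treated as inline formatting.
INLINE = {"span", "b", "i", "u", "em", "strong"}


def flatten_runs(paragraph):
    """Return a flat list of (formatting, text) runs for one paragraph.

    `formatting` is a tuple of the inline tags enclosing the text,
    outermost first.
    """
    runs = []

    def walk(node, active):
        # Text directly inside `node`, before its first child.
        if node.text:
            runs.append((active, node.text))
        for child in node:
            nested = active + (child.tag,) if child.tag in INLINE else active
            walk(child, nested)
            # Text after the child belongs to the enclosing formatting.
            if child.tail:
                runs.append((active, child.tail))

    walk(paragraph, ())
    return runs


def to_flat_html(paragraph):
    """Rebuild the paragraph as a flat sequence of span elements."""
    flat = ET.Element(paragraph.tag, paragraph.attrib)
    for formatting, text in flatten_runs(paragraph):
        span = ET.SubElement(flat, "span")
        if formatting:
            # Hypothetical encoding: one class name per collapsed tag.
            span.set("class", " ".join(formatting))
        span.text = text
    return flat


nested = ET.fromstring("<p>a<b>b<i>c</i>d</b>e</p>")
print(ET.tostring(to_flat_html(nested), encoding="unicode"))
```

A filter for a format with nested runs could skip this step and walk the nested tree directly, which is the option discussed for ODF above.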
>
> There have been a few times where the topic of what internal representation
> we should use has been raised - whether we should stick with HTML, come up
> with our own entirely different model, or something else. I personally
> think HTML is a good choice, but perhaps for those who have raised the
> issue of an alternate intermediate form, this might be a good time to start
> that discussion ;)
>

Point taken; I assume I am the first who questioned it. But just to be
precise: I am happy having HTML as the internal structure, but I am unhappy
that filters can do what they like with the HTML. My goal is to define a
set of access functions that filters should use to navigate/insert/delete
tags, plus restrictions on what can be put in the tags. Just imagine that one
filter needs to identify some tags and therefore uses id=, while another
filter needs to name some tags and therefore uses name=. If we are not careful
here, it will explode, and reading the HTML becomes nearly as complicated as
reading the formats directly. We should have one and only one HTML
definition, which the filters can use.
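
As a rough illustration of what I mean, here is a minimal sketch in Python. Nothing in it is an existing Corinthia API: ALLOWED_TAGS, ALLOWED_ATTRS, insert_element, delete_element and find_by_id are hypothetical names, and the allowed sets are placeholders for whatever the single HTML definition ends up permitting.

```python
import xml.etree.ElementTree as ET

# Assumption: the one shared HTML definition, reduced to two small sets.
ALLOWED_TAGS = {"body", "p", "span", "b", "i"}
ALLOWED_ATTRS = {"id", "class"}  # note: name= is deliberately not allowed


def insert_element(parent, tag, attrs=None, index=None):
    """Create a child element, rejecting tags or attributes outside the definition."""
    attrs = dict(attrs or {})
    if tag not in ALLOWED_TAGS:
        raise ValueError(f"tag not allowed: {tag}")
    for name in attrs:
        if name not in ALLOWED_ATTRS:
            raise ValueError(f"attribute not allowed: {name}")
    child = ET.Element(tag, attrs)
    if index is None:
        parent.append(child)
    else:
        parent.insert(index, child)
    return child


def delete_element(parent, child):
    """Remove a direct child from its parent."""
    parent.remove(child)


def find_by_id(root, ident):
    """Navigate by id=, the only identification scheme filters may use here."""
    for element in root.iter():
        if element.get("id") == ident:
            return element
    return None


body = ET.Element("body")
intro = insert_element(body, "p", {"id": "intro"})
assert find_by_id(body, "intro") is intro

try:
    insert_element(body, "p", {"name": "intro"})
except ValueError as error:
    print(error)
```

The point of routing everything through such functions is that a filter that tries to use name= fails immediately, instead of producing HTML that the other filters cannot read.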

rgds
jan I.

>
> —
> Dr Peter M. Kelly
> pmkelly@apache.org
>
> PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
>
>
