corinthia-dev mailing list archives

From "Dennis E. Hamilton" <dennis.hamil...@acm.org>
Subject Corinthia Document Model (was RE: ODF filter)
Date Thu, 08 Jan 2015 16:40:39 GMT
 -- reply below to --
From: jan i [mailto:jani@apache.org] 
Sent: Thursday, January 8, 2015 08:12
To: dev@corinthia.incubator.apache.org
Subject: Re: ODF filter

On 8 January 2015 at 16:59, Peter Kelly <pmkelly@apache.org> wrote:

[ ... ]

> As a general principle, no - a given filter is expected to handle
> arbitrary HTML.
>
> However, there is a function for “normalising” an HTML document to change
> nested sets of inline elements (span, b, i, etc.) into a flat sequence of
> runs (each represented as a span element). The Word filter uses this, due
> to Word’s flat model of inline runs.
>
> ODF text documents, on the other hand, *do* support nested formatting
> runs, so when writing this filter it may make sense not to apply the
> normalisation process used in the Word filter. This should be done if there
> is information that could not be represented in HTML and would be lost by
> flattening the structure like we do for Word.
>
> There have been a few times where the topic of what internal representation
> we should use has been raised - whether we should stick with HTML, come up
> with our own entirely different model, or something else. I personally
> think HTML is a good choice, but perhaps for those who have raised the
> issue of an alternate intermediate form, this might be a good time to start
> that discussion ;)
>
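
To illustrate the normalisation Peter describes above (nested inline
elements turned into a flat sequence of runs, each a span), here is a
minimal sketch in Python. It is not Corinthia's implementation: the names
RunFlattener, flatten and to_spans, the set of inline tags, and the use of
a class attribute to carry the formatting are all assumptions made for the
example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical set of inline formatting elements to flatten.
INLINE = {"span", "b", "i", "u", "em", "strong"}


class RunFlattener(HTMLParser):
    """Collects text as flat runs of (formatting, text)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # inline tags currently open, outermost first
        self.runs = []   # flat result: list of (tuple of tags, text)

    def handle_starttag(self, tag, attrs):
        if tag in INLINE:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open occurrence of this tag, if any.
        for idx in range(len(self.stack) - 1, -1, -1):
            if self.stack[idx] == tag:
                del self.stack[idx]
                break

    def handle_data(self, data):
        # A plain span carries no formatting of its own in this sketch.
        fmt = tuple(sorted(set(self.stack) - {"span"}))
        self.runs.append((fmt, data))


def flatten(html_fragment):
    """Return the flat list of runs for a fragment of inline HTML."""
    parser = RunFlattener()
    parser.feed(html_fragment)
    parser.close()
    return parser.runs


def to_spans(runs):
    """Serialise runs as sibling span elements (one span per run)."""
    out = []
    for fmt, text in runs:
        attr = ' class="%s"' % " ".join(fmt) if fmt else ""
        out.append("<span%s>%s</span>" % (attr, escape(text)))
    return "".join(out)


print(to_spans(flatten("a<b>b<i>c</i>d</b>e")))
```

The nesting of b and i is gone after flattening; each run only records
which formats apply to it. That is the kind of structural information a
filter for a format with nested runs might prefer to keep.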

Point taken; I am, I assume, the first who questioned it. But just to be
precise, I am happy having HTML as the internal structure, but I am unhappy
that filters can do what they like with the HTML. My goal is to define a
set of access functions that filters should use to navigate/insert/delete
tags and restrictions on what can be put in the tags. Just imagine one filter
needs to id some tags, therefore uses id=, another filter needs to name
some tags, therefore uses name=. If we are not careful here it will explode
and reading HTML becomes nearly as complicated as reading the formats
directly. We should have 1 and only 1 HTML definition, which the filters
can use.

rgds
jan I.
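
The idea in Jan's message (one shared set of access functions plus
restrictions on what may go into the tags) could look roughly like the
sketch below. Everything in it is hypothetical: the function names, the
allowed tag and attribute sets, and the choice of id= over name= are
assumptions for illustration, not a Corinthia API.

```python
import xml.etree.ElementTree as ET

# Hypothetical restrictions: the single HTML definition all filters share.
ALLOWED_TAGS = {"body", "p", "span", "h1", "h2", "table", "tr", "td"}
ALLOWED_ATTRS = {"id", "class", "style"}  # "name" is deliberately excluded


class DocumentError(ValueError):
    """Raised when a filter tries to step outside the shared definition."""


def insert_element(parent, tag, attrs=None, text=None):
    """Append a child element, enforcing the tag/attribute restrictions."""
    attrs = dict(attrs or {})
    if tag not in ALLOWED_TAGS:
        raise DocumentError("tag not allowed: %s" % tag)
    bad = set(attrs) - ALLOWED_ATTRS
    if bad:
        raise DocumentError("attributes not allowed: %s" % sorted(bad))
    element = ET.SubElement(parent, tag, attrs)
    element.text = text
    return element


def find_by_id(root, ident):
    """The one agreed way to look up a labelled element."""
    for element in root.iter():
        if element.get("id") == ident:
            return element
    return None


def delete_element(root, element):
    """Remove element from wherever it sits below root; True if found."""
    for parent in root.iter():
        if element in list(parent):
            parent.remove(element)
            return True
    return False


body = ET.Element("body")
intro = insert_element(body, "p", {"id": "intro"}, "Hello")
print(find_by_id(body, "intro") is intro)
```

With such a layer, a filter that tried to label a tag with name= would
fail immediately instead of producing HTML that other filters cannot read.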

<orcmid>
  I'm not following this well.  
  Let me ask it this way: Are we talking about fixing some sort of DOM over
  the HTML5 or are we allowing arbitrary HTML5 and transforming to and from
  it? 
 
  I am having trouble visualizing this process -- is the intermediate
  concrete HTML and not some DOM view?

  This relates to how inter-conversion is to be tested.  Is there some 
  abstraction against which document features are assessed and mapped
  through or are we working concrete level to/from concrete level and
  that is essentially it?

  Help me calibrate my understanding of the thrust.
</orcmid>


