corinthia-dev mailing list archives

From Ian C <...@amham.net>
Subject Re: html ids
Date Thu, 18 Jun 2015 01:21:49 GMT
On Thu, Jun 18, 2015 at 3:53 AM, Peter Kelly <pmkelly@apache.org> wrote:

> > On 17 Jun 2015, at 8:09 pm, Ian C <ian@amham.net> wrote:
> >
> > Hi Peter,
> >
> > when the Word converter creates an html element via the
> > WordConverterCreateAbstract function it creates an associated id
> > attribute.
> >
> > Having examined the resulting html I see each element does have an id.
> >
> > Are these necessary and if so when and where? I'm guessing some sort of
> > lookup function somewhere?
>
> The id attributes are used for two purposes:
>
> 1. To enable elements in an updated version of the document to be
> correlated with the elements from the original version
> 2. As a target for cross-references to figures, tables and headings.
>
> The first one is the most important, since it applies to all elements,
> instead of only those that are targets of cross-references.
>
> The number included in the id attribute is the “sequence number” of the
> node in the document (the seqNo field of DFNode). During parsing, these are
> assigned sequentially, starting from 0; as a result, sequence numbers in a
> document immediately after parsing are in the same order as the nodes
> appear in the originating XML file.
>
> This ordering does not really matter as such, but the consistency does -
> two parses of the same XML file are guaranteed to produce the same sequence
> numbers. The update process (HTML -> docx) relies on this guarantee, since
> it re-parses the docx file from which the HTML was generated, and assumes
> that the ids in the HTML match up with the sequence numbers obtained from
> the parse.
>
> When new nodes are added to a document after parsing, they are assigned new
> sequence numbers consecutively, starting with the first number after what
> has been assigned so far.
>
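
A minimal sketch of the numbering scheme described above, assuming a simple per-document counter. SketchNode, SketchDocument and the two functions are hypothetical stand-ins, not the actual DFNode/DFDocument code; only the seqNo field name comes from the message.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for DFNode: only the seqNo field is modelled. */
typedef struct SketchNode {
    unsigned int seqNo;
} SketchNode;

/* Hypothetical stand-in for DFDocument: remembers the next unused number. */
typedef struct SketchDocument {
    unsigned int nextSeqNo;
} SketchDocument;

/* A freshly created document starts numbering at 0. */
static void sketchDocumentInit(SketchDocument *doc)
{
    assert(doc != NULL);
    doc->nextSeqNo = 0;
}

/* Create a node and give it the next sequence number. The same routine
 * serves nodes created during parsing and nodes added afterwards, so
 * later additions simply continue from where the parser stopped.
 * Returns NULL if memory allocation fails. */
static SketchNode *sketchNodeCreate(SketchDocument *doc)
{
    assert(doc != NULL);
    SketchNode *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    node->seqNo = doc->nextSeqNo++;
    return node;
}
```

Because the counter depends only on creation order, two documents built by creating nodes in the same order end up with identical numbers, which is the property the update process relies on.
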
> DFDocument maintains a mapping from id attributes to Nodes. So if you have
> a node in the document.xml file, say, and you want to find the
> corresponding HTML element (if it exists), then you construct a string with
> the id prefix and the sequence number, and then do a lookup in the
> nodesByIdAttr hash table of the DFDocument object. There is a convenience
> function that does this, called DFElementForIdAttr(). This function is used
> in WordBookmarks and WordFields for dealing with cross-references.
>
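
The lookup described above might be sketched as follows. This is an illustration only: the "word" prefix used in the usage note is an assumed example value, the sketch* names are hypothetical, and a linear scan stands in for the nodesByIdAttr hash table. It is not the real DFElementForIdAttr() implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build an id attribute value as prefix followed by the decimal
 * sequence number. Returns 0 on success, -1 if the buffer is too small. */
static int sketchFormatId(char *buf, size_t size,
                          const char *prefix, unsigned int seqNo)
{
    assert(buf != NULL && prefix != NULL);
    int n = snprintf(buf, size, "%s%u", prefix, seqNo);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Hypothetical, much-reduced view of an HTML element. */
typedef struct SketchElement {
    const char *idAttr;   /* value of the id attribute, or NULL */
    const char *tagName;
} SketchElement;

/* Find the element whose id attribute equals idAttr, or NULL if none.
 * A real document would use a hash table; a scan keeps the sketch short. */
static const SketchElement *sketchElementForIdAttr(const SketchElement *elems,
                                                   size_t count,
                                                   const char *idAttr)
{
    assert(idAttr != NULL);
    for (size_t i = 0; i < count; i++) {
        if (elems[i].idAttr != NULL && strcmp(elems[i].idAttr, idAttr) == 0)
            return &elems[i];
    }
    return NULL;
}
```

A caller would first format the id, for example sketchFormatId(buf, sizeof buf, "word", node->seqNo), and then pass the resulting string to sketchElementForIdAttr().
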
> WordConverterCreateAbstract() is used for creating a HTML element in the
> ‘get’ operation. It sets the id attribute based on the prefix used during
> conversion, and the sequence number of the supplied concrete element. This
> sets up the relationship, which is subsequently used in the ‘put’ operation.
>
> WordConverterGetConcrete() does the reverse. It takes as input a HTML
> element from the abstract document, and checks to see if it has an id
> attribute. If so, it extracts the sequence number from the attribute, and
> uses that to locate the concrete element (typically in document.xml) from
> which that HTML element was originally derived.
>
> Once it has determined the sequence number, WordConverterGetConcrete()
> calls DFNodeForSeqNo(), which uses a hash table maintained by the document
> to map sequence numbers to nodes. The result may be NULL, indicating that
> there is no such node in the document, though in general that’s unlikely.
>
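
The reverse direction, recovering a sequence number from an id attribute, could look roughly like this. It assumes the id is simply the prefix followed by decimal digits; sketchSeqNoFromId is a hypothetical helper, not a function from the Corinthia sources.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Extract the sequence number from an id attribute of the assumed form
 * prefix + decimal digits. Returns 1 and stores the number in *seqNo on
 * success; returns 0 if the attribute is missing or has another form. */
static int sketchSeqNoFromId(const char *idAttr, const char *prefix,
                             unsigned int *seqNo)
{
    assert(prefix != NULL && seqNo != NULL);
    if (idAttr == NULL)
        return 0;                       /* element has no id attribute */

    size_t prefixLen = strlen(prefix);
    if (strncmp(idAttr, prefix, prefixLen) != 0)
        return 0;                       /* id belongs to something else */

    const char *digits = idAttr + prefixLen;
    if (*digits < '0' || *digits > '9')
        return 0;                       /* rejects empty, signs, spaces */

    errno = 0;
    char *end = NULL;
    unsigned long value = strtoul(digits, &end, 10);
    if (errno != 0 || *end != '\0' || value > UINT_MAX)
        return 0;                       /* overflow or trailing junk */

    *seqNo = (unsigned int)value;
    return 1;
}
```

On success the number would be handed to a lookup such as DFNodeForSeqNo(); as noted above, the caller still has to handle a NULL result from that lookup.
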
> The most important use of WordConverterGetConcrete() is in
> WordContainerPut(), which is a wrapper around BDTContainerPut. The
> BDTContainerPut function is what handles the re-ordering of nodes (e.g. if
> a paragraph was moved to a different part of the HTML document, we move
> its counterpart in document.xml, retaining all supported and unsupported
> properties, e.g. certain formatting options that can’t be expressed in
> HTML).
>
> Hope this clears things up a little bit… let me know if you need me to
> clarify anything further.
>

Thanks Peter, I have been bashing the ODF Filter to use the lenses and came
across the use of IDs. Now I am clearer as to why they are there.
At the moment I simply have the framework and can generate an HTML document
that has an empty body element. Next I'll take the test headers document we
have and map its contents, then iteratively add more style features. (My tool
is useful for seeing what has been used and what is still to be processed.)


> And yes, I believe we’ll need the same thing for ODF, in order to properly
> handle bidirectional transformation, which allows us to preserve aspects of
> the ODF document that we don’t yet (or can’t) express in HTML. Perhaps this
> can be abstracted in a generic manner so that it can be used by both
> filters (and others in the future).
>
> —
> Dr Peter M. Kelly
> pmkelly@apache.org
>
> PGP key: http://www.kellypmk.net/pgp-key
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
>
>


-- 
Cheers,

Ian C
