corinthia-dev mailing list archives

From Peter Kelly <pmke...@apache.org>
Subject Re: html ids
Date Wed, 17 Jun 2015 19:53:58 GMT
> On 17 Jun 2015, at 8:09 pm, Ian C <ian@amham.net> wrote:
> 
> Hi Peter,
> 
> when the Word converter creates an html element via the
> WordConverterCreateAbstract function it creates an associated id attribute.
> 
> Having examined the resulting html I see each element does have an id.
> 
> Are these necessary and if so when and where? I'm guessing some sort of
> lookup function somewhere?

The id attributes are used for two purposes:

1. To enable elements in an updated version of the document to be correlated with the elements
from the original version
2. As a target for cross-references to figures, tables and headings.

The first one is the most important, since it applies to all elements, instead of only those
that are targets of cross-references.

The number included in the id attribute is the “sequence number” of the node in the document
(the seqNo field of DFNode). During parsing, these are assigned sequentially, starting from
0; as a result, sequence numbers in a document immediately after parsing are in
the same order as the nodes appear in the originating XML file.

This ordering does not really matter as such, but the consistency does - two parses of the
same XML file are guaranteed to produce the same sequence numbers. The update process (HTML
-> docx) relies on this guarantee, since it re-parses the docx file from which the HTML
was generated, and assumes that the ids in the HTML match up with the sequence numbers obtained
from the parse.

When new nodes are added to a document after parsing, they are assigned new sequence numbers
consecutively, starting with the first number after what has been assigned so far.
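To make the numbering scheme concrete, here is a minimal, self-contained sketch. It is not the DFNode/DFDocument code; SketchDoc, SketchNode, sketch_doc_add and sketch_parse are hypothetical names, and the use of unsigned int for the sequence number is an assumption. It only shows a per-document counter that starts at 0 during a parse and keeps counting for nodes added afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 64

/* Hypothetical stand-in for a document node; only the seqNo idea is
   taken from the discussion above. */
typedef struct {
    unsigned int seqNo;   /* plays the role of DFNode's seqNo field */
    const char *name;
} SketchNode;

/* Hypothetical stand-in for a document that owns its nodes. */
typedef struct {
    SketchNode nodes[SKETCH_MAX_NODES];
    size_t count;
    unsigned int nextSeqNo;   /* next sequence number to hand out */
} SketchDoc;

/* Add a node and give it the next unused sequence number. */
static SketchNode *sketch_doc_add(SketchDoc *doc, const char *name)
{
    assert(doc->count < SKETCH_MAX_NODES);
    SketchNode *node = &doc->nodes[doc->count++];
    node->seqNo = doc->nextSeqNo++;
    node->name = name;
    return node;
}

/* "Parse" a list of element names in document order. Numbering always
   restarts at 0, so the same input yields the same sequence numbers. */
static void sketch_parse(SketchDoc *doc, const char *const *names, size_t n)
{
    doc->count = 0;
    doc->nextSeqNo = 0;
    for (size_t i = 0; i < n; i++)
        sketch_doc_add(doc, names[i]);
}
```

Calling sketch_parse() twice on the same list gives identical numbers in both documents, and a later sketch_doc_add() on a document parsed from three names receives sequence number 3.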

DFDocument maintains a mapping from id attributes to Nodes. So if you have a node in the document.xml
file, say, and you want to find the corresponding HTML element (if it exists), then you construct
a string with the id prefix and the sequence number, and then do a lookup in the nodesByIdAttr
hash table of the DFDocument object. There is a convenience function that does this, called
DFElementForIdAttr(). This function is used in WordBookmarks and WordFields for dealing with
cross-references.
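As a rough illustration of the "prefix plus sequence number" lookup, the sketch below builds the id string and searches a small table. Everything in it is an assumption made for illustration: the "word" prefix, the SketchElement type and both function names are hypothetical, and a linear search stands in for the nodesByIdAttr hash table. It does not reproduce DFElementForIdAttr().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an HTML element carrying an id attribute. */
typedef struct {
    char idAttr[32];
    const char *tag;
} SketchElement;

/* Build "<prefix><seqNo>" into buf. Returns 0 on success, -1 if the
   result does not fit in the buffer. */
static int sketch_make_id(char *buf, size_t size,
                          const char *prefix, unsigned int seqNo)
{
    int n = snprintf(buf, size, "%s%u", prefix, seqNo);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Find the element whose id attribute equals idAttr, or NULL if there
   is none. A real implementation would use a hash table; the linear
   scan keeps this sketch short. */
static SketchElement *sketch_element_for_id(SketchElement *elems,
                                            size_t count,
                                            const char *idAttr)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(elems[i].idAttr, idAttr) == 0)
            return &elems[i];
    }
    return NULL;
}
```

With a table holding elements whose ids are "word5" and "word9", building the id for sequence number 9 and looking it up returns the second element; sequence number 7 returns NULL because no HTML element was generated for it.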

WordConverterCreateAbstract() is used for creating an HTML element in the ‘get’ operation.
It sets the id attribute based on the prefix used during conversion, and the sequence number
of the supplied concrete element. This sets up the relationship, which is subsequently used
in the ‘put’ operation.

WordConverterGetConcrete() does the reverse. It takes as input an HTML element from the abstract
document, and checks to see if it has an id attribute. If so, it extracts the sequence number
from the attribute, and uses that to locate the concrete element (typically in document.xml)
from which that HTML element was originally derived. 

Once it has determined the sequence number, WordConverterGetConcrete() calls DFNodeForSeqNo(),
which uses a hash table maintained by the document to map sequence numbers to nodes. The result
may be NULL, indicating that there is no such node in the document, though in general that’s
unlikely.
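The reverse direction can be sketched in the same spirit: strip the prefix, parse the number, then look the node up and accept that the answer may be NULL. This is only a sketch under stated assumptions. ConcreteNode and the three sketch_ functions are hypothetical, the prefix is passed in by the caller, and a linear search again stands in for the hash table behind DFNodeForSeqNo().

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a node in the concrete document. */
typedef struct {
    unsigned int seqNo;
    const char *tag;
} ConcreteNode;

/* Extract the sequence number from an id of the form "<prefix><digits>".
   Returns 1 and stores the number in *out on success, 0 otherwise. */
static int sketch_seq_no_from_id(const char *idAttr, const char *prefix,
                                 unsigned int *out)
{
    size_t plen = strlen(prefix);
    if (idAttr == NULL || strncmp(idAttr, prefix, plen) != 0)
        return 0;
    const char *digits = idAttr + plen;
    if (!isdigit((unsigned char)digits[0]))
        return 0;   /* rejects empty suffix, sign, and whitespace */
    errno = 0;
    char *end = NULL;
    unsigned long value = strtoul(digits, &end, 10);
    if (errno != 0 || *end != '\0' || value > UINT_MAX)
        return 0;
    *out = (unsigned int)value;
    return 1;
}

/* Map a sequence number to a node, or NULL if no node has it. */
static ConcreteNode *sketch_node_for_seq_no(ConcreteNode *nodes,
                                            size_t count,
                                            unsigned int seqNo)
{
    for (size_t i = 0; i < count; i++) {
        if (nodes[i].seqNo == seqNo)
            return &nodes[i];
    }
    return NULL;
}

/* Combine both steps: id attribute in, concrete node (or NULL) out. */
static ConcreteNode *sketch_get_concrete(ConcreteNode *nodes, size_t count,
                                         const char *idAttr,
                                         const char *prefix)
{
    unsigned int seqNo;
    if (!sketch_seq_no_from_id(idAttr, prefix, &seqNo))
        return NULL;
    return sketch_node_for_seq_no(nodes, count, seqNo);
}
```

An element with no id attribute, an id with a different prefix, or an id whose number matches no concrete node all end up as NULL here, which is how the sketch treats an HTML element that has no concrete counterpart.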

The most important use of WordConverterGetConcrete() is in WordContainerPut(), which is a
wrapper around BDTContainerPut. The BDTContainerPut function is what handles the re-ordering
of nodes (e.g. if a paragraph was moved to a different part of the HTML document, we move
its counterpart in document.xml, retaining all supported and unsupported properties, e.g.
certain formatting options that can’t be expressed in HTML).
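The re-ordering idea can be illustrated with a toy example. This is not how BDTContainerPut is implemented; ConcreteChild and sketch_reorder are hypothetical, and leaving unmentioned children after the placed ones is an arbitrary choice made for this sketch. The point is only that existing concrete nodes are moved as whole units, so whatever they carry that HTML cannot express travels with them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical concrete child; hiddenProps stands in for properties
   that have no HTML representation. */
typedef struct {
    unsigned int seqNo;
    const char *hiddenProps;
} ConcreteChild;

/* Reorder children in place so that they follow `order`, the list of
   sequence numbers in the order the HTML elements now appear.
   Sequence numbers with no matching child are skipped. Children not
   mentioned in `order` keep their relative order after the placed ones.
   Returns the number of children that were placed. */
static size_t sketch_reorder(ConcreteChild *children, size_t count,
                             const unsigned int *order, size_t orderCount)
{
    size_t placed = 0;
    for (size_t i = 0; i < orderCount && placed < count; i++) {
        for (size_t j = placed; j < count; j++) {
            if (children[j].seqNo != order[i])
                continue;
            ConcreteChild found = children[j];
            /* Shift the not-yet-placed children before j right by one. */
            for (size_t k = j; k > placed; k--)
                children[k] = children[k - 1];
            children[placed++] = found;
            break;
        }
    }
    return placed;
}
```

For children with sequence numbers 10, 11, 12 and an HTML order of 12, 10, 11, the array ends up as 12, 10, 11, and each entry still has the hiddenProps value it started with.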

Hope this clears things up a little bit… let me know if you need me to clarify anything
further.

And yes, I believe we’ll need the same thing for ODF, in order to properly handle bidirectional
transformation, which allows us to preserve aspects of the ODF document that we don’t yet
(or can’t) express in HTML. Perhaps this can be abstracted in a generic manner so that it
can be used by both filters (and others in the future).

--
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

