From: Franz de Copenhague
Subject: RE: html ids
Date: Wed, 17 Jun 2015 17:44:59 -0400

>-----Original Message-----
>From: Peter Kelly [mailto:pmkelly@apache.org]
>Sent: Wednesday, June 17, 2015 3:54 PM
>To: dev@corinthia.incubator.apache.org
>Subject: Re: html ids
>
>> On 17 Jun 2015, at 8:09 pm, Ian C wrote:
>>
>> Hi Peter,
>>
>> when the Word converter creates an html element via the
>> WordConverterCreateAbstract function it creates an associated id
>> attribute.
>>
>> Having examined the resulting html I see each element does have an id.
>>
>> Are these necessary and if so when and where? I'm guessing some sort
>> of lookup function somewhere?
>
>The id attributes are used for two purposes:
>
>1. To enable elements in an updated version of the document to be
>   correlated with the elements from the original version
>2. As a target for cross-references to figures, tables and headings.
>
>The first one is the most important, since it applies to all elements,
>instead of only those that are targets of cross-references.
>
>The number included in the id attribute is the “sequence number” of the
>node in the document (the seqNo field of DFNode). During parsing, these
>are assigned sequentially, starting from 0; as a result, sequence numbers
>in a document immediately after parsing are in the same order as the
>nodes appear in the originating XML file.
>
>This ordering does not really matter as such, but the consistency does:
>two parses of the same XML file are guaranteed to produce the same
>sequence numbers. The update process (HTML -> docx) relies on this
>guarantee, since it re-parses the docx file from which the HTML was
>generated, and assumes that the ids in the HTML match up with the
>sequence numbers obtained from the parse.
>
>When new nodes are added to a document after parsing, they are assigned
>new sequence numbers consecutively, starting with the first number after
>what has been assigned so far.
>
>DFDocument maintains a mapping from id attributes to nodes. So if you
>have a node in the document.xml file, say, and you want to find the
>corresponding HTML element (if it exists), then you construct a string
>with the id prefix and the sequence number, and then do a lookup in the
>nodesByIdAttr hash table of the DFDocument object. There is a
>convenience function that does this, called DFElementForIdAttr(). This
>function is used in WordBookmarks and WordFields for dealing with
>cross-references.
>
>WordConverterCreateAbstract() is used for creating an HTML element in
>the ‘get’ operation. It sets the id attribute based on the prefix used
>during conversion, and the sequence number of the supplied concrete
>element. This sets up the relationship, which is subsequently used in
>the ‘put’ operation.
>
>WordConverterGetConcrete() does the reverse. It takes as input an HTML
>element from the abstract document, and checks to see if it has an id
>attribute. If so, it extracts the sequence number from the attribute,
>and uses that to locate the concrete element (typically in document.xml)
>from which that HTML element was originally derived.
>
>Once it has determined the sequence number, WordConverterGetConcrete()
>calls DFNodeForSeqNo(), which uses a hash table maintained by the
>document to map sequence numbers to nodes.
>The result may be NULL, indicating that there is no such node in the
>document, though in general that’s unlikely.
>
>The most important use of WordConverterGetConcrete() is in
>WordContainerPut(), which is a wrapper around BDTContainerPut. The
>BDTContainerPut function is what handles the re-ordering of nodes (e.g.
>if a paragraph was moved to a different part of the HTML document, we
>move its counterpart in document.xml, retaining all supported and
>unsupported properties, e.g. certain formatting options that can’t be
>expressed in HTML).
>
>Hope this clears things up a little bit... let me know if you need me to
>clarify anything further.
>
>And yes, I believe we’ll need the same thing for ODF, in order to
>properly handle bidirectional transformation, which allows us to
>preserve aspects of the ODF document that we don’t yet (or can’t)
>express in HTML. Perhaps this can be abstracted in a generic manner so
>that it can be used by both filters (and others in the future).
>
>--
>Dr Peter M. Kelly
>pmkelly@apache.org
>
>PGP key: http://www.kellypmk.net/pgp-key
>(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

I think I commented on this previously: use a data-* attribute to
persist the DFNode sequence number, instead of the HTML id. Using the id
is a limitation for the client app, which then cannot use the HTML id
for its own purposes.

http://www.w3.org/TR/2011/WD-html5-20110525/elements.html#embedding-custom-non-visible-data-with-the-data-attributes

franz
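
For illustration, the round trip described in the quoted message (prefix
plus sequence number written on ‘get’, sequence number recovered from
the attribute on ‘put’) could be sketched in C roughly as follows. This
is only a sketch under assumptions: make_id_attr(), parse_id_attr() and
the "word" prefix used in the comments are hypothetical names, not
Corinthia functions; the real code goes through DFElementForIdAttr() and
DFNodeForSeqNo(). The same string could just as well be stored in a
data-* attribute rather than id.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: write "<prefix><seqNo>" (e.g. "word42") into buf.
 * Returns 0 on success, -1 if buf is too small to hold the result. */
static int make_id_attr(char *buf, size_t size, const char *prefix,
                        unsigned int seqNo)
{
    assert(buf != NULL && prefix != NULL);
    int n = snprintf(buf, size, "%s%u", prefix, seqNo);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Hypothetical helper: recover the sequence number from an id value.
 * Returns 0 and stores the number in *seqNo on success. Returns -1 if
 * the value does not start with prefix, or if the prefix is not followed
 * by decimal digits only (for example an id chosen by a client app). */
static int parse_id_attr(const char *id, const char *prefix,
                         unsigned int *seqNo)
{
    assert(id != NULL && prefix != NULL && seqNo != NULL);
    size_t plen = strlen(prefix);
    if (strncmp(id, prefix, plen) != 0)
        return -1;
    const char *digits = id + plen;
    if (*digits == '\0')
        return -1;
    for (const char *p = digits; *p != '\0'; p++) {
        if (*p < '0' || *p > '9')
            return -1;
    }
    errno = 0;
    unsigned long value = strtoul(digits, NULL, 10);
    if (errno == ERANGE || value > UINT_MAX)
        return -1;
    *seqNo = (unsigned int)value;
    return 0;
}
```

A caller would pass the parsed number to a sequence-number lookup and
treat both a parse failure and a missing node as "no concrete
counterpart", which matches the NULL case mentioned above.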