From: Peter Kelly
To: dev@corinthia.incubator.apache.org
Subject: Re: ODF filter
Date: Thu, 8 Jan 2015 23:18:30 +0700

> On 8 Jan 2015, at 10:59 pm, Peter Kelly wrote:
>
>> On 8 Jan 2015, at 10:16 am, Dave Fisher wrote:
>>
>> Hi Peter,
>>
>> This is a helpful email from your concrete discussion I can better understand the mapping between the abstract / HTML model and the concrete / DOCX, ODT.
>>
>> You mention differences in the style runs for Word and ODT of which I am familiar from the OOXML side. Does the abstract model / HTML take a particular approach towards style runs? Is there a concrete version of the HTML model? Is there a specification or plan for the abstract model?
>
> As a general principle, no - a given filter is expected to handle arbitrary HTML.
>
> However, there is a function for "normalising" a HTML document to change nested sets of inline elements (span, b, i, etc.) into a flat sequence of runs (each represented as a span element). The Word filter uses this, due to Word's flat model of inline runs.

Just thought I'd add a bit more detail on this, for anyone interested in exploring the implementation:

For .docx files, DFPut (api/src/Operations.c) calls WordPut (filters/ooxml/src/word/Word.c), which in turn creates a WordPackage object and then calls WordPackageUpdateFromHTML (filters/ooxml/src/word/WordPackage.c). The very first thing this does is to call HTML_normalizeDocument and HTML_pushDownInlineProperties (both in core/src/html/DFHTMLNormalization.c).

HTML_normalizeDocument merges adjacent text nodes (which in theory shouldn't be necessary, but I found that sometimes libxml's parser produces two or more in a row), and then goes through all the block-level elements, flattening any inline elements such that the resulting block node contains a series of spans, each with a style attribute set with the appropriate CSS formatting properties. For example, if you start with this:

Here is some text

then you'll end up with this:

Here is some text
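
(The markup in the two examples above did not survive archiving.) As an illustration only, here is a minimal Python sketch of the flattening idea described above. It is not the C code in DFHTMLNormalization.c. The sample input, the INLINE_CSS table, and the names parse_style, format_style, collect_runs and flatten_block are all assumptions made for this sketch, as is the choice to leave the style attribute off runs that have no formatting.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from inline tags to CSS properties (illustrative, not Corinthia's).
INLINE_CSS = {
    "b": {"font-weight": "bold"},
    "i": {"font-style": "italic"},
    "u": {"text-decoration": "underline"},
}


def parse_style(value):
    """Parse 'a: b; c: d' into a dict; tolerate a missing attribute."""
    props = {}
    for decl in (value or "").split(";"):
        if ":" in decl:
            name, val = decl.split(":", 1)
            props[name.strip()] = val.strip()
    return props


def format_style(props):
    """Serialise a property dict in sorted order so output is deterministic."""
    return "; ".join(f"{k}: {v}" for k, v in sorted(props.items()))


def collect_runs(node, inherited, runs):
    """Append (props, text) for every piece of text under node, in document order."""
    props = dict(inherited)
    props.update(INLINE_CSS.get(node.tag, {}))
    props.update(parse_style(node.get("style")))
    if node.text:
        runs.append((props, node.text))
    for child in node:
        collect_runs(child, props, runs)
        if child.tail:
            # Text after a child belongs to this node, so it gets this node's props.
            runs.append((props, child.tail))


def flatten_block(block):
    """Replace nested inline children of a block with a flat sequence of spans."""
    runs = []
    if block.text:
        runs.append(({}, block.text))
    for child in block:
        collect_runs(child, {}, runs)
        if child.tail:
            runs.append(({}, child.tail))
    block.text = None
    for child in list(block):
        block.remove(child)
    for props, text in runs:
        span = ET.SubElement(block, "span")
        if props:
            span.set("style", format_style(props))
        span.text = text
    return block


# Hypothetical input, not the original example from this message.
p = flatten_block(ET.fromstring("<p>Here <b>is <i>some</i></b> text</p>"))
print(ET.tostring(p, encoding="unicode"))
```

Each piece of text ends up in exactly one span, and a span nested inside b and i carries both properties, which is the shape a flat run model like Word's needs.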

HTML_pushDownInlineProperties checks block elements for any CSS properties that can be applied to inline formatting (such as font family, font size, text color) and moves them to the style attributes of the span elements within the block element. For example, the following:

Some text

would become this:

Some text
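
(Again, the markup in the two examples above was lost in archiving.) A rough Python sketch of the push-down step, under stated assumptions: the INLINE_PROPS set, the sample markup, and the function names are invented for illustration and are not taken from DFHTMLNormalization.c. The sketch also assumes the block has already been flattened into spans, and that a span's own setting wins over one pushed down from the block.

```python
import xml.etree.ElementTree as ET

# Assumed set of properties treated as inline formatting (illustrative only).
INLINE_PROPS = {"font-family", "font-size", "color", "font-weight", "font-style"}


def parse_style(value):
    """Parse 'a: b; c: d' into a dict; tolerate a missing attribute."""
    props = {}
    for decl in (value or "").split(";"):
        if ":" in decl:
            name, val = decl.split(":", 1)
            props[name.strip()] = val.strip()
    return props


def format_style(props):
    """Serialise a property dict in sorted order so output is deterministic."""
    return "; ".join(f"{k}: {v}" for k, v in sorted(props.items()))


def push_down_inline_properties(block):
    """Move inline-capable CSS properties from a block onto its span children."""
    block_props = parse_style(block.get("style"))
    inline = {k: v for k, v in block_props.items() if k in INLINE_PROPS}
    if not inline:
        return block  # nothing to move
    remaining = {k: v for k, v in block_props.items() if k not in INLINE_PROPS}
    if remaining:
        block.set("style", format_style(remaining))
    else:
        del block.attrib["style"]
    for span in block.iter("span"):
        props = dict(inline)
        props.update(parse_style(span.get("style")))  # the span's own value wins
        span.set("style", format_style(props))
    return block


# Hypothetical input, not the original example from this message.
src = ('<p style="color: red; text-align: center">'
       '<span>Some </span><span style="color: blue">text</span></p>')
p = push_down_inline_properties(ET.fromstring(src))
print(ET.tostring(p, encoding="unicode"))
```

After this step the block keeps only paragraph-level properties (text-align here), while every run carries its complete inline formatting.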

Both of these are pre-processing stages that happen before the primary traversal of the document tree begins, and the later code in the Word filter expects the HTML document to conform to this more restrictive "dialect". In the case of the inline properties, it's because these settings have to go on the rPr elements in a Word document, and are not allowed on the pPr elements (that is, Word is stricter about which formatting properties can be set where; HTML allows you to set "inline" formatting properties on any element using a style attribute). So this pre-processing is largely to match the needs of the Word filter, but it's likely that an ODF text document filter will need some pre-processing as well.

As we add more formats, I expect we'll discover some common places where the HTML input needs to be normalised to a certain form, and also places where it is better to leave it as-is. The ability to have nested inline elements in ODF is an example of the latter; we can probably avoid HTML_normalizeDocument in that case by having a direct relationship between HTML inline elements and ODF text-span elements. Depending on the situation, retaining such structure may be important - but that's something I expect we'll discover as we proceed with implementation.

--
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)