shindig-dev mailing list archives

From Dan Shepherd <dan.x.sheph...@googlemail.com>
Subject Re: Caja-based HtmlParser and Parser Overhaul (issue157161)
Date Sat, 05 Dec 2009 01:17:24 GMT
Ha ha - good point! I thought about going out but it's raining, so I was
dancing around on my own with mobile phone headphones on when the phone pinged
with notification of your little shindig.  How is it progressing anyway?
I have not looked at it in six months or more.

On 5 Dec 2009 01:08, "John Hjelmstad" <fargo@google.com> wrote:

I'm confused. Gate crashing typically means the party's worth going to! :)

So as to prevent this topic from veering too off course, I proffer the
following overview of the CL for anyone interested to review. I understand
Paul and Louis are on board. All comments welcome.

The changes are arrayed as follows:
1. Refactoring. Previously a large amount of generic (i.e. should apply to
any parser) HTML parser testing was stuck in the nekohtml-package tests. To
the maximum extent possible, this has been pulled into parse-package classes:
 - AbstractParsingTestBase includes helper methods for any parsing- or
serialization-based test. Pulled from AbstractParserAndSerializerTest.
 - AbstractParserAndSerializerTest contains several "common"
parse/serialize tests that apply no matter the concrete impl.
 - AbstractSocialMarkupHtmlParserTest pulls the social-markup test from
Neko into a base class.
 - "Actual" tests are trivial subclasses of the abstract tests, providing a
GadgetHtmlParser instance.
 - Tests converted to JUnit 4 as a side note.
Subtleties:
 - Neko-based tests override a few base parse/serialize tests due to Neko
oddities. All test files have been moved to base or nekohtml subdir to
follow suit.
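
For anyone who wants the test layout in miniature, here is a rough sketch of
the pattern described above: shared checks in an abstract base, "actual" tests
as trivial subclasses that only supply a parser, and one subclass overriding a
check because of a parser oddity. Everything in it is hypothetical and made up
for illustration (MarkupParser stands in for GadgetHtmlParser, and the two toy
parsers are not Neko or Caja). JUnit annotations are left out so it compiles
with nothing but the JDK.

```java
import java.util.Locale;

public class ParserTestLayoutSketch {

    /** Hypothetical stand-in for GadgetHtmlParser; not the real Shindig API. */
    interface MarkupParser {
        String parseAndSerialize(String source);
    }

    /** Shared checks live here and never name a concrete parser. */
    abstract static class AbstractRoundTripCheck {
        protected abstract MarkupParser makeParser();

        /** Common check: simple markup survives a parse/serialize round trip. */
        boolean roundTripsSimpleMarkup() {
            String source = "<b>hi</b>";
            return source.equals(makeParser().parseAndSerialize(source));
        }
    }

    /** A "real" test is a trivial subclass that only supplies the parser. */
    static class IdentityParserCheck extends AbstractRoundTripCheck {
        @Override
        protected MarkupParser makeParser() {
            return source -> source;
        }
    }

    /** A parser with an oddity (it upper-cases output) overrides the shared check. */
    static class ShoutingParserCheck extends AbstractRoundTripCheck {
        @Override
        protected MarkupParser makeParser() {
            return source -> source.toUpperCase(Locale.ROOT);
        }

        @Override
        boolean roundTripsSimpleMarkup() {
            String source = "<b>hi</b>";
            // Case-insensitive comparison tolerates this parser's quirk.
            return source.equalsIgnoreCase(makeParser().parseAndSerialize(source));
        }
    }
}
```

In a JUnit 4 version the boolean methods would become @Test methods with
assertions, and the subclasses would look the same otherwise.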

2. GadgetHtmlParser normalization implemented.
 - GadgetHtmlParser.normalizeFragment() removed - logic now inlined into
parseDom().
   + Rationale: IMO (open to discussion) the abstract parseDomImpl() API is
unnecessary/does too much. Pretty much all gadget HTML is treated as tag
soup and cleaned up. Having a base method whose contract is to give back
unmodified tag soup thus seems right to me, with a single implementation of
the normalization logic.
 - GadgetHtmlParser.parseDom() implements a large chunk of document
normalization logic. It takes tag soup as input and returns a valid HTML
document with a single top-level HTML element, in turn with two children:
head and body.
   + Multiple <head> nodes consolidated together. Likewise body.
   + Elements above first <head> -> end up in head.
   + Elements above first <body> -> end up in body.
   + Elements after <body> -> end up in body unless inside a <head> node.
   + <style> nodes pulled to <head> in relative order - only HTML-compliant
place for them, and no possibility that there will be conflicts (no
displayable elements in <head>).
 - OpenSocial template parsing MAY be done as a post-processing pass on
<script> nodes. Text found therein is treated as OS (X|HT)ML.
Subtleties:
 - Lots. @see parseDom() impl especially.
 - NekoSimplifiedHtmlParser still impl's separate logic for parseDomImpl
and parseFragmentImpl. I didn't dive into the difference and whether we
could actually get rid of parseDomImpl in this round.
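
To make the normalization rules above concrete, here is a minimal sketch of
them using only the JDK's org.w3c.dom classes. It is not the parseDom() code
from the CL. Assumptions on my part: the tag soup arrives as the children of
a synthetic wrapper element (here parsed from well-formed XML, which real tag
soup is not), loose elements go to body when no <head> exists at all, and
<style> nodes pulled out of body land after whatever is already in head. The
class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NormalizeSketch {

    private static DocumentBuilder builder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Test helper: parses well-formed XML and returns its wrapper element. */
    static Element parseSoup(String xml) {
        try {
            return builder().parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Copies a live NodeList so nodes can be moved while iterating. */
    private static List<Node> snapshot(NodeList live) {
        List<Node> copy = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            copy.add(live.item(i));
        }
        return copy;
    }

    private static boolean isTag(Node n, String name) {
        return n.getNodeType() == Node.ELEMENT_NODE && name.equalsIgnoreCase(n.getNodeName());
    }

    /** Builds html > (head, body) from the children of soupRoot. */
    static Document normalize(Element soupRoot) {
        Document out = builder().newDocument();
        Element html = out.createElement("html");
        Element head = out.createElement("head");
        Element body = out.createElement("body");
        out.appendChild(html);
        html.appendChild(head);
        html.appendChild(body);

        List<Node> top = snapshot(soupRoot.getChildNodes());
        boolean hasHead = top.stream().anyMatch(n -> isTag(n, "head"));
        // Loose nodes above the first <head> go to head; later ones go to body.
        Element target = hasHead ? head : body;
        for (Node n : top) {
            if (isTag(n, "head") || isTag(n, "body")) {
                // Consolidate every <head> into head and every <body> into body.
                Element dest = isTag(n, "head") ? head : body;
                for (Node child : snapshot(n.getChildNodes())) {
                    dest.appendChild(out.importNode(child, true));
                }
                target = body;
            } else {
                target.appendChild(out.importNode(n, true));
            }
        }
        // Pull <style> nodes out of body into head, keeping document order.
        for (Node style : snapshot(body.getElementsByTagName("style"))) {
            head.appendChild(style);
        }
        return out;
    }

    /** Comma-separated names of the element children, for inspection. */
    static String childNames(Element parent) {
        StringBuilder sb = new StringBuilder();
        for (Node n : snapshot(parent.getChildNodes())) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(n.getNodeName());
        }
        return sb.toString();
    }
}
```

Feeding it a wrapper containing a loose <meta/>, a <head>, a loose <b/>, a
<body> holding a <style/>, a second <head>, a second <body> and a trailing
<i/> gives one head and one body, with the style moved into head. The real
implementation handles far more cases than this.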

3. CajaHtmlParser implementation.
 - Depends on Caja r3889 (pom.xml updated to reflect this).
 - Unfortunately, parseDomImpl() does top-level <html> node synthesis to
ensure document.getDocumentElement() returns it. This is for
NekoSimplified/Caja dual compatibility w/ GadgetHtmlParser base logic. As
noted, I'd prefer to move this synthesis code into
GadgetHtmlParser.parseDom() if possible.
 - Pretty straightforward past that. Defers to Caja's parser for fragment
processing. That's about it.
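
The top-level <html> synthesis mentioned above amounts to something like the
following. This is a hand-written illustration against the plain org.w3c.dom
API, not the CajaHtmlParser code and not a Caja call; HtmlRootSketch,
ensureHtmlRoot and newDocumentWithRoot are names invented for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class HtmlRootSketch {

    /** Test helper: creates a document whose root element has the given tag. */
    static Document newDocumentWithRoot(String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement(tag));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Wraps the root in a synthesized html element unless it already is one. */
    static Document ensureHtmlRoot(Document doc) {
        Element root = doc.getDocumentElement();
        if (root != null && "html".equalsIgnoreCase(root.getNodeName())) {
            return doc; // already has the expected top-level node
        }
        Element html = doc.createElement("html");
        if (root != null) {
            // A Document holds a single element child, so detach the old root first.
            doc.removeChild(root);
            html.appendChild(root);
        }
        doc.appendChild(html);
        return doc;
    }
}
```

After the call, document.getDocumentElement() returns an html element whether
or not the parser produced one, which is the property the base logic relies on.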

Misc: setValijaMode(true) removed from CajaContentRewriter, since it's now
default in the relevant Caja version.

-j-

On Fri, Dec 4, 2009 at 4:41 PM, Dan Shepherd <dan.x.shepherd@googlemail.com> wrote:

> Indeed :) sorry for gate crashing!
>
> On 5 Dec 2009 00:35,...
