corinthia-dev mailing list archives

From "jan iversen (JIRA)" <>
Subject [jira] [Updated] (COR-20) Write an XML/HTML parser
Date Sun, 18 Jan 2015 09:30:34 GMT


jan iversen updated COR-20:
    Assignee: Peter Kelly

> Write an XML/HTML parser
> ------------------------
>                 Key: COR-20
>                 URL:
>             Project: Corinthia
>          Issue Type: Improvement
>          Components: DocFormats - core, DocFormats - platform
>            Reporter: Peter Kelly
>            Assignee: Peter Kelly
>             Fix For: 0.5
> Currently we rely on libxml2 and HTML Tidy for parsing XML and HTML, respectively. In
both cases we use only the libraries' parsing functions, not their other features such as
the DOM tree.
> Parsing XML is not very difficult to do. HTML is slightly harder, because of all the ambiguities
that arise from the poorly defined parsing rules in earlier versions of the spec ("make a
best effort" became "replicate what Internet Explorer does" because almost every site violated
the rules). However, the HTML5 spec now defines a proper parsing algorithm that deals with
said ambiguities. We'll also need to take into account which tags must have
a corresponding close tag and which tags do not require one.
> Having our own parser will simplify dependencies a lot, particularly by removing the somewhat
awkward HTML Tidy.
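
As an illustration of the close-tag point above, here is a minimal sketch in C of how a parser might decide whether a tag needs a close tag. The function names (is_void_element, names_equal) are hypothetical and not part of DocFormats. The table follows the void-element list of the W3C HTML5 Recommendation; later revisions of the HTML standard have dropped some of these entries. Elements whose close tag is merely optional (such as p or li) would need separate handling.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Void elements per the W3C HTML5 Recommendation: these never take a
 * close tag. Hypothetical table, not taken from the DocFormats source. */
static const char *const void_elements[] = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr"
};

/* Case-insensitive ASCII comparison; HTML tag names are not case-sensitive. */
static bool names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == '\0' && *b == '\0';
}

/* Returns true if the named element must not have a close tag. */
bool is_void_element(const char *name)
{
    size_t count = sizeof(void_elements) / sizeof(void_elements[0]);
    for (size_t i = 0; i < count; i++) {
        if (names_equal(name, void_elements[i]))
            return true;
    }
    return false;
}
```

A tree builder could call is_void_element(tag) after reading a start tag and, when it returns true, skip pushing the element onto the stack of open elements.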

This message was sent by Atlassian JIRA
