corinthia-dev mailing list archives

From Peter Kelly <pmke...@apache.org>
Subject Tags and names (Branch "odf-filter-attempt2" review)
Date Tue, 12 May 2015 06:34:43 GMT
The translateXMLEnumName array is unnecessary; there’s already a similar array in DFXMLNames.c.
If you just want to get the name of a node (without concern to the namespace), you can use
DFNodeName.

More generally, I should explain what the tags are for and how they’re used. With namespaces
in XML, the XML file defines a set of prefix/URI mappings either at the top of the file or
on individual elements. This is a real pain to deal with because different prefixes can refer
to different URIs throughout the document, and what an application is almost always interested
in is the (namespace URI, localName) pair; the prefix is purely a convenience mechanism for
human readers of the XML data and to reduce file size.

So I came up with the Tag mechanism in which each (namespace URI, localName) pair has a unique
integer value, and that gets recorded for each node regardless of the prefix in use. This
is done in the parsing code in DFXML.c, and makes use of a DFNameMap structure to keep track
of the associations between:

1) namespace URIs (strings) and namespace IDs (unsigned 32-bit integers), and
2) local names (strings) and tags (unsigned 32-bit integers)

All the namespace IDs and tags we’re likely to use are hard-coded in DFXMLNamespaces.h and
DFXMLNames.h, so we can refer to them from code and use them in switch statements (which
only allow constants for the cases). However, if any other namespaces or local names are found
during parsing, then the parser will generate new, previously unallocated integer values for them.
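
To make the allocation scheme concrete, here is a minimal sketch of a name map that hands out integer IDs. It is not the real DFNameMap: every identifier below (ToyNameMap, toyMapTagFor, the TOY_ constants) is hypothetical, the table is fixed-size, and the map stores the caller's string pointers rather than copying them. It only illustrates the idea that known names get compile-time constants while unknown ones get the next free integer.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for DFNameMap; not the DocFormats API. */
enum { TOY_NS_HTML = 0, TOY_NS_PREDEFINED_COUNT };
enum { TOY_HTML_H1 = 0, TOY_HTML_P, TOY_TAG_PREDEFINED_COUNT };

#define TOY_MAX 64
#define TOY_INVALID ((unsigned int)-1)

typedef struct {
    unsigned int namespaceID;
    const char *localName;
} ToyTagDecl;

typedef struct {
    const char *uris[TOY_MAX];   /* index = namespace ID */
    unsigned int nsCount;
    ToyTagDecl tags[TOY_MAX];    /* index = tag */
    unsigned int tagCount;
} ToyNameMap;

/* Preload the hard-coded names so their IDs match the enum constants. */
static void toyMapInit(ToyNameMap *map)
{
    memset(map, 0, sizeof(*map));
    map->uris[TOY_NS_HTML] = "http://www.w3.org/1999/xhtml";
    map->nsCount = TOY_NS_PREDEFINED_COUNT;
    map->tags[TOY_HTML_H1] = (ToyTagDecl){ TOY_NS_HTML, "h1" };
    map->tags[TOY_HTML_P] = (ToyTagDecl){ TOY_NS_HTML, "p" };
    map->tagCount = TOY_TAG_PREDEFINED_COUNT;
}

/* Return the ID for a namespace URI, allocating the next free ID if it
   has not been seen before. The caller must keep the string alive. */
static unsigned int toyMapNamespaceFor(ToyNameMap *map, const char *uri)
{
    for (unsigned int i = 0; i < map->nsCount; i++) {
        if (strcmp(map->uris[i], uri) == 0)
            return i;
    }
    if (map->nsCount >= TOY_MAX)
        return TOY_INVALID;
    map->uris[map->nsCount] = uri;
    return map->nsCount++;
}

/* Return the tag for a (namespace URI, localName) pair. The prefix used
   in the XML source never enters the lookup. */
static unsigned int toyMapTagFor(ToyNameMap *map, const char *uri,
                                 const char *localName)
{
    unsigned int nsID = toyMapNamespaceFor(map, uri);
    if (nsID == TOY_INVALID)
        return TOY_INVALID;
    for (unsigned int i = 0; i < map->tagCount; i++) {
        if (map->tags[i].namespaceID == nsID &&
            strcmp(map->tags[i].localName, localName) == 0)
            return i;
    }
    if (map->tagCount >= TOY_MAX)
        return TOY_INVALID;
    map->tags[map->tagCount] = (ToyTagDecl){ nsID, localName };
    return map->tagCount++;
}
```

In this sketch, looking up the XHTML URI with "h1" returns the constant TOY_HTML_H1, while a pair the map has never seen gets a fresh value that stays stable for the lifetime of that map.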

Given a particular tag, there are functions available to get both the namespace URI and local
name that tag represents. For example, the tag HTML_H1 has the namespace URI “http://www.w3.org/1999/xhtml”
and the local name “h1”. Suppose you have a node
n with this tag. If you do the following, you will see these two string values:

    DFNameMap *map = n->doc->map;
    const TagDecl *td = DFNameMapNameForTag(map,n->tag);
    const NamespaceDecl *ns = DFNameMapNamespaceForID(map,td->namespaceID);
    printf("namespace URI = %s\n",ns->namespaceURI);
    printf("local name = %s\n",td->localName);

output:

    namespace URI = http://www.w3.org/1999/xhtml
    local name = h1

I’ll go through this step by step:

1. DFNameMap *map = n->doc->map;

This gets a reference to the name map used by the document. Each DFDocument object has a separate
map and any tags and namespaceIDs that are not hard-coded must be interpreted within the context
of this map. We don’t actually need this as a separate variable; we could just refer to
n->doc->map directly in the following two steps.

2. const TagDecl *td = DFNameMapNameForTag(map,n->tag);

This gets the tag declaration, which consists of a namespace ID and a string name. TagDecl
is defined in DFXMLNames.h as follows:

typedef struct {
    unsigned int namespaceID;
    const char *localName;
} TagDecl;

3. const NamespaceDecl *ns = DFNameMapNamespaceForID(map,td->namespaceID);

This gets the namespace declaration, which consists of the namespace URI and a prefix
(if there were multiple prefixes used in the input file, this will be the first; for a new
document, this will be the default prefix defined in DFXMLNamespaces.c).

NamespaceDecl is defined in DFXMLNamespaces.h as follows:

typedef struct {
    const char *namespaceURI;
    const char *prefix;
} NamespaceDecl;

So that’s how we get from Tag -> (localName, namespaceID) -> (namespaceURI, prefix).
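
The chain above can be sketched in a self-contained form. This assumes two small static tables in place of the per-document name map. The struct layouts are the ones quoted above, but the TOY_ constants, the table contents, the "html" prefix, and the toy function names are hypothetical and chosen only for illustration.

```c
#include <stdio.h>
#include <stddef.h>

typedef struct {
    unsigned int namespaceID;
    const char *localName;
} TagDecl;

typedef struct {
    const char *namespaceURI;
    const char *prefix;
} NamespaceDecl;

/* Hypothetical IDs and tables; the real values live in the DocFormats
   headers and in each document's name map. */
enum { TOY_NS_HTML, TOY_NS_COUNT };
enum { TOY_HTML_H1, TOY_HTML_BODY, TOY_TAG_COUNT };

static const NamespaceDecl toyNamespaces[TOY_NS_COUNT] = {
    [TOY_NS_HTML] = { "http://www.w3.org/1999/xhtml", "html" },
};

static const TagDecl toyTags[TOY_TAG_COUNT] = {
    [TOY_HTML_H1] = { TOY_NS_HTML, "h1" },
    [TOY_HTML_BODY] = { TOY_NS_HTML, "body" },
};

/* Step 1: tag -> (namespaceID, localName) */
static const TagDecl *toyNameForTag(unsigned int tag)
{
    return (tag < TOY_TAG_COUNT) ? &toyTags[tag] : NULL;
}

/* Step 2: namespaceID -> (namespaceURI, prefix) */
static const NamespaceDecl *toyNamespaceForID(unsigned int nsID)
{
    return (nsID < TOY_NS_COUNT) ? &toyNamespaces[nsID] : NULL;
}

/* Follow both steps and write "{namespaceURI}localName" into buf.
   Returns 0 on success, -1 for an unknown tag or a buffer that is too small. */
static int toyDescribeTag(unsigned int tag, char *buf, size_t size)
{
    const TagDecl *td = toyNameForTag(tag);
    if (td == NULL)
        return -1;
    const NamespaceDecl *ns = toyNamespaceForID(td->namespaceID);
    if (ns == NULL)
        return -1;
    int n = snprintf(buf, size, "{%s}%s", ns->namespaceURI, td->localName);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

Calling toyDescribeTag(TOY_HTML_H1, buf, sizeof(buf)) fills buf with the URI and local name for that tag; an out-of-range tag returns -1 instead of dereferencing a missing entry.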

Did I mention that XML namespaces suck? ;) This is why everyone uses JSON these days when
building web APIs. But I digress...

So that’s more than you probably wanted to know about how DocFormats tries to cover over
this tragic design choice. It’s not so bad though, as there are some convenience functions
that do the above for you:

If you have a tag, you can just do:

    printf("namespace URI = %s\n",DFTagURI(n->doc,n->tag));
    printf("local name = %s\n",DFTagName(n->doc,n->tag));

Even simpler, if you have a node, you can do:

    printf("namespace URI = %s\n",DFNodeURI(n));
    printf("local name = %s\n",DFNodeName(n));

The above two lines of code are all you need to get a string representation of the namespace
URI and local name; all the DFNameMap stuff is hidden away inside these functions (you can
have a look at them in DFDOM.c to see how they work; it’s basically what I described above).

Having said all this, the only time it should be necessary to actually look at the string
representation of tag names or namespaces is for debugging purposes. One of the motivations
behind using integers for representing these, in addition to abstracting over the whole prefix
mapping mess, was to avoid the need for string comparisons, thus leading to better performance
(which is mainly an issue where you want to test against 20+ possibilities in a loop).
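
As a rough illustration of the difference, the sketch below tests whether a node is a heading in both styles. The TOY_ constants and both function names are hypothetical stand-ins (DocFormats defines the real constants, such as HTML_H1), and no timing is claimed here; the point is only the shape of the code.

```c
#include <string.h>

/* Hypothetical tag constants standing in for the hard-coded ones. */
enum { TOY_HTML_H1, TOY_HTML_H2, TOY_HTML_H3, TOY_HTML_P, TOY_HTML_BODY };

/* Integer tags: a single switch with constants as case labels. */
static int toyIsHeadingTag(unsigned int tag)
{
    switch (tag) {
        case TOY_HTML_H1:
        case TOY_HTML_H2:
        case TOY_HTML_H3:
            return 1;
        default:
            return 0;
    }
}

/* String names: one strcmp per candidate, and the namespace is not
   checked at all, so "h1" in any namespace would match. */
static int toyIsHeadingName(const char *localName)
{
    static const char *headings[] = { "h1", "h2", "h3" };
    for (size_t i = 0; i < sizeof(headings) / sizeof(headings[0]); i++) {
        if (strcmp(localName, headings[i]) == 0)
            return 1;
    }
    return 0;
}
```

The string version would also need a separate comparison of the namespace URI to be equivalent, whereas the integer tag already identifies the (namespace URI, localName) pair.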

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

> On 10 May 2015, at 8:52 am, Gabriela Gibson <gabriela.gibson@gmail.com> wrote:
> 
> Hi,
> 
> So far I got my branch to produce a list of html nodes (and report on
> still missing stuff).
> 
> This is probably a good point to have a look if the approach I'm using
> here is any good.
> 
> It of course has quite a few warts still, and I think I will need to
> add function pointers to the ODF_to_HTML_key struct to deal with some
> special cases.  If that struct is a good idea that is.
> 
> The branch can be found here:
> 
> https://github.com/apache/incubator-corinthia/commit/c81e68626489b9515e7e8f3a5ce5d38ac8f59af0
> 
> I added the test odt file I was using, plus the current output of the program.
> 
> thanks for looking,
> 
> G
> 
> -- 
> Visit my Coding Diary: http://gabriela-gibson.blogspot.com/

