openoffice-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andre Fischer <awf....@gmail.com>
Subject Re: News about the new OOXML framework.
Date Tue, 10 Jun 2014 09:58:56 GMT
Another update of my progress.

I can now create a validating parser, i.e. one that checks that a 
document conforms to the specs while it parses its content.
At the moment the validation is restricted to complex types (as opposed 
to simple types and attributes) but I think that is the hardest part.

One NFA (non-deterministic finite automaton) is created for each complex 
type and one for the top level elements.  The NFAs are then converted 
into equivalent DFAs (deterministic FAs) and finally minimized (via the 
Hopcroft algorithm).  The minimization step became necessary when I 
added support for the 'all' schema element which states that its 
children each occur once in arbitrary order. Recognizing this with an FA 
leads to enumerate all permutations of the children.  With n children 
there are n! permutations.  Luckily the 'all' element is used only once 
and then only for 7 children (7! = 5040).

Here are some numbers:
The 1st and 4th edition of the ECMA-376 specification (1st edition is 
what is used by MS Office, 4th edition is equivalent to the ISO 
standard) have 40 schema files.
These contain 1917 complex types and 781 simple types.
Used are 1851 complex types and 727 simple types (have to check if there 
are really unused complex types or if my optimization is broken).

The non-validating parser has 1853 states and 6987 transitions.

The validating parser has 129530 states and 43512 transitions after 
creating the NFAs.
After conversion to DFAs there remain 20999 states and 73772 transitions.
After minimization there are 6097 states and 34286 transitions.

Please note that the time for parsing OOXML documents does not depend on 
the number of states or transitions.   It only depends on the length of 
the input.  The number of states and transitions only make the parser 
bigger.

Progress and commits are tracked in issue 125035.

Best regards,
Andre


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@openoffice.apache.org
For additional commands, e-mail: dev-help@openoffice.apache.org


Mime
View raw message