Another update of my progress.
I can now create a validating parser, i.e. one that checks that a
document conforms to the specs while it parses its content.
At the moment the validation is restricted to complex types (as opposed
to simple types and attributes) but I think that is the hardest part.
One NFA (nondeterministic finite automaton) is created for each complex
type and one for the top level elements. The NFAs are then converted
into equivalent DFAs (deterministic FAs) and finally minimized (via the
Hopcroft algorithm). The minimization step became necessary when I
added support for the 'all' schema element which states that its
children each occur once in arbitrary order. Recognizing this with an FA
leads to enumerate all permutations of the children. With n children
there are n! permutations. Luckily the 'all' element is used only once
and then only for 7 children (7! = 5040).
Here are some numbers:
The 1st and 4th edition of the ECMA376 specification (1st edition is
what is used by MS Office, 4th edition is equivalent to the ISO
standard) have 40 schema files.
These contain 1917 complex types and 781 simple types.
Used are 1851 complex types and 727 simple types (have to check if there
are really unused complex types or if my optimization is broken).
The nonvalidating parser has 1853 states and 6987 transitions.
The validating parser has 129530 states and 43512 transitions after
creating the NFAs.
After conversion to DFAs there remain 20999 states and 73772 transitions.
After minimization there are 6097 states and 34286 transitions.
Please note that the time for parsing OOXML documents does not depend on
the number of states or transitions. It only depends on the length of
the input. The number of states and transitions only make the parser
bigger.
Progress and commits are tracked in issue 125035.
Best regards,
Andre

To unsubscribe, email: devunsubscribe@openoffice.apache.org
For additional commands, email: devhelp@openoffice.apache.org
