openoffice-dev mailing list archives

From Jürgen Schmidt <jogischm...@gmail.com>
Subject Re: News about the new OOXML framework.
Date Wed, 11 Jun 2014 07:34:02 GMT
On 10/06/14 23:54, Marcus (OOo) wrote:
> Am 06/10/2014 11:58 AM, schrieb Andre Fischer:
>> Another update of my progress.
>>
>> I can now create a validating parser, i.e. one that checks that a
>> document conforms to the specs while it parses its content.
>> At the moment the validation is restricted to complex types (as opposed
>> to simple types and attributes) but I think that is the hardest part.
>>
>> One NFA (non-deterministic finite automaton) is created for each complex
>> type and one for the top level elements. The NFAs are then converted
>> into equivalent DFAs (deterministic FAs) and finally minimized (via the
>> Hopcroft algorithm). The minimization step became necessary when I added
>> support for the 'all' schema element which states that its children each
>> occur once in arbitrary order. Recognizing this with an FA requires
>> enumerating all permutations of the children. With n children there are n!
>> permutations. Luckily the 'all' element is used only once and then only
>> for 7 children (7! = 5040).
>>
>> Here are some numbers:
>> The 1st and 4th edition of the ECMA-376 specification (1st edition is
>> what is used by MS Office, 4th edition is equivalent to the ISO
>> standard) have 40 schema files.
>> These contain 1917 complex types and 781 simple types.
>> Of these, 1851 complex types and 727 simple types are actually used (I
>> have to check whether there really are unused complex types or whether
>> my optimization is broken).
>>
>> The non-validating parser has 1853 states and 6987 transitions.
>>
>> The validating parser has 129530 states and 43512 transitions after
>> creating the NFAs.
>> After conversion to DFAs there remain 20999 states and 73772 transitions.
>> After minimization there are 6097 states and 34286 transitions.
>>
>> Please note that the time for parsing OOXML documents does not depend on
>> the number of states or transitions. It only depends on the length of
>> the input. The number of states and transitions only makes the parser
>> bigger.
>>
>> Progress and commits are tracked in issue 125035.
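
As a rough illustration of why minimization pays off for the 'all' element described above, here is a minimal Python sketch. It is not the framework's code: the function names are hypothetical, the automaton is built directly as a deterministic trie of all permutations rather than via NFAs, and it uses simple Moore-style partition refinement instead of the Hopcroft algorithm. Under those assumptions, the states for n children collapse to one per subset of already-seen children, i.e. 2^n.

```python
from itertools import permutations

def build_permutation_trie(symbols):
    """Build a partial DFA (a trie) with one path per permutation of `symbols`."""
    transitions = [{}]  # transitions[state][symbol] -> next state
    accepting = set()
    for perm in permutations(symbols):
        state = 0
        for sym in perm:
            if sym not in transitions[state]:
                transitions.append({})
                transitions[state][sym] = len(transitions) - 1
            state = transitions[state][sym]
        accepting.add(state)
    return transitions, accepting

def count_minimal_states(transitions, accepting, symbols):
    """Moore-style partition refinement; returns the number of equivalence classes.

    Missing transitions are treated as leading to an implicit dead state (-1),
    which is not counted.
    """
    block = [1 if s in accepting else 0 for s in range(len(transitions))]
    while True:
        signatures = {}
        new_block = []
        for s, out in enumerate(transitions):
            sig = (block[s],) + tuple(
                block[out[a]] if a in out else -1 for a in symbols
            )
            new_block.append(signatures.setdefault(sig, len(signatures)))
        if len(signatures) == len(set(block)):  # no block was split: stable
            return len(signatures)
        block = new_block

children = "abcdefg"  # stand-ins for the 7 children of the 'all' element
trie, final = build_permutation_trie(children)
print(len(trie))                                     # 13700 states before minimization
print(count_minimal_states(trie, final, children))   # 128 states after (2**7)
```

The state counts here belong to this toy construction only and are not meant to reproduce the figures quoted above.
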
> 
> That's exciting. I cannot really follow the technical stuff or what
> it all means, but it sounds like great progress in the OOXML area.
> And for me that is exciting.
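
For readers who want a feel for the point that parsing time depends only on the length of the input, here is a small Python sketch of a table-driven DFA. The element names, the transition table and the function name are made up for illustration and have nothing to do with the real OOXML schemas; the sketch only shows that each input token costs exactly one table lookup, however large the table is.

```python
def run_dfa(table, start, accepting, tokens):
    """Return (accepted, steps); performs one table lookup per input token."""
    state = start
    steps = 0
    for tok in tokens:
        steps += 1
        state = table.get((state, tok))
        if state is None:  # no transition defined: input is invalid
            return False, steps
    return state in accepting, steps

# Hypothetical content model: one "head", any number of "item", one "foot".
table = {
    (0, "head"): 1,
    (1, "item"): 1,
    (1, "foot"): 2,
}

print(run_dfa(table, 0, {2}, ["head", "item", "item", "foot"]))  # (True, 4)
print(run_dfa(table, 0, {2}, ["head", "foot", "item"]))          # (False, 3)
```

Adding more states or transitions makes `table` larger, but the loop still runs at most once per token.
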

Well, it simply means that we have already spent a lot of time analyzing
what we have and working out what we can improve and how. And it shows
that we follow a very professional approach ;-)

We are serious about improved OOXML support, including export. It is
what users need and expect, because interoperability with MS Office is
still very important and OOXML is becoming more and more popular and
widely used, whether we like it or not.

This also covers a lot of groundwork, refactoring and feature
development over time to make it possible. The parsing framework is only
one part of it, but a very important one.

And we will propose and discuss all the things we have in mind in detail
here on the list, so that everybody who is interested can join our efforts.

Juergen




