corinthia-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From jan i <j...@apache.org>
Subject Re: Flat: Formal LAnguage Translation
Date Sun, 21 Jun 2015 15:07:17 GMT
Hi

I like the idea about formal language translation, it is commonly used
between
compilers (I used the principle years ago writing a basic->C cross
compiler).

However I have a feeling that the exceptions (maybe not the right word) in
OOXML and ODF
makes a formal language converter next to impossible. F.x. converting
tables from ODF to OOXML would be a piece of cake, but going from OOXML to
ODF you first need to detect if
it is a table at all or just styled paragraphs.

I do not want to sound negative, but I do see theoritical problems because
the formats can
be used in so many different ways (unlike the compiler case, where the
compiler itself is the
ultimate judge).

rgds
jan i

On Tuesday, June 16, 2015, Ian C <ian@amham.net> wrote:

> Thanks Peter,
>
> On Tue, Jun 16, 2015 at 2:16 AM, Peter Kelly <pmkelly@apache.org
> <javascript:;>> wrote:
>
> > For the past year or so I’ve been playing around with the idea of
> > developing a programming language specifically for expressing
> translations
> > between different formal languages (in the programming language sense,
> not
> > natural languages).
> >
> > A document format used for word processing/other office-style
> applications
> > can be considered a formal language, where it is possible to specify its
> > structure in some formal notation, such as BNF. ODF and OOXML both fit
> this
> > criteria, as they use XML and their tree structure is defined by the
> > RelaxNG schemas that are part of the specifications - there are copies of
> > these in the ‘schemas’ directory in the Corinthia repository. The
> > zip-packaging complicates this a little, but it would be feasible to
> find a
> > way to express this formally in terms of what files are required in a
> > package, how they relate to one another, and what schema each must
> conform
> > to.
> >
> > I believe that ODF, OOXML, and HTML are not the only file formats we
> > should consider supporting in Corinthia - there are numerous others,
> > including:
> >
> > - Markdown [1] / MultiMarkdown [2] / CommonMark [3]
> > - reStruturedText [4] (popular in the Python world)
> > - AsciiDoc [5]
> > - RTF [6]
> > - LaTeX [7]
> > - Various Wiki markup languages
> >
> > All of these languages are difficult to parse manually, and really need
> to
> > be tackled through the use of a parser generator or other language
> > processing toolkit.
> >
> > Let’s face it, C sucks for what we’re doing, as it’s too low level and we
> > constantly have to take care of manual memory management, awkward string
> > handling, and protecting against crashes that can so often arise due to
> > NULL pointer dereferences, memory corruption, use-after-free and other
> > bugs. The Objective C version of the code was better, but still not
> ideal,
> > and not cross platform.
> >
> > One programming language I’ve been especially inspired by in some past
> > work I’ve done is Stratego/XT [8]. It includes the following features:
> >
> > - A grammar specification language (SDF), which can be used as input to
> > parser generators, which will automatically construct abstract syntax
> > trees, expressed in the ATerm format (which is similar to Lisp’s
> > S-expressions).
> >
> > - The Stratego language, which allows specification of rewrite rules
> based
> > on pattern matching, in a notation similar to that used to express the
> > denotational semantics of programming languages. The language also
> includes
> > the ability to express strategies for how rewrite rules should be
> applied,
> > such as pre-order or post-order traversal of the syntax tree, repeated
> > attempts at rule application, and so forth; these strategies can be
> > composed in arbitrary ways. The language is also a powerful functional
> > programming language in it’s own right, and has many other features for
> > analysis of syntax trees to determine things like variable scopes.
> >
> > - A pretty-printing language for concisely describing how an abstract
> > syntax tree can be turned back into a text file in a given language, for
> > later compilation or editing.
> >
> > Sadly, Stratego/XT itself seems to have fallen into a state of disrepair,
> > and many of the links on the website are broken (but can probably be dug
> up
> > with a bit of work). Their publications page (
> > http://strategoxt.org/Stratego/StrategoPublications <
> > http://strategoxt.org/Stratego/StrategoPublications>) has plenty of good
> > info though. They’ve now moved on to what is called the Spoofax Language
> > Workbench which, even more sadly, depends on both Java and Eclipse. I’m
> > sure it’s possible to get the core language operating standalone, but I
> > haven’t looked into it.
> >
> > Anyway, Stratego fundamentally changed the way I thought about the
> process
> > of writing compilers, and opened my eyes to a whole world of
> possibilities
> > in program transformation. I've found it useful in several different
> > contexts, such as instrumenting code for performance measurement, and
> > detecting plagiarism in student code submissions for CS assignments.
> >
> > Another important development in the last decade or so has been Parsing
> > Expression Grammars (PEGs), and packrat parsing [10]. The two key papers
> > you want to read on these are [11] and [12], although Bryan Ford and
> others
> > have published many other papers on the technique. PEGs are context-aware
> > grammars that combine both lexical and syntactic analysis (unlike the
> > separation seen in flex & bison). They are readily composable, and avoid
> > many of the ambiguity problems inherent with context-free grammars, such
> as
> > the “dangling else” problem. A good example of a PEG-based parser
> > implementation is Grako [13], which is implemented in Python.
> >
> > Having seen these various things, and thinking about how to make the
> > process of writing document format translators easier, I’m of the view
> that
> > it would be immensely advantageous for us to have these concepts combined
> > into a single language/runtime library. In principle, this could be
> > developed independently of Corinthia, but I think would be better done as
> > part of the project as it’s something which we have a very direct
> > application for, namely writing filters for the above-mentioned file
> > formats, and possibly porting the OOXML filter and ODF filter across to
> as
> > well. It’s also been several years since I’ve had a chance to sink my
> teeth
> > into some real programming language research, and I’m keen to get back to
> > my roots.
> >
> > I’ve done about seven or eight prototypes so far, in both C and more
> > recently Python - so far addressing only the parsing and syntax tree
> > construction aspects, without any thought yet as to the language itself.
> > I’m currently spending a lot of time with the famous “Dragon Book”,
> > reviewing the chapters on parsing and syntax-directed translation, and
> > implementing the algorithms described in the text to solidify my
> > understanding of how they work. I can share the code if you’re
> interested,
> > though right now it’s all completely undocumented and nowhere near ready
> > for use by others, and you’ll probably find either the dragon book itself
> > or other online tutorials to be more useful. I should emphasis that it’s
> > all throwaway code, and I intend on doing a “proper” implementation from
> > scratch within the Corinthia repository.
> >
> > I’ve pretty much figured out what needs to be done for the parsing and
> > syntax tree construction, and the next step is coming up with a suitable
> > runtime representation, that can be used both directly from C code, and
> > from the “Flat” language (whatever form it eventually takes), and
> > potentially bindings to other languages like Python and JavaScript. What
> I
> > envisage here is an approach in which all such data structures are
> > immutable, structured using ATerms (which Stratego uses, and are like
> > S-expressions), and have transformations expressed in a purely functional
> > style, with Stratego-style pattern matching.
> >
> > Our current runtime representation of data obtained from source files is
> > XML, since this is what the currently-supported formats are based on. We
> > could either stick with this, translating to and from the ATerm-style
> > syntax where appropriate, or (and I would argue this is better), to use a
> > common type system for all transformations. The advantage of the latter
> is
> > that all transformations could be subject to type checking, which means
> the
> > Flat compiler would statically guarantee that the output produced by
> e.g. a
> > HTML to ODF translation fully conforms to the ODF schema, the same way
> that
> > a C/Java compiler does type checking. This would allow us to have
> > formally-verified translations between file formats.
> >
> > The transformations would not necessarily have to use code generation; we
> > could alternatively implement them via an interpreter. Both could be
> > supported, with back-ends that output in C code, or bytecode that is
> > understood by the interpreter. In either case, there will need to be
> > runtime support consisting of a garbage collector and likely an exception
> > handling mechanism; a debugger and a Python-style REPL interface would
> also
> > be handy for development purposes.
> >
> > As an example of the type of grammar I’m thinking of, see the following
> > one for MultiMarkdown (but imagine it without the semantic actions, which
> > are written in C):
> >
> > https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg <
> > https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg>
> >
> > I still have a lot of reading to do on many topics, and I’ve been meaning
> > to start discussing it here on the list for a long time but have yet to
> get
> > around to it. Sorry for dumping so much stuff at once (and I realise it’s
> > pretty heavy), but at least now we can start to get the discussion going
> > about what we want in a transformation language, and how we can make the
> > development of new filters easier.
> >
>
> Head exploded now :-) Whilst a lot of the detail went over my head I have
> been working with ANTLR for the past few years and found 11 and 12
> interesting. But the heavy maths leaves me high and dry. Notionally though
> I do think we should/could express the mapping between elements in such a
> way that the translations could be generated, without missing cases. Or at
> least unknowingly missing cases.
>
> My work in my day job with the compiler I work on was the feeder for the
> tool I developed to track how much of a schema is used and how. Which seems
> to me to be very similar problem. Instead of just tracking we map or
> generate a new representation.
>
> >
> > [1] http://daringfireball.net/projects/markdown/syntax <
> > http://daringfireball.net/projects/markdown/syntax>
> > [2] http://fletcherpenney.net/multimarkdown/ <
> > http://fletcherpenney.net/multimarkdown/>
> > [3] http://commonmark.org <http://commonmark.org/>
> > [4] http://docutils.sourceforge.net/rst.html <
> > http://docutils.sourceforge.net/rst.html>
> > [5] http://www.methods.co.nz/asciidoc/ <
> http://www.methods.co.nz/asciidoc/
> > >
> > [6] http://www.microsoft.com/en-US/Download/details.aspx?id=10725 <
> > http://www.microsoft.com/en-US/Download/details.aspx?id=10725>
> > [7] http://www.latex-project.org <http://www.latex-project.org/>
> > [8] http://strategoxt.org <http://strategoxt.org/>
> > [9] http://strategoxt.org/Spoofax/WebHome <
> > http://strategoxt.org/Spoofax/WebHome>
> > [10] http://bford.info/packrat/ <http://bford.info/packrat/>
> > [11] http://bford.info/pub/lang/packrat-icfp02.pdf <
> > http://bford.info/pub/lang/packrat-icfp02.pdf>
> > [12] http://bford.info/pub/lang/peg.pdf <
> > http://bford.info/pub/lang/peg.pdf>
> > [13] https://pypi.python.org/pypi/grako/3.6.1 <
> > https://pypi.python.org/pypi/grako/3.6.1>
> > [14]
> >
> http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811
> > <
> >
> http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811
> > >
> >
> > —
> > Dr Peter M. Kelly
> > pmkelly@apache.org <javascript:;>
> >
> > PGP key: http://www.kellypmk.net/pgp-key <
> http://www.kellypmk.net/pgp-key>
> > (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
> >
> >
> --
> Cheers,
>
> Ian C
>


-- 
Sent from My iPad, sorry for any misspellings.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message