corinthia-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Kelly <pmke...@apache.org>
Subject Flat: Formal LAnguage Translation
Date Mon, 15 Jun 2015 18:16:33 GMT
For the past year or so I’ve been playing around with the idea of developing a programming
language specifically for expressing translations between different formal languages (in the
programming language sense, not natural languages).

A document format used for word processing/other office-style applications can be considered
a formal language, where it is possible to specify its structure in some formal notation,
such as BNF. ODF and OOXML both fit this criteria, as they use XML and their tree structure
is defined by the RelaxNG schemas that are part of the specifications - there are copies of
these in the ‘schemas’ directory in the Corinthia repository. The zip-packaging complicates
this a little, but it would be feasible to find a way to express this formally in terms of
what files are required in a package, how they relate to one another, and what schema each
must conform to.

I believe that ODF, OOXML, and HTML are not the only file formats we should consider supporting
in Corinthia - there are numerous others, including:

- Markdown [1] / MultiMarkdown [2] / CommonMark [3]
- reStruturedText [4] (popular in the Python world)
- AsciiDoc [5]
- RTF [6]
- LaTeX [7]
- Various Wiki markup languages

All of these languages are difficult to parse manually, and really need to be tackled through
the use of a parser generator or other language processing toolkit.

Let’s face it, C sucks for what we’re doing, as it’s too low level and we constantly
have to take care of manual memory management, awkward string handling, and protecting against
crashes that can so often arise due to NULL pointer dereferences, memory corruption, use-after-free
and other bugs. The Objective C version of the code was better, but still not ideal, and not
cross platform.

One programming language I’ve been especially inspired by in some past work I’ve done
is Stratego/XT [8]. It includes the following features:

- A grammar specification language (SDF), which can be used as input to parser generators,
which will automatically construct abstract syntax trees, expressed in the ATerm format (which
is similar to Lisp’s S-expressions).

- The Stratego language, which allows specification of rewrite rules based on pattern matching,
in a notation similar to that used to express the denotational semantics of programming languages.
The language also includes the ability to express strategies for how rewrite rules should
be applied, such as pre-order or post-order traversal of the syntax tree, repeated attempts
at rule application, and so forth; these strategies can be composed in arbitrary ways. The
language is also a powerful functional programming language in it’s own right, and has many
other features for analysis of syntax trees to determine things like variable scopes.

- A pretty-printing language for concisely describing how an abstract syntax tree can be turned
back into a text file in a given language, for later compilation or editing.

Sadly, Stratego/XT itself seems to have fallen into a state of disrepair, and many of the
links on the website are broken (but can probably be dug up with a bit of work). Their publications
page (http://strategoxt.org/Stratego/StrategoPublications <http://strategoxt.org/Stratego/StrategoPublications>)
has plenty of good info though. They’ve now moved on to what is called the Spoofax Language
Workbench which, even more sadly, depends on both Java and Eclipse. I’m sure it’s possible
to get the core language operating standalone, but I haven’t looked into it.

Anyway, Stratego fundamentally changed the way I thought about the process of writing compilers,
and opened my eyes to a whole world of possibilities in program transformation. I've found
it useful in several different contexts, such as instrumenting code for performance measurement,
and detecting plagiarism in student code submissions for CS assignments.

Another important development in the last decade or so has been Parsing Expression Grammars
(PEGs), and packrat parsing [10]. The two key papers you want to read on these are [11] and
[12], although Bryan Ford and others have published many other papers on the technique. PEGs
are context-aware grammars that combine both lexical and syntactic analysis (unlike the separation
seen in flex & bison). They are readily composable, and avoid many of the ambiguity problems
inherent with context-free grammars, such as the “dangling else” problem. A good example
of a PEG-based parser implementation is Grako [13], which is implemented in Python.

Having seen these various things, and thinking about how to make the process of writing document
format translators easier, I’m of the view that it would be immensely advantageous for us
to have these concepts combined into a single language/runtime library. In principle, this
could be developed independently of Corinthia, but I think would be better done as part of
the project as it’s something which we have a very direct application for, namely writing
filters for the above-mentioned file formats, and possibly porting the OOXML filter and ODF
filter across to as well. It’s also been several years since I’ve had a chance to sink
my teeth into some real programming language research, and I’m keen to get back to my roots.

I’ve done about seven or eight prototypes so far, in both C and more recently Python - so
far addressing only the parsing and syntax tree construction aspects, without any thought
yet as to the language itself. I’m currently spending a lot of time with the famous “Dragon
Book”, reviewing the chapters on parsing and syntax-directed translation, and implementing
the algorithms described in the text to solidify my understanding of how they work. I can
share the code if you’re interested, though right now it’s all completely undocumented
and nowhere near ready for use by others, and you’ll probably find either the dragon book
itself or other online tutorials to be more useful. I should emphasis that it’s all throwaway
code, and I intend on doing a “proper” implementation from scratch within the Corinthia
repository.

I’ve pretty much figured out what needs to be done for the parsing and syntax tree construction,
and the next step is coming up with a suitable runtime representation, that can be used both
directly from C code, and from the “Flat” language (whatever form it eventually takes),
and potentially bindings to other languages like Python and JavaScript. What I envisage here
is an approach in which all such data structures are immutable, structured using ATerms (which
Stratego uses, and are like S-expressions), and have transformations expressed in a purely
functional style, with Stratego-style pattern matching.

Our current runtime representation of data obtained from source files is XML, since this is
what the currently-supported formats are based on. We could either stick with this, translating
to and from the ATerm-style syntax where appropriate, or (and I would argue this is better),
to use a common type system for all transformations. The advantage of the latter is that all
transformations could be subject to type checking, which means the Flat compiler would statically
guarantee that the output produced by e.g. a HTML to ODF translation fully conforms to the
ODF schema, the same way that a C/Java compiler does type checking. This would allow us to
have formally-verified translations between file formats.

The transformations would not necessarily have to use code generation; we could alternatively
implement them via an interpreter. Both could be supported, with back-ends that output in
C code, or bytecode that is understood by the interpreter. In either case, there will need
to be runtime support consisting of a garbage collector and likely an exception handling mechanism;
a debugger and a Python-style REPL interface would also be handy for development purposes.

As an example of the type of grammar I’m thinking of, see the following one for MultiMarkdown
(but imagine it without the semantic actions, which are written in C):

https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg <https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg>

I still have a lot of reading to do on many topics, and I’ve been meaning to start discussing
it here on the list for a long time but have yet to get around to it. Sorry for dumping so
much stuff at once (and I realise it’s pretty heavy), but at least now we can start to get
the discussion going about what we want in a transformation language, and how we can make
the development of new filters easier.

[1] http://daringfireball.net/projects/markdown/syntax <http://daringfireball.net/projects/markdown/syntax>
[2] http://fletcherpenney.net/multimarkdown/ <http://fletcherpenney.net/multimarkdown/>
[3] http://commonmark.org <http://commonmark.org/>
[4] http://docutils.sourceforge.net/rst.html <http://docutils.sourceforge.net/rst.html>
[5] http://www.methods.co.nz/asciidoc/ <http://www.methods.co.nz/asciidoc/>
[6] http://www.microsoft.com/en-US/Download/details.aspx?id=10725 <http://www.microsoft.com/en-US/Download/details.aspx?id=10725>
[7] http://www.latex-project.org <http://www.latex-project.org/>
[8] http://strategoxt.org <http://strategoxt.org/>
[9] http://strategoxt.org/Spoofax/WebHome <http://strategoxt.org/Spoofax/WebHome>
[10] http://bford.info/packrat/ <http://bford.info/packrat/>
[11] http://bford.info/pub/lang/packrat-icfp02.pdf <http://bford.info/pub/lang/packrat-icfp02.pdf>
[12] http://bford.info/pub/lang/peg.pdf <http://bford.info/pub/lang/peg.pdf>
[13] https://pypi.python.org/pypi/grako/3.6.1 <https://pypi.python.org/pypi/grako/3.6.1>
[14] http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811 <http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811>

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message