corinthia-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian C <...@amham.net>
Subject Re: Flat: Formal LAnguage Translation
Date Tue, 16 Jun 2015 12:25:25 GMT
Thanks Peter,

On Tue, Jun 16, 2015 at 2:16 AM, Peter Kelly <pmkelly@apache.org> wrote:

> For the past year or so I’ve been playing around with the idea of
> developing a programming language specifically for expressing translations
> between different formal languages (in the programming language sense, not
> natural languages).
>
> A document format used for word processing/other office-style applications
> can be considered a formal language, where it is possible to specify its
> structure in some formal notation, such as BNF. ODF and OOXML both fit this
> criteria, as they use XML and their tree structure is defined by the
> RelaxNG schemas that are part of the specifications - there are copies of
> these in the ‘schemas’ directory in the Corinthia repository. The
> zip-packaging complicates this a little, but it would be feasible to find a
> way to express this formally in terms of what files are required in a
> package, how they relate to one another, and what schema each must conform
> to.
>
> I believe that ODF, OOXML, and HTML are not the only file formats we
> should consider supporting in Corinthia - there are numerous others,
> including:
>
> - Markdown [1] / MultiMarkdown [2] / CommonMark [3]
> - reStruturedText [4] (popular in the Python world)
> - AsciiDoc [5]
> - RTF [6]
> - LaTeX [7]
> - Various Wiki markup languages
>
> All of these languages are difficult to parse manually, and really need to
> be tackled through the use of a parser generator or other language
> processing toolkit.
>
> Let’s face it, C sucks for what we’re doing, as it’s too low level and we
> constantly have to take care of manual memory management, awkward string
> handling, and protecting against crashes that can so often arise due to
> NULL pointer dereferences, memory corruption, use-after-free and other
> bugs. The Objective C version of the code was better, but still not ideal,
> and not cross platform.
>
> One programming language I’ve been especially inspired by in some past
> work I’ve done is Stratego/XT [8]. It includes the following features:
>
> - A grammar specification language (SDF), which can be used as input to
> parser generators, which will automatically construct abstract syntax
> trees, expressed in the ATerm format (which is similar to Lisp’s
> S-expressions).
>
> - The Stratego language, which allows specification of rewrite rules based
> on pattern matching, in a notation similar to that used to express the
> denotational semantics of programming languages. The language also includes
> the ability to express strategies for how rewrite rules should be applied,
> such as pre-order or post-order traversal of the syntax tree, repeated
> attempts at rule application, and so forth; these strategies can be
> composed in arbitrary ways. The language is also a powerful functional
> programming language in it’s own right, and has many other features for
> analysis of syntax trees to determine things like variable scopes.
>
> - A pretty-printing language for concisely describing how an abstract
> syntax tree can be turned back into a text file in a given language, for
> later compilation or editing.
>
> Sadly, Stratego/XT itself seems to have fallen into a state of disrepair,
> and many of the links on the website are broken (but can probably be dug up
> with a bit of work). Their publications page (
> http://strategoxt.org/Stratego/StrategoPublications <
> http://strategoxt.org/Stratego/StrategoPublications>) has plenty of good
> info though. They’ve now moved on to what is called the Spoofax Language
> Workbench which, even more sadly, depends on both Java and Eclipse. I’m
> sure it’s possible to get the core language operating standalone, but I
> haven’t looked into it.
>
> Anyway, Stratego fundamentally changed the way I thought about the process
> of writing compilers, and opened my eyes to a whole world of possibilities
> in program transformation. I've found it useful in several different
> contexts, such as instrumenting code for performance measurement, and
> detecting plagiarism in student code submissions for CS assignments.
>
> Another important development in the last decade or so has been Parsing
> Expression Grammars (PEGs), and packrat parsing [10]. The two key papers
> you want to read on these are [11] and [12], although Bryan Ford and others
> have published many other papers on the technique. PEGs are context-aware
> grammars that combine both lexical and syntactic analysis (unlike the
> separation seen in flex & bison). They are readily composable, and avoid
> many of the ambiguity problems inherent with context-free grammars, such as
> the “dangling else” problem. A good example of a PEG-based parser
> implementation is Grako [13], which is implemented in Python.
>
> Having seen these various things, and thinking about how to make the
> process of writing document format translators easier, I’m of the view that
> it would be immensely advantageous for us to have these concepts combined
> into a single language/runtime library. In principle, this could be
> developed independently of Corinthia, but I think would be better done as
> part of the project as it’s something which we have a very direct
> application for, namely writing filters for the above-mentioned file
> formats, and possibly porting the OOXML filter and ODF filter across to as
> well. It’s also been several years since I’ve had a chance to sink my teeth
> into some real programming language research, and I’m keen to get back to
> my roots.
>
> I’ve done about seven or eight prototypes so far, in both C and more
> recently Python - so far addressing only the parsing and syntax tree
> construction aspects, without any thought yet as to the language itself.
> I’m currently spending a lot of time with the famous “Dragon Book”,
> reviewing the chapters on parsing and syntax-directed translation, and
> implementing the algorithms described in the text to solidify my
> understanding of how they work. I can share the code if you’re interested,
> though right now it’s all completely undocumented and nowhere near ready
> for use by others, and you’ll probably find either the dragon book itself
> or other online tutorials to be more useful. I should emphasis that it’s
> all throwaway code, and I intend on doing a “proper” implementation from
> scratch within the Corinthia repository.
>
> I’ve pretty much figured out what needs to be done for the parsing and
> syntax tree construction, and the next step is coming up with a suitable
> runtime representation, that can be used both directly from C code, and
> from the “Flat” language (whatever form it eventually takes), and
> potentially bindings to other languages like Python and JavaScript. What I
> envisage here is an approach in which all such data structures are
> immutable, structured using ATerms (which Stratego uses, and are like
> S-expressions), and have transformations expressed in a purely functional
> style, with Stratego-style pattern matching.
>
> Our current runtime representation of data obtained from source files is
> XML, since this is what the currently-supported formats are based on. We
> could either stick with this, translating to and from the ATerm-style
> syntax where appropriate, or (and I would argue this is better), to use a
> common type system for all transformations. The advantage of the latter is
> that all transformations could be subject to type checking, which means the
> Flat compiler would statically guarantee that the output produced by e.g. a
> HTML to ODF translation fully conforms to the ODF schema, the same way that
> a C/Java compiler does type checking. This would allow us to have
> formally-verified translations between file formats.
>
> The transformations would not necessarily have to use code generation; we
> could alternatively implement them via an interpreter. Both could be
> supported, with back-ends that output in C code, or bytecode that is
> understood by the interpreter. In either case, there will need to be
> runtime support consisting of a garbage collector and likely an exception
> handling mechanism; a debugger and a Python-style REPL interface would also
> be handy for development purposes.
>
> As an example of the type of grammar I’m thinking of, see the following
> one for MultiMarkdown (but imagine it without the semantic actions, which
> are written in C):
>
> https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg <
> https://github.com/fletcher/MultiMarkdown-4/blob/master/parser.leg>
>
> I still have a lot of reading to do on many topics, and I’ve been meaning
> to start discussing it here on the list for a long time but have yet to get
> around to it. Sorry for dumping so much stuff at once (and I realise it’s
> pretty heavy), but at least now we can start to get the discussion going
> about what we want in a transformation language, and how we can make the
> development of new filters easier.
>

Head exploded now :-) Whilst a lot of the detail went over my head I have
been working with ANTLR for the past few years and found 11 and 12
interesting. But the heavy maths leaves me high and dry. Notionally though
I do think we should/could express the mapping between elements in such a
way that the translations could be generated, without missing cases. Or at
least unknowingly missing cases.

My work in my day job with the compiler I work on was the feeder for the
tool I developed to track how much of a schema is used and how. Which seems
to me to be very similar problem. Instead of just tracking we map or
generate a new representation.

>
> [1] http://daringfireball.net/projects/markdown/syntax <
> http://daringfireball.net/projects/markdown/syntax>
> [2] http://fletcherpenney.net/multimarkdown/ <
> http://fletcherpenney.net/multimarkdown/>
> [3] http://commonmark.org <http://commonmark.org/>
> [4] http://docutils.sourceforge.net/rst.html <
> http://docutils.sourceforge.net/rst.html>
> [5] http://www.methods.co.nz/asciidoc/ <http://www.methods.co.nz/asciidoc/
> >
> [6] http://www.microsoft.com/en-US/Download/details.aspx?id=10725 <
> http://www.microsoft.com/en-US/Download/details.aspx?id=10725>
> [7] http://www.latex-project.org <http://www.latex-project.org/>
> [8] http://strategoxt.org <http://strategoxt.org/>
> [9] http://strategoxt.org/Spoofax/WebHome <
> http://strategoxt.org/Spoofax/WebHome>
> [10] http://bford.info/packrat/ <http://bford.info/packrat/>
> [11] http://bford.info/pub/lang/packrat-icfp02.pdf <
> http://bford.info/pub/lang/packrat-icfp02.pdf>
> [12] http://bford.info/pub/lang/peg.pdf <
> http://bford.info/pub/lang/peg.pdf>
> [13] https://pypi.python.org/pypi/grako/3.6.1 <
> https://pypi.python.org/pypi/grako/3.6.1>
> [14]
> http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811
> <
> http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811
> >
>
> —
> Dr Peter M. Kelly
> pmkelly@apache.org
>
> PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
>
>
-- 
Cheers,

Ian C

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message