Yacc is of the Living Dead

The scientific process seems to evolve towards something like pop culture and music industry. For starters, we have publishing houses and record companies.

And then, there is a scientific article with the title "Yacc is dead" [1] which, quite possibly, has reached more people than all the accepted papers of the conference to which it was submitted.

Russ Cox criticized this paper quite harshly in an insightful blog post [2].

So, in short, the paper proposes a method that yields all parse trees and claims that the approach is "efficiently on average". The authors run a benchmark, but of course a benchmark is not exactly a proof. Russ's argument (by regexp analogy) goes something like we can make up grammars that simulate NFAs. Since there are regexps whose NFA will have O(exp(n)) possible successful runs on input of length n, (maybe something like a*a*), getting all parse trees means asking for trouble.

Where I am not so sure is, whether the authors are really unaware of this. It is possible to understand "parses efficiently on average" as "works efficiently for the average grammar programmers want to parse". Writing an ambiguous grammar by accident is one thing, writing something like a*a* is something else. Agreed, a statement like "works well for the average grammar" does not really have much scientific worth -- and Russ's criticism still stands -- but indeed it may work out in practice.

The math that derives working parsers from their specification is beautiful... the issue for practical purposes is that there are are other approaches to combinator parsing -- these are the ones that should be compared with derivative-based method, not yacc.

There is at least one paper (journal publication) out there that shows how memoization can make top-down functional parsers linear [3?], unfortunately I cannot access it so cannot comment on how applicable it is here. What's next, DMCA for papers? <b>J'accuse</b>.

Regardless what is up with yacc or memoization, among functional programming literature, there is a plethora of papers that use parser combinators to achieve polynomial parsing (the first reference that came to my mind is Swierstra and Duponcheel's paper [4]), but the wikipedia page on combinator parsing [5] has a link to more recent -- 2007 -- work from Frost and Hafiz that looks interesting.

There are no conclusions here. Combinator parsing is not new, but with the rising popularity of Scala, it is pretty much alive and kicking. After all, it has a packrat combinator parsing library included. For my next parser, I will certainly use that one and not Might and Darais approach. Still, I think the approach described in Might and Darais paper has educational value and I sincerely hope the authors will come up with a longer, more polished version.

[1] Matthew Might, David Darais. Yacc is dead.

[2] Russ Cox. Yacc is not dead.

[3?] Frost, Szydlowski. Memoizing purely functional top-down backtracking language processors. Science of Computer Programming
Volume 27, Issue 3, November 1996, Pages 263-288

[4] Swierstra, Duponcheel. Deterministic, error correcting parser combinators.  

[5] http://en.wikipedia.org/wiki/Parser_combinator

1 comment:

Matt said...

Thanks for the insightful comment and the helpful pointers.

Needless to say, David and I were touched by all the attention (both positive and critical) the draft received.

The community feedback has been outstanding, and we're well on our way toward a thoughtful revamp.

To be clear, we were aware of the exponential complexity, but felt this wasn't important since it never showed up when we used it in practice.

Moreover, we had also discovered a few optimizations that appear to bring the complexity down to polynomial, but ran out of time to flush them out for the submission last year.

It looks like derivative-based parsing will be the usual O(n^3) in general, and O(n^2) for unambiguous grammars in the worst case. And, of course, fast in practice.

Thanks to all the attention, we're moving forward on the idea again, and we're hoping to publish a significantly expanded and cleaned up draft in the near future.