Sketching a parser for graphviz/dot. Really slow on a ~300 statement input. Got about a 10x improvement by using Left instead of Right-recursion.

Here’s the grammar I used initially, which was inspired by the recursive rules in the dot lang specification. If you’re familiar with dot, you’ll notice some hardcoded elements as well as some missing patterns. This grammar parses dot only partially, but it’s enough for my purpose here. It handles nodes, edges and attributes.

(def grammar "
    G = 'digraph' <WS> graph-name <WS*> block <WS*>

    <ws> = <#'[\\s\\t]+'>           (* whitespace *)
    <WS> = <#'[\\s\\t\\n\\r]+'>   (* multiline whitespace, graphviz is very newline tolerant *)

    <string-q> = <'\"'> #'[^\"]*' <'\"'>
    <string-uq> = #'[a-zA-Z_][a-zA-Z_0-9]+'
    <string> = string-q | string-uq

    block = <'{'> WS* stmt-list WS* <'}'>

    graph-name = string
    node = string

    node-from = node
    node-to = node
    arrow = '->'

    propName = string
    propValue = string
    prop = propName <'='> propValue
    propList = prop <ws*> propList?
    props = <'['> propList? <']'>

    nodeSpec = <WS*> node <WS*> props?
    edgeSpec = <WS*> node-from <WS*> <arrow> <WS*> node-to <WS*> props?

    stmt-list = stmt stmt-list?
    <stmt> = nodeSpec | edgeSpec
   ")

(def parse-dot (instaparse/parser grammar))

There are two right-recursions specified here:

  • propList = prop <ws*> propList?
  • stmt-list = stmt stmt-list?

Running this against my sample input takes about 2 seconds which is huge!

(time (parse-dot input))
"Elapsed time: 2199.23539 msecs"

After looking around for hints, I got to the Instaparse Performance article which says:

Instaparse’s algorithm is in the family of LL parsing algorithms. So if you know how to easily write your grammar as an LL grammar, that’s probably going to yield the best possible performance. If not, don’t worry about it.

To be frank, I have no clue how to easily write my grammar as a LL grammar. But what if I swap the recursion order to left recursion?

  • propList = propList? prop <ws*>
  • stmt-list = stmt-list? stmt
(time (parse-dot input))
"Elapsed time: 284.527448 msecs"

wowzz.. a 10x improvement!

Tracing the parse process

Instaparse has an option to trace it’s execution. Traces get long very fast. The final line of the output is a “profile” report. It can be used to evaluate how the execution scales with input.

I created two inputs to compare the two grammars. First input is a few statements long and the second 10 times larger (only replicating the first one for ten times, to maintain similar parsing complexity).

;; Enable tracing in instaparse
(parse-dot input :trace true)

                                        ; Original grammar, right recursion
;; Small input
Profile:  {:create-node 541
           :push-full-listener 3
           :push-stack 541
           :push-listener 582
           :push-result 369
           :push-message 396}
;; 10x input
Profile:  {:create-node 5092
           :push-full-listener 3
           :push-stack 5092
           :push-listener 5558
           :push-result 8651
           :push-message 8937}

                                        ; Left-recursive grammar
;; Small input
Profile:  {:create-node 506
           :push-full-listener 3
           :push-stack 506
           :push-listener 544
           :push-result 313
           :push-message 354}
;; 10x input
Profile:  {:create-node 4718
           :push-full-listener 3
           :push-stack 4718
           :push-listener 5145
           :push-result 2919
           :push-message 3350}

Looking at push-result, notice how a 10x increase in size corresponds to 23x increase in the right-recursive grammar vs. a linear 10x increase with the left-recursive version.

Win!

Find the left-recursive (partially implemented) grammar for dot/graphviz here: https://gist.github.com/valer-cara/9f7af6b95309fceecde551f51a3fd0c0