
As a language engineer, you essentially have to deal with the parsing and the semantic analysis of (programming) languages. You get your hands dirty with carefully constructing compilers, interpreters, code generators, refactoring tools, and static code analyzers.


Powerful language workbenches such as Xtext or MPS help you a lot with these tasks by offering the generation of:

  • a scanner and parser from your EBNF-like grammar,
  • a meta-model representing the resulting abstract syntax tree (AST),
  • an integrated editor (e.g., for Eclipse or Intellij) with syntax highlighting and code completion,
  • stubs for code generation or meta-model transformation in general, and
  • stubs for interpreters.

Hence, you are very well supported for all the necessary business that comes after parsing. But what do you do if you do not want to face the mammoth task of learning these tools? Or if you have to work in a technical space with limited resources?

This article showcases embedded parsing and evaluation applicable to small embedded Domain Specific Languages (DSLs), utilizing three concepts.

Firstly, a pure embedding in a powerful, statically typed host language: Scala. This allows for high interoperability not just with existing general-purpose Scala code, but also with other languages running on the Java Virtual Machine (JVM). There is no need for a separate pre-processor or language extension.

Secondly, parser combinators. They allow for compact, language-integrated scanning and parsing. Furthermore, handy predefined parsers (e.g., for Java-style tokens or often-used regular expressions) are provided, the syntax with its few parser operators is quite simple, and parsers can be extended easily via inheritance. On the downside, performance may not be top-notch due to the use of backtracking while parsing.

And last but not least, Reference Attribute Grammars (RAGs) are utilized. This long-standing concept has somehow been stuck in academia (as a nice introduction, the interested reader is advised to read, e.g., this paper) but is useful when reasoning about languages. Attribute grammars extend context-free grammars with declarative equations that relate the values of attributes of grammar symbols to each other. Most attribute grammar systems translate the equations into an implementation written in a general-purpose programming language. The translation makes decisions about attribute evaluation order and storage, removing the need for manual tree traversal techniques such as visitors. Therefore, the developer is freed to focus on the language properties that are represented by the attributes. On the implementation side, Kiama is used. Kiama's attribute grammar library supports a range of RAG features including cached, uncached, higher-order, parameterized, and circular attributes. Kiama also implements techniques for dynamic extension and variation of attribute equations.

Overall, the Scala programming language is used because of its support for domain-specific notations and its emphasis on scalability. Unlike generators with specialized notation, Kiama attribute grammars use standard Scala concepts such as case classes, pattern-matching functions for equations, traits and mixins for composition, and implicit parameters for forwarding.

In the following, the basic concepts for implementing an embedded parser using parser combinators and evaluating the parsed result are shown.

A Very Simple Math Language

Simple mathematical expressions, e.g., 1 + 2 + 3 * 0, should be correctly parsed and evaluated. Kiama utilizes plain Scala case classes for building the AST and representing the meta-model in one go:

sealed abstract class Exp extends Product

case class Num(i: Double) extends Exp
case class Add(l: Exp, r: Exp) extends Exp
case class Sub(l: Exp, r: Exp) extends Exp
case class Mul(l: Exp, r: Exp) extends Exp
case class Div(l: Exp, r: Exp) extends Exp

Parsing

As a first step, parsing is now done with the following:

import org.bitbucket.inkytonik.kiama.parsing.{Error, Failure, Parsers, Success}
import org.bitbucket.inkytonik.kiama.util.{Positions, StringSource}

class MathParser extends Parsers(new Positions) {

 // The entry point. Construct an Exp out of a String
 def parse(in: String): Exp = parse(expr, StringSource(in)) match {
   case Success(e, _) => e
   case Failure(m, _) => throw new RuntimeException("Parsing failure: " + m)
   case Error(m, _) => throw new RuntimeException("Parsing error: " + m)
 }

 lazy val expr: Parser[Exp] = term ~ rep("[+-]".r ~ term) ^^ {
   case t ~ ts => ts.foldLeft(t) {
     case (t1, "+" ~ t2) => Add(t1, t2)
     case (t1, "-" ~ t2) => Sub(t1, t2)
   }
 }

 lazy val term = factor ~ rep("[*/]".r ~ factor) ^^ {
   case t ~ ts => ts.foldLeft(t) {
     case (t1, "*" ~ t2) => Mul(t1, t2)
     case (t1, "/" ~ t2) => Div(t1, t2)
   }
 }

 lazy val factor = "(" ~> expr <~ ")" | num

 lazy val num = regex("[0-9]+".r) ^^ (s => Num(s.toDouble))
}

Executable grammars are constructed by composing parsers, and these compositions are described as functional mappings. The compositional integrity of the parser types is guaranteed by implementing their classes as monads.
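To illustrate what monadic composition means here, the following is a minimal hand-rolled sketch (deliberately not Kiama's actual Parser implementation): a parser is a function from the remaining input to an optional result plus the rest of the input, and flatMap/map let such parsers be chained like any other monadic value.

// Minimal sketch of a monadic parser type, for illustration only
case class P[A](run: String => Option[(A, String)]) {
  // flatMap and map make P composable in for-comprehensions
  def flatMap[B](f: A => P[B]): P[B] =
    P(in => run(in).flatMap { case (a, rest) => f(a).run(rest) })
  def map[B](f: A => B): P[B] =
    P(in => run(in).map { case (a, rest) => (f(a), rest) })
}

// A parser for a single digit
val digit: P[Char] = P(in => in.headOption.filter(_.isDigit).map(c => (c, in.tail)))

// Sequential composition via a for-comprehension: parse two digits in a row
val twoDigits: P[(Char, Char)] = for { a <- digit; b <- digit } yield (a, b)

// twoDigits.run("42!") evaluates to Some((('4', '2'), "!"))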

With this, a combinator parser is essentially a set of production rules. They use the operators defined on the Parser type for composing parser results, as shown in the following table:

Operator   Purpose                                                         Example from above
"..."      Definition of a literal                                         "+"
"...".r    Definition of a regex                                           "[+-]".r
P <~ Q     Composition of parsers P and Q; only the result of P is kept    expr <~ ")"
P ~> Q     Composition of parsers P and Q; only the result of Q is kept    "(" ~> expr
P ~ Q      Composition of parsers P and Q; both results are kept           "[*/]".r ~ factor
P | Q      Alternative of parsers P and Q                                  "(" ~> expr <~ ")" | num
P ^^ f     Conversion of the result of parser P by applying function f     regex("[0-9]+".r) ^^ (s => Num(s.toDouble))
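Note how expr and term combine ~, rep, and ^^ with a foldLeft to build left-associative trees. A quick check (assuming the MathParser above):

val parser = new MathParser()
// Subtraction nests to the left, as expected for left-associative operators
assert(parser.parse("1-2-3") == Sub(Sub(Num(1.0), Num(2.0)), Num(3.0)))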

Evaluation

As the second step, the evaluation is now remarkably simple:

import org.bitbucket.inkytonik.kiama.attribution.Attribution

class MathExample extends Attribution {

 val value: CachedAttribute[Exp, Double] = attr {
   case Num(i) => i
   case Add(l, r) => value(l) + value(r)
   case Sub(l, r) => value(l) - value(r)
   case Mul(l, r) => value(l) * value(r)
   case Div(l, r) => value(l) / value(r)
 }

}

The higher-order function attr from the Kiama library automatically caches the evaluation. Hence, computation-intensive calculations on the constructed AST run at most once for each node. Furthermore, the developer does not need to think about the details of AST traversal, reference resolution, and forwarding. This is all done by Kiama and the clever construction of the AST from case classes (supporting all types of collections as fields) with an overlay graph.
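A minimal usage sketch (with a hand-built AST, values chosen for illustration):

val math = new MathExample()
val ast = Add(Num(1), Mul(Num(2), Num(3)))

println(math.value(ast)) // 7.0, computed and cached per node
println(math.value(ast)) // 7.0, served from the cache without recomputation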


Rewriting

Rewriting in the domain of RAGs permits model updates in terms of graph rewrites that in turn update analyses. E.g., one may want to optimize expressions such as 3 * 0 to 0 or 0 - 1 to -1 before actually evaluating them. This can be achieved with the following extension to the class MathExample shown above:

import org.bitbucket.inkytonik.kiama.attribution.Attribution
import org.bitbucket.inkytonik.kiama.rewriting.Rewriter

class MathExample extends Attribution with Rewriter {

 val value: CachedAttribute[Exp, Double] = attr {
   case Num(i) => i
   case Add(l, r) => value(l) + value(r)
   case Sub(l, r) => value(l) - value(r)
   case Mul(l, r) => value(l) * value(r)
   case Div(l, r) => value(l) / value(r)
 }

 def optimise(e: Exp): Exp = rewrite(optimiser)(e)

 lazy val optimiser = bottomup(attempt(simplifier))

 lazy val simplifier =
   rule[Exp] {
     case Add(Num(0), e) => e
     case Add(e, Num(0)) => e
     case Sub(Num(0), Num(v)) => Num(-v)
     case Sub(e, Num(0)) => e
     case Mul(Num(1), e) => e
     case Mul(e, Num(1)) => e
     case Mul(z@Num(0), _) => z
     case Mul(_, z@Num(0)) => z
     case Div(e, Num(1)) => e
     case Div(_, Num(0)) => throw new IllegalArgumentException("Division by 0!")
    }

}

Different rewriting and traversal strategies (e.g., bottomup, attempt) can be utilized. This abstracts away the need to manually traverse the DSL-specific AST.
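As a sketch (assuming Kiama's standard strategy combinators innermost and topdown are in scope via the Rewriter trait), the same simplifier can be driven by other strategies:

// Apply the simplifier repeatedly, innermost-first, until a fixed point is reached
lazy val exhaustiveOptimiser = innermost(simplifier)

// Or traverse the tree top-down instead of bottom-up
lazy val topDownOptimiser = topdown(attempt(simplifier))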

But does it blend?

Set up a small Scala project with SBT and add the required library:

libraryDependencies += "org.bitbucket.inkytonik.kiama" % "kiama_2.12" % "2.2.0"

Then, make it run:

val someMath = new MathExample()
val resultFunc: Exp => Double = someMath.value

val ast: Exp = new MathParser().parse("1+2+3*0")
println("AST: " + ast)

var result = resultFunc(ast)
println("Result before optimization: " + result)

val optimizedAst: Exp = someMath.optimise(ast)
println("Optimized AST: " + optimizedAst)

result = resultFunc(optimizedAst)
println("Result after optimization: " + result)

assert(3 == result)

Exercise 1:

Remove Kiama and all library calls to it for even more pureness. What is required to retain caching of attribute evaluation and rewriting? What classic design patterns could be used, and what are their drawbacks?

TL;DR

Scala makes it easy to build embedded Domain Specific Languages. Reference Attribute Grammars are a handy tool to reason about the resulting model. Monad composition is awesome.

A more Real-world-ish Example

A while ago, during the application process for a German department store chain, I was given the task of writing a small log analyzer in Scala. It should consume the logs and produce statistics. A success log entry has the following structure:

HH:mm:ss:SS LEVEL - fetch http://www.server/endpoint.json took: timeInMilliseconds ms

A failure log entry has the following structure:

HH:mm:ss:SS LEVEL - error while fetching http://www.server/endpoint.json after timeInMilliseconds ms
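For illustration, two made-up entries matching these patterns:

10:15:30:12 INFO - fetch http://srv1.host.de/products.json took: 217 ms
10:15:31:04 WARN - error while fetching http://srv1.host.de/teaser.json after 210 ms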

The output should look like this:

http://srv1.host.de: avg: 215ms 478x success: 79%
  endpoints:
    products avg: 217ms count: 168x success: 81% (32 of 168 failed)
    teaser avg: 210ms count: 154x success: 76% (37 of 154 failed)
    profile avg: 220ms count: 156x success: 81% (31 of 156 failed)

http://srv2.host.de:
...

I did not want to go with plain regular expressions and nested evaluation functions for the statistics, as this is neither readable, extensible, nor testable.

Hence, I came up with the following meta-model and AST:

object LogTree {

  case class Log(entries: List[LogEntry]) extends Product

  sealed abstract class LogEntry extends Product {
    val level: String
    val msg: String
    val date: String
    val server: String
    val endpoint: String
    val took: String

    val url: String = s"http://$server/$endpoint.json"

    override def toString: String = s"$date $level - $msg"
  }

  case class Info(date: String, server: String, endpoint: String, took: String) extends LogEntry {
    val level: String = "INFO"
    val msg: String = s"fetch $url took: $took ms"
  }

  case class Warn(date: String, server: String, endpoint: String, took: String) extends LogEntry {
    val level: String = "WARN"
    val msg: String = s"error while fetching $url after $took ms"
  }

}


Parsing

import java.net.URL
import java.text.SimpleDateFormat

import org.bitbucket.inkytonik.kiama.parsing.{Error, Failure, Parsers, Success}
import org.bitbucket.inkytonik.kiama.util.{Positions, StringSource}

class LogParser extends Parsers(new Positions) {

  private val Info = "INFO"
  private val Warn = "WARN"
  private val Separator = "-"
  private val Fetch = "fetch"
  private val Took = "took:"
  private val ErrorWhileFetching = "error while fetching"
  private val After = "after"
  private val DateFormat = "HH:mm:ss:SS"
  private val Json = ".json"
  private val Url = "[a-zA-Z0-9:/\\.]*\\.json"
  private val Ms = "ms"
  private val Date = "[a-zA-Z0-9:]*"
  private val Number = "[0-9]*"

  import LogTree._

  def parseAll(in: List[String]): Log = Log(in.map(parse))

  private def parse(in: String): LogEntry = parse(expr, StringSource(in)) match {
    case Success(s, _) => s
    case Failure(f, _) => throw new RuntimeException("Parsing failure: " + f)
    case Error(e, _) => throw new RuntimeException("Parsing error: " + e)
  }

  private lazy val expr: Parser[LogEntry] = date ~ level ~ separator ~ msg ^^ {
    case d ~ Info ~ Separator ~ m => new Info(d, m._1, m._2, m._3)
    case d ~ Warn ~ Separator ~ m => new Warn(d, m._1, m._2, m._3)
  }

  private lazy val separator: Parser[String] = whitespace ~> Separator <~ whitespace

  private lazy val level: Parser[String] = whitespace ~> (Warn | Info)

  private lazy val msg: Parser[(String, String, String)] = (fetchR | errR) ~ url ~ (tookR | afterR) ~ took ^^ {
    case Fetch ~ u ~ Took ~ t => (u._1, u._2, t)
    case ErrorWhileFetching ~ u ~ After ~ t => (u._1, u._2, t)
  }

  private lazy val fetchR: Parser[String] = Fetch <~ whitespace
  private lazy val errR: Parser[String] = ErrorWhileFetching <~ whitespace
  private lazy val tookR: Parser[String] = whitespace ~> Took <~ whitespace
  private lazy val afterR: Parser[String] = whitespace ~> After <~ whitespace

  private lazy val took: Parser[String] = Number.r <~ whitespace ~ Ms

  private lazy val url: Parser[(String, String)] =
    Url.r ^^ (u => {
      val parsedUrl = new URL(u)
      val file = parsedUrl.getFile
      (parsedUrl.getHost, file.substring(1, file.lastIndexOf(Json)))
    })

  private lazy val formatter = new SimpleDateFormat(DateFormat)

  private lazy val date: Parser[String] = Date.r ^^ (s => formatter.format(formatter.parse(s)))
}

As you may have noticed, writing parser combinators is definitely not the holy grail once your grammar gets more complex. Here, some nasty regular expressions are still around, and handling dates manually is not the task you wish for before having your first (three) coffee(s).

Evaluation

On the other hand, the evaluation is now quite simple due to the strict decomposition into smaller units of work. This results in better testability and extensibility (a usage sketch follows the listing below).

import LogTree.{Info, Log, LogEntry, Warn}
import org.bitbucket.inkytonik.kiama.attribution.Attribution

class LogTreeAnalyzer(log: Log) extends Attribution {

  def printAllStatistics(): Unit = {
    uniqueServers(log).foreach(s => {
      val serverName = s
      val avg = avgForServer(serverName)(log)
      val numEntries = numEntriesForServer(serverName)(log)
      val succ = succRateForServer(serverName)(log)

      println(s"$s: avg: ${avg}ms ${numEntries}x success: $succ%")
      println("\tendpoints:")

      val uniqueEndpoints = uniqueEndpointsForServer(serverName)(log)
      uniqueEndpoints.foreach(ep => {
        val endpointName = ep
        val avgEP = avgForEndpoint((serverName, endpointName))(log)
        val numEntriesEP = numEntriesForEndpoint((serverName, endpointName))(log)
        val succEP = succRateForEndpoint((serverName, endpointName))(log)
        val numFailedEP = numFailuresForEndpoint((serverName, endpointName))(log)

        println(s"\t\t$endpointName avg: ${avgEP}ms count: ${numEntriesEP}x success: $succEP% ($numFailedEP of $numEntriesEP failed)")
      })
    })
  }

  val uniqueServers: (Log) => List[String] = attr {
    l => l.entries.foldLeft(Nil: List[String]) { (acc, next) => if (acc contains next.server) acc else next.server :: acc }
  }

  val entriesForServer: (String) => (Log) => List[LogEntry] = paramAttr {
    server => {
      l => l.entries.filter(_.server == server)
    }
  }

  val avgForServer: (String) => (Log) => Int = paramAttr {
    server => {
      l => {
        val entries = entriesForServer(server)(l)
        entries.map(_.took.toInt).sum / entries.length
      }
    }
  }

  val numEntriesForServer: (String) => (Log) => Int = paramAttr {
    server => {
      l => entriesForServer(server)(l).size
    }
  }

  val succRateForServer: (String) => (Log) => Int = paramAttr {
    server => {
      l => {
        val totalEntries = numEntriesForServer(server)(l)
        val successfulEntries = entriesForServer(server)(l).count {
          case _: Info => true
          case _: Warn => false
        }
        val rate = successfulEntries * 100f / totalEntries
        Math.round(rate)
      }
    }
  }

  val uniqueEndpointsForServer: (String) => (Log) => List[String] = paramAttr {
    server => {
      l => {
        val entries = entriesForServer(server)(l)
        entries.foldLeft(Nil: List[String]) { (acc, next) => if (acc contains next.endpoint) acc else next.endpoint :: acc }
      }
    }
  }

  val avgForEndpoint: ((String, String)) => (Log) => Int = paramAttr {
    addr => {
      l => {
        val server = addr._1
        val endpoint = addr._2
        val entries = entriesForServer(server)(l).filter(_.endpoint == endpoint)
        entries.map(_.took.toInt).sum / entries.length
      }
    }
  }

  val numEntriesForEndpoint: ((String, String)) => (Log) => Int = paramAttr {
    addr => {
      l => {
        val server = addr._1
        val endpoint = addr._2
        entriesForServer(server)(l).count(_.endpoint == endpoint)
      }
    }
  }

  val numFailuresForEndpoint: ((String, String)) => (Log) => Int = paramAttr {
    addr => {
      l => {
        val server = addr._1
        val endpoint = addr._2
        entriesForServer(server)(l).count {
          case i: Warn if i.endpoint == endpoint => true
          case _ => false
        }
      }
    }
  }

  val succRateForEndpoint: ((String, String)) => (Log) => Int = paramAttr {
    addr => {
      l => {
        val server = addr._1
        val endpoint = addr._2
        val successfulEntries = entriesForServer(server)(l).count {
          case i: Info if i.endpoint == endpoint => true
          case _ => false
        }
        val totalEntries = numEntriesForEndpoint((server, endpoint))(l)
        val rate = successfulEntries * 100f / totalEntries
        Math.round(rate)
      }
    }
  }
}
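
To wire everything together, a minimal usage sketch (the log lines are made up for illustration and follow the formats above):

val lines = List(
  "10:15:30:12 INFO - fetch http://srv1.host.de/products.json took: 217 ms",
  "10:15:31:04 WARN - error while fetching http://srv1.host.de/teaser.json after 210 ms"
)

val log = new LogParser().parseAll(lines)
val analyzer = new LogTreeAnalyzer(log)

// Each attribute can also be queried (and unit-tested) in isolation:
println(analyzer.uniqueServers(log))                 // List(srv1.host.de)
println(analyzer.avgForServer("srv1.host.de")(log))  // average response time in ms

analyzer.printAllStatistics()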

Exercise 2:

Implement such a log analyzer with Xtext. Implement such a log analyzer with MPS. How could the evaluation / printing of statistics be implemented in these tools? Discuss the benefits and drawbacks of having a full-fledged meta-model around (e.g., within Xtext as Ecore model).

Summary or What did we learn?

Dealing with parsing and the semantic analysis of languages requires the language engineer to carefully craft compilers, interpreters, code generators, refactoring tools, and static code analyzers. Language workbenches such as Xtext or MPS help you with these tasks by offering the generation of important ingredients (parsers, ASTs, editors, code generators).

If you do not have the time to learn these tools in depth or are technically restricted in some way (e.g., limited technical resources), embedded parsing and evaluation may be applicable to your embedded Domain Specific Language.

This article showcased three powerful concepts to do so:

  1. The embedding into a statically typed host language (Scala) for high interoperability, with no need for a separate pre-processor or language extension, as Scala greatly supports domain-specific notations.
  2. Parser combinators allowing for a compact language-integrated scanning and parsing.
  3. Reference Attribute Grammars with declarative equations that relate the values of attributes of grammar symbols to each other.

Feel free to contact me or itemis for more in-depth information about Xtext and language engineering in general. And, of course, I am happy to see your solutions to the exercises given above.
