
Monday, November 26, 2012

Cool Monday: Functional compilers and atoms

I watched this great talk by Daniel Spiewak on Functional Compilers. He talks about lexical and semantic analysis in particular.
First, the problems with traditional lexing with a scanner: you can only have regular tokens, or you have to do some dirty hacking and backtrack the scanner, therefore losing linearity. You can solve this with scannerless parsing - putting the regular expressions into your grammar. In fact this approach seems simpler to me, as the only proper parser I've done works this way. But this is not the interesting part.

Semantic analysis

This is where the fun kicks in. After you parse the source code into an AST you need to do a bunch of operations on it: naming, typing (even optimization in later phases). If I want to stay functional (which I usually do), my instinct tells me to do a recursive traversal and rewrite the tree. And that's exactly what my language does. But there is one huge problem: an AST is not a tree. It's a huge misnomer. The AST is just a spanning tree of the program graph. See, when you add stuff like let expressions, or types (what I'm doing currently), you get problems:
let a = 1
in f a
There are edges between the siblings. Or going back up. Or skipping levels. Definitely not trees. These edges may be implicit, but you still have to store the information. The traditional solution is to compute look-up tables (maps) and carry them along with the tree. So the AST remains a tree, but the additional stuff implicitly makes it into a graph. The problem is that this gets nasty when you carry along a lot of information and you have to be careful with your updates.
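Sketched with hypothetical node types, the idea is that the extra edges live in maps beside the tree:
sealed trait Expression
case class Let(name: String, value: Expression, body: Expression) extends Expression
case class Ref(name: String) extends Expression

// the tree stays a tree; the sibling/upward edges are encoded in a
// look-up table computed by the naming phase and carried along
// (every later phase adds yet another map, which is where it gets nasty)
case class Analyzed(tree: Expression, bindings: Map[Ref, Let])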
There is one more solution: vars. Works like a charm. Except that vars are terrible to reason about - quite the opposite of functional. But there exists a fix.

Atoms

Think write-once vars. But not quite. The idea is to have containers that can be written to but are only ever seen in one state. The problem with vars is that they can be seen in multiple states, and you have to keep track of those states. Vals solve this by not letting you mutate state at all. And lazy vals provide machinery to delay initialization (great for solving circular dependencies). But they don't let you escape the scope, or deal with a situation where you need to initialize them when you have the data, not when you need to read them. And that is exactly the situation in a compiler: you compute data coming from outside the scope and you need to store it, and some time later you need to read it. That's where you use atoms. First some code, then the explanation.
class Atom[A] {
  private var value: A = _
  private var isSet = false
  private var isForced = false

  // override this to compute the value on demand if nobody wrote it first
  protected def populate() {
    sys.error("cannot self-populate atom")
  }

  // writes are allowed (later ones win) until the atom has been set and read
  def update(a: A) {
    if (!isSet || !isForced) {
      value = a
      isSet = true
    }
  }

  def apply(): A = {
    isForced = true

    if (isSet) {
      value
    } else {
      populate() // last chance: populate is expected to call update

      if (!isSet) {
        sys.error("value not set")
      }
      value
    }
  }
}
Here is the workflow: you create an atom, and you can write to it (update) - writes are idempotent, and you can do many successive writes as long as you don't read the value. Once it's written to, the isSet flag is set and the atom can be read (the apply method), which sets the isForced flag. If the atom isn't set when you try to read it, it will try to populate itself. The populate method is intended to be overridden and may contain data in its closure or even perform some side effects - you can safely assume it will only execute once. And if everything fails and the atom still isn't set, you get an error. Yay, no bothering with nulls any more.
You can quickly see how atoms are great containers for storing computed information in the AST for passing it on to the later stages.
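For illustration, a hypothetical typer phase storing its result on an AST node (the node and the type names are made up):
// the typer writes the type once computed; later phases just read it
case class TypedIdent(name: String) {
  val tpe = new Atom[String] {
    override def populate() { update("Unknown") } // fallback, runs at most once
  }
}

val id = TypedIdent("x")
id.tpe() = "Int"     // the typer stores its result (sugar for id.tpe.update("Int"))
println(id.tpe())    // a later phase reads it (apply); prints Int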

Monday, October 8, 2012

Making a programming language: Part 7b - using objects

Table of contents, Whole project on github

Something like EPIC FAIL occurred to me and I published a post containing only half the content I intended to write. So I'm doing a part b.

My intended usage of objects is something along the lines of
objectName.someProperty
objectName.someFunction()
someFunction().someProperty
someObject.someProperty.someFunction().someProperty.someFunction
Explanation

  1. getting a value from an object
  2. invoking a function contained in an object
  3. getting a value from returned object of the invoked function
  4. a bit contrived example. Invoking a function contained inside a property(object) of an object and then getting a function value from a property of the returned value from the first function. That's a mouthful, just read the damn code instead

Dot access

So everything is based on those little dots. My first thought was something like "you just do expr <- expr | expr.expr". This is just wrong. At least I should have reversed the order, as this order leads to infinite left recursion; then I might have gotten away with it. Then I realized I only need dots after function calls and simple identifiers. Design choice (if you think it's a bad one, leave a comment). Notice the "simple identifier". That's what I did: I renamed identifier to simple identifier and put something that handles dots under the name identifier. And then fixed everything.
case class DotAccess(lst: List[Expression]) extends Expression

private def identifier: Parser[DotAccess] = 
  rep1sep((functionCall | simpleIdentifier), ".") ^^ DotAccess.apply
That's about it. At least for parsing. Now the fun begins.

Nesting scopes

Scopes were designed with nesting in mind. This is a double-edged sword. See, the "privates" can be done if you rely on not being able to access the parent scope. If dot access exposes full addressing functionality, a powerful feature ceases to exist. So some protection should be in place. Something like a strict get:
class SScope ...
 def getStrict(key: String): Option[Any] = map.get(key)
...
And I also added an unlinked view to it, just to ease usage. This is simply a method that returns a new SScope with no parent, overriding the getters and put to use the map available in the closure.
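Roughly like this - a sketch of SScope, assuming a mutable map and an optional parent (the real one is on github):
class SScope(parent: Option[SScope] = None) {
  private val map = scala.collection.mutable.Map[String, Any]()

  def get(key: String): Option[Any] =
    map.get(key) orElse (parent flatMap (_.get(key)))    // walks up the chain

  def getStrict(key: String): Option[Any] = map.get(key) // this scope only

  def put(key: String, value: Any) { map.put(key, value) }

  // same bindings, but no parent: dot access can't climb out of the object
  def unlinked: SScope = {
    val outer = this
    new SScope(None) {
      override def get(key: String) = outer.getStrict(key)
      override def put(key: String, value: Any) { outer.put(key, value) }
    }
  }
}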
So now I can walk down the list in DotAccess recursively and explicitly override the implicit scope parameter. And everything automagically works. Well, not quite. If you have a function call, the arguments need to be evaluated in the top scope, not in the nested one like the function identifier. At first I didn't even think about this, and only failing attempts at more complex recursion brought up this quite obvious bug.
So how to solve this? I could pre-evaluate all arguments, but I use recursion to do this and it's at least two levels deeper than where the dots happen. So no go. I need to carry the outer scope along. I overloaded the apply method from Evaluator so other code can still function (tests ftw!) and all in all it looks like this:
def apply(e: List[Expression])(implicit scope: SScope): Any = {
  (e map apply).lastOption match {
    case Some(a) => a
    case None => ()
  }
}

def apply(e: Expression)(implicit scope: SScope): Any = apply(e, None)(scope)

def apply(e: Expression, auxScope: Option[SScope])(implicit scope: SScope): Any = e match {
  ...
  case DotAccess(list) => {
    val outerScope = scope
    def step(list: List[Expression])(implicit scope: SScope): Any = list match {
      case Nil =>
        throw new ScratInvalidTokenError("got empty list in DotAccess")
      case elem :: Nil => apply(elem, Some(scope))(outerScope)
      case head :: tail => apply(head) match {
        case s: SScope => step(tail)(s.unlinked)
        case other =>
          throw new ScratInvalidTypeError("expected scope, got " + other)
      }
    }
    step(list)
  }
}
So an optional aux scope is the answer. It doesn't seem pretty to me, but it does the job. 



Thursday, September 27, 2012

Making a programming language: Part 8 - going faster

Table of contents, Whole project on github

First of all, I wrote some tests for scrat. That was a bit challenging to get started: how do you test a language? I decided to write a bunch of programs that exercise and combine different language features, and then stare into the code until I was absolutely sure they were correct. That's the problem with implementing a new language - nobody can tell you if your code is correct except your software, and you don't even know if the software itself is correct.
So I wrote some test programs and parsed and evaluated them in my mind to produce the final results. Then I wrote a class using ScalaTest that generates tests by iterating over an array of these tuples. Quite cool - oh, and I also included descriptions in the tuples, so I get nice output. I pondered this idea quite some time ago but finally implemented it a bit after functions and objects.
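Something along these lines - a sketch, where the tuples and the run helper are hypothetical:
import org.scalatest.FunSuite

class ScratProgramSpec extends FunSuite {
  // (description, program source, result computed by hand)
  val programs = List(
    ("addition and precedence", "1+2*3", 7.0),
    ("if expression", "if 1 then \"yes\" else \"no\"", "yes")
  )

  // one generated test per tuple, with the description as the test name
  for ((desc, src, expected) <- programs) {
    test(desc) {
      assert(run(src) == expected)
    }
  }

  private def run(src: String): Any =
    sys.error("parse + evaluate via ScratRuntime here")
}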

So now I can do crazy refactorings while still maintaining the language - I trust my tests! But first, some benchmarks. I decided to measure the execution time of parsing and evaluation on the linked-list constructor. To my surprise, parsing was quite slow. It started at about 800ms and dropped down to 200ms after a few executions (first-time object creation and JIT, I suppose). 200ms for 67 lines of code? That would mean about 3 seconds for a 1000-line file IF complexity is linear (which I later learned it isn't!). And that's 3 seconds on the fifth run or so; the first run (which is the only one when doing real stuff) would be 10+ seconds. Unacceptable. (Evaluation takes a few ms.)

Research

At first I just gave it some thought. Well, it's a recursive descent parser, and from what I remember it backtracks on failure. So efficiency has a lot to do with grammar structure. And it won't be linear, because you (usually) have to do more backtracking when dealing with longer input.
Internets here I come. 
My thoughts were confirmed. I found some guys on forums complaining about speed, and then Mr. Odersky himself commented something like this (from memory):
Well, the parsers in the scala standard library are more like an example of how to do recursive descent, functional style. They are perfectly usable for parsing command lines, but not for long files. You should use a parser generator for that.
I was bummed. The reason I didn't use a parser generator was the close integration that a parser as a library could provide. By the way, the implementation of Parsers is remarkably short, ~800 lines, but most of them are comments. It has quite some problems, though. A lot of object creation - every time a parsing function is invoked, many objects are created. Each parser is an object and each combinator is an object too. This in itself is not a big deal, but no memoization is performed, so it becomes a big deal: many objects are created on each try, so when backtracking you have to GC all these objects and then recreate them. It's clean, but it's no wonder it's slow.
Some time passed by and I accidentally found out about Packrat Parsing. This paper provides details but the gist of it is to use lazy evaluation and memoization to reduce object creation and speed things up.

Conversion and results

Conversion is dead easy. It's fully described in the scala api documentation. Basically you mix in PackratParsers and change def ...: Parser[...] = ... to lazy val ...: PackratParser[...] = ... and that's it. The mixed-in trait provides the necessary implicit conversions, lazy makes sure the creation is only done once, and the new implementation of parseAll does some clever parsing. Oh, and you needn't convert all parsers - the paper says optimal performance is achieved with the right mix of standard recursive descent parsers and packrat ones (packrat does add some overhead on specific grammars and inputs). But I just converted them all and ran the benchmark again. And behold... 11ms. On the best run; more like 15 on average. But that's still more than a whole order of magnitude faster. And it should scale better.
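A converted rule looks something like this - a sketch on a toy grammar, not scrat's actual one:
import scala.util.parsing.combinator.{PackratParsers, RegexParsers}

object PackratDemo extends RegexParsers with PackratParsers {
  // was: def sum: Parser[Double] = ... - now memoized and created only once
  lazy val sum: PackratParser[Double] =
    sum ~ ("+" ~> number) ^^ { case a ~ b => a + b } | number // packrat even handles left recursion

  lazy val number: PackratParser[Double] = """\d+\.?\d*""".r ^^ (_.toDouble)

  def run(s: String) = parseAll(sum, s) // the overridden phrase/parseAll does the clever parsing
}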

Next: I don't know yet. I've caught up with my implementation (finally!). I'm thinking about making functions stronger by relaxing the rules of invocation and adding some syntax sugar for lambdas. That, and java interop. Or possibly compiling to bytecode.

Saturday, September 1, 2012

Making a programming language: Part 5 - variables and decisions

Table of contents, Whole project on github

In Part 4 I managed to create a Hello World. What's the next program after this in every programming tutorial? A program that asks your name and greets you. Greeter perhaps?

Reading from standard input is pretty trivial - just wrapping up the readLine function from scala; see the previous post on how this is done. I called this function readln.

Variables

I could cheat a bit and write something like this
println("who are you?")
println("hello", readln())
since I don't really have to store the name. This works but I also want to make a program that responds differently to different input, e.g. simplified authentication. So I want to store the input. Something like
print("enter passcode")
input = readln()
So I first create a case class to represent this
case class Assignment(to: Identifier, from: Expression) extends Expression
parsing isn't that hard either
private def assignment: Parser[Assignment] = identifier ~ "=" ~ expr ^^ {
  case id ~ "=" ~ exp => Assignment(id, exp)
}

private def expr: Parser[Expression] = sum ||| assignment
Notice again my abuse of ||| where reversed order would have sufficed; don't do that.

Evaluation... here I had to make a choice: where to put the assigned variables? In the same bucket as constants. Global variables ARE evil, but I don't care - I'm having fun, so I did it anyway. Don't worry, this changes when I introduce function definitions and objects.

So StdLib became ScratRuntime and has a mutable map for storing values of identifiers. Full source here if you're interested. Of course some changes had to be made because of this refactoring. Full commit diff here.

Evaluation of assignment is now a simple map put:
case Assignment(name, exp) => {
  val e = apply(exp)
  runtime.identifiers.put(name.id, e)
  e
}
The expression returns the assigned value, so you can do a=b=1 (in fact I haven't tried that yet).

What if

If expressions. I wasn't in the mood for parentheses or brackets, so I went ruby style. Except my expressions are single-line only (for now), so I had to put everything in one line. I needed a separator between the predicate and the positive value, and as an added bonus I didn't need the "end". So not much like ruby anymore, but it was good inspiration.
I considered booleans, but ultimately I can still implement everything in doubles (too much work to make a whole type system of primitives when you can have just one type). Zero is False and everything else is True. So AND becomes * and OR becomes -. I'll add some equality so I can compare strings too.
I didn't need inequalities until now (I didn't even think about them before writing this post), so they aren't in the language yet.
Sample if expression
if input1*input2 then "okay" else "nooooooo"
See, no parens. And I was inspired by scala to make the else part mandatory.

So how do I implement this? The case class is quite trivial: it has three expressions - predicate, true value and false value. Evaluation just evaluates the predicate (recursion!!) and then evaluates and returns the appropriate expression:
case IfThenElse(pred, then, els) => apply(pred) match {
  case d: Double => if (d != 0) apply(then) else apply(els)
  case other => throw new ScratInvalidTypeError("expected a number, got "
                                                   + other)
}
That error is just a class that extends Exception.
I added some more case classes and evaluation cases for Equals and NotEquals. They're very simple, so I won't include them here (diff on github).
Parsing, on the other hand, is more interesting. Here's the changed part:
private def ifThenElse: Parser[IfThenElse] =
  "if" ~ expr ~ "then" ~ expr ~ "else" ~ expr ^^ {
  case "if" ~ predicate ~ "then" ~ then ~ "else" ~ els => IfThenElse(predicate, then, els)
}

private def equality: Parser[Expression] = 
  noEqExpr ~ rep(("==" | "!=") ~ noEqExpr) ^^ {
  case head ~ tail => {
    var tree: Expression = head
    tail.foreach {
      case "==" ~ e => tree = Equals(tree, e)
      case "!=" ~ e => tree = NotEquals(tree, e)
    }
    tree
  }
}

private def noEqExpr: Parser[Expression] = sum ||| assignment ||| ifThenElse

private def expr = noEqExpr ||| equality
I finally got around to understanding grammars a bit more. To make the grammar unambiguous and not left-recursive, you stack layers upon layers. Like an onion. And you use these layers to separate levels of precedence - the most loosely binding operations being the upper layer (expr). This gets rid of ambiguity. And you can also use these layers to deal with recursion: something in a deeper layer can contain an item from a higher layer - thus recursion - but it must first match something, to avoid infinite recursion. And that's it. This is very well explained in the compilers course on Coursera, and I thought I understood the (very abstract) explanation, but the evidence says I did not until I had some hands-on experience.
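The onion, sketched on a minimal self-contained grammar (just numbers, + and *, with parens - not scrat's actual parser):
import scala.util.parsing.combinator.RegexParsers

object Layers extends RegexParsers {
  def expr: Parser[Double] = term ~ rep("+" ~> term) ^^ {     // upper layer: loosest binding
    case h ~ t => t.foldLeft(h)(_ + _)
  }
  def term: Parser[Double] = factor ~ rep("*" ~> factor) ^^ { // deeper layer: binds tighter
    case h ~ t => t.foldLeft(h)(_ * _)
  }
  def factor: Parser[Double] =                                // deepest layer
    """\d+\.?\d*""".r ^^ (_.toDouble) |
    "(" ~> expr <~ ")" // recursion back to the top, but only after consuming "("
}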

A quick sample in scrat to finish it off:
print("enter passcode")
pass = "password"
state = if readln()==pass then "granted" else "denied"
println("access", state)


Friday, August 31, 2012

Making a programming language: Part 3 - adding features

Table of contents, Whole project on github

So now I have a repl that can evaluate stuff like
(2+3)*(7/2-1)
Not much of a programming language - more like a calculator, and not even a good one. Let's add some features!

Constants

Like pi, e and such. I have to change the grammar to match identifiers too.
Now I have
private def factor: Parser[Expression] = number | ("(" ~> expr <~ ")")

And I change that to
private def factor: Parser[Expression] = value | parenExpr 

private def value: Parser[Expression] = number | identifier

private def identifier: Parser[Identifier] = "[a-zA-Z]\\w*".r ^^ {
  s => Identifier(s)
}
and I also added a new token type
case class Identifier(id: String) extends Expression

and enabled this in the evaluator:
case Identifier(name) => {
  StdLib(name) match {
    case Some(v: Double) => v
    case _ => throw new SemanticError(name + " not found")
  }
}
StdLib is just a map object. I went the python way - variables (and constants) are entries in a global dictionary. Just a decision I made while implementing this. As I said, I don't have a plan, I don't know what I'm doing and I don't know how stuff is done. How hard can it be?! (Top Gear, anyone?)

Exponentiation

Another math feature. It's more of a test for myself, to see if I understand grammars - especially how to make a grammar not left-recursive, because apparently such grammars don't work with RDP. It turned out I don't understand grammars.
private def exponent: Parser[Expression] = (value | parenExpr) ~ "^" ~ (value | parenExpr) ^^ {
  case a ~ "^" ~ b => Exponent(a,b)
}

private def factor: Parser[Expression] = (value ||| exponent) | parenExpr

The ||| operator is an ugly hack. It tries both sides and returns the longer match. By the time I was writing this I didn't know that order is important. If I had just written exponent | value it would have worked, because exponent would match a value anyway and then fail on the missing ^.
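The fix, sketched (parenExpr as defined above):
private def factor: Parser[Expression] = (exponent | value) | parenExpr
// exponent is tried first; when the "^" never arrives, the alternation
// backtracks and value gets its chance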

The token and the evaluation (it uses math.pow) for this are quite trivial.

Function calls

case class ExpList(lst: List[Expression]) extends Expression
case class FunctionCall(name: Identifier, args: ExpList) extends Expression
Simple: a function call is a name and a list of expressions to evaluate for arguments (wrapped, because even an expression list is an expression):

Parser:
private def arglist: Parser[ExpList] = "(" ~> list <~ ")"

private def functionCall: Parser[FunctionCall] = identifier ~ arglist ^^ {
  case id ~ args => FunctionCall(id, args)
}

private def value: Parser[Expression] = number | (identifier ||| functionCall)
Again I was having trouble - the parser just didn't work - and resorted to |||. functionCall should come before identifier.

Evaluating this is more interesting. I decided to make functions values too, for obvious reasons -> higher order functions (I'm into functional programming, remember?). So function values must be stored in the same "namespace". StdLib (the only "namespace") had to become of type Map[String, Any]. I will have to do pattern matching anyway, since this will be a dynamically-typed language. (Yes, this is a plan; I think it's easier to implement. Static typing ftw, but that's the next project.) And I needed a type for function values to pattern match on - I went with Any=>Any, sending in List(arg0, arg1, ...) and doing more pattern matching inside the function. Will be slow but hey... dynamic typing!

from evaluator
case FunctionCall(name, args) => {
  StdLib(name.id) match {
    case Some(f: FunctionVarArg) => f.apply(apply(args))
    case None => throw new ScratSemanticError("function " + name + " not found")
  }
}

and an example function in StdLib
type FunctionVarArg = Any => Any

lazy val ln: FunctionVarArg = {
  case (d: Double) :: Nil => math.log(d)
  case other => throw new ScratInvalidTypeError("expected single double but got " + other)
}

Conclusion

As clearly illustrated above, not planning your grammar results in constant changes in many places. So if you're doing something serious, just make the whole fricking grammar on a whiteboard beforehand. Seriously.

Anyway... now I still only have a calculator, but a much more powerful one. I can write expressions like
e^pi
ln(10+2)
1+2*3/4^5-log(e)
But that's not nearly enough. I want to be Turing-complete and ideally to be able to compile/interpret itself.


Thursday, August 30, 2012

Making a programming language: Part 2 - something that kinda works

Table of contents, Whole project on github, relevant version on github

In Part 1 I posted a working repl (read-eval-print-loop) for simple math expressions, but I kinda cheated and only explained how I built the AST.

AST elements

Just scala case classes
sealed trait Expression
case class Number(n: Double) extends Expression
case class Add(left: Expression, right: Expression) extends Expression
case class Subtract(left: Expression, right: Expression) extends Expression
case class Multiply(left: Expression, right: Expression) extends Expression
case class Divide(left: Expression, right: Expression) extends Expression

Parser combinators revisited

I use the power of the scala library to cheat a bit and do lexing and parsing in one step.

Basic parser combinators from the scala api documentation - everything you need to define the productions in your grammar:
p1 ~ p2 // sequencing: must match p1 followed by p2
p1 | p2 // alternation: must match either p1 or p2, with preference given to p1
p1.?    // optionality: may match p1 or not
p1.*    // repetition: matches any number of repetitions of p1

However, to transform the matched string to an AST you need something more
private def number: Parser[Expression] = """\d+\.?\d*""".r ^^ {
  s => Number(s.toDouble)
}

Firstly, in the RegexParsers class there is an implicit conversion from Regex to Parser. So I could write
private def number: Parser[String] = """\d+\.?\d*""".r

Notice the type annotation. The inferred type would be Regex - since this function is private, the implicit conversion would still kick in at the use sites, but I'd rather have all parsers be of type Parser[_].

The ^^ part is an action combinator - a map function. But as ^^ is only available on Parser instances, my regex has already been implicitly converted. So in my lambda I already know (scala can infer) the type of s to be String.

One last example
private def term: Parser[Expression] = factor ~ rep(("*" | "/") ~ factor) ^^ {
  case head ~ tail => {
    var tree: Expression = head
    tail.foreach {
      case "*" ~ e => tree = Multiply(tree, e)
      case "/" ~ e => tree = Divide(tree, e)
    }
    tree
  }
}

The rep function, also from the Parsers class, matches any number of repetitions (including 0). Here's the type signature
def rep[T](p: ⇒ Parser[T]): Parser[List[T]]

The catch here is that ~ returns a single parser that matches both sides, but fortunately it can be pattern matched to extract both sides. And I can even use meaningful names, since I am in fact matching a head with an optional tail.
Inside the case statement I used a more imperative style to build the tree - nothing fancy here. Folding was a bit awkward in this case for me (maybe I'm just incompetent), so I went with a foreach loop.
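For comparison, here's a foldLeft version of the same construction (a sketch using the same case classes):
private def term: Parser[Expression] = factor ~ rep(("*" | "/") ~ factor) ^^ {
  case head ~ tail =>
    // the running tree is the accumulator; each "op ~ operand" wraps it
    tail.foldLeft(head) {
      case (tree, "*" ~ e) => Multiply(tree, e)
      case (tree, "/" ~ e) => Divide(tree, e)
    }
}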

Apply the same pattern to +/- part and you have yourself a tree.

Oh, yeah...and the top parser function. A bit changed from last time, to yield useful error messages
def apply(s: String): List[Expression] = parseAll(expr, s) match {
    case Success(tree, _) => Right(tree)
    case NoSuccess(msg, _) => Left(msg)
}

Evaluator

Now for what I promised in the previous post - evaluation. At first I planned on compiling the code for the JVM, but I just wanted to see some results first, so I decided to do a simple interpreter - no compiling whatsoever, for now.

My first approach was to modify the Expression
sealed trait Expression{
    def eval(): Double  
}

and implement this on all case classes, hardcoding the double type and coupling the AST representation and evaluation together. Yikes. Granted, it worked, but what an ugly way to do it.

So I did a hard reset (always use git, folks! or something similar) and went about doing a standalone evaluator. Since scala's pattern matching skills are awesome and I'm already using case classes, why not just do that.
object Evaluator {

  import Tokens._

  def apply(e: Expression): Double = e match {
    case Number(n) => n
    case Add(l, r) => apply(l) + apply(r)
    case Subtract(l, r) => apply(l) - apply(r)
    case Multiply(l, r) => apply(l) * apply(r)
    case Divide(l, r) => apply(l) / apply(r)
  }
}

This is all the code. Just pattern matching and recursion. But yes, still a hardcoded double as the data type. Looking back, not a great decision... but hey, I got something working and this is fun.


next time: Adding features (constants, exponents, function calls)

Wednesday, August 29, 2012

Making a programming language: Part 1 - how to start

Table of contents

Source: github

Source version for this post


Lately I've gained some interest in programming languages and compilers. Those seem like daunting monsters - just consider the amount of different features and, more importantly, the vast infinity of possible programs.
So where to start? I have a course about compilers at my college, but I have to wait another year to enroll in it. So welcome, Coursera. It's currently available only in self-study mode, but that's enough for me. I should mention the Dragon Book, but I haven't read it (yet) so I can't comment.

A compiler is basically lexer -> parser -> optimiser -> code generator.

The regex approach

I made it through the introduction and the lesson on lexical analysis, recognized finite automata as something from college (thank you, professor!) and finally understood their connection to regexes in java and the like (the professor mentioned regular expressions, but no-one figured out what the connection was).
Feeling empowered by the newly obtained knowledge, I set out to make a lexer for my language. I had no clear specification in mind, since this is supposed to be a fun and creative project... I intended to just invent as I go.
My lexer was something like this (I won't post the real code since I'm embarrassed, but there's a sketch after the list):

  • create a bunch of java regex objects that all match to the start of the string
  • take the string and try to match it against all regexes
  • return an identifier object corresponding to the match
  • remove matched part of the string
  • loop
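A minimal reconstruction of that approach - the token types and patterns here are made up:
import scala.util.matching.Regex

object NaiveLexer {
  sealed trait Token
  case class Num(s: String) extends Token
  case class Ident(s: String) extends Token
  case class Op(s: String) extends Token

  // every regex only ever matches at the start of the remaining input
  private val rules: List[(Regex, String => Token)] = List(
    ("""\d+\.?\d*""".r, Num(_)),
    ("""[a-zA-Z]\w*""".r, Ident(_)),
    ("""[+\-*/^()=]""".r, Op(_))
  )

  def tokenize(input: String): List[Token] = {
    val s = input.trim
    if (s.isEmpty) Nil
    else {
      // try to match every rule against the start of the string
      val hit = rules.flatMap { case (re, mk) =>
        re.findPrefixOf(s).map(matched => (matched, mk(matched)))
      }.headOption
      hit match {
        case Some((matched, token)) =>
          token :: tokenize(s.drop(matched.length)) // remove the match, loop
        case None => sys.error("no rule matches: " + s)
      }
    }
  }
}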
Yeah, pretty horrible.
I kinda abandoned this idea.

The recursive descent parser approach

By now I was into functional programming and the scala language. I'd also watched the lesson on recursive descent on coursera. The code was fugly - a bunch of c++ with pointer arithmetic and side effects. I want pretty functions :(
I considered doing a framework for this in scala or perhaps java but...
Enter scala's parser combinators. I suggest reading this if you aren't familiar with them. One of the reasons scala is awesome: you get parser "generation" baked into the standard library. Granted, it's a bit slow, but who cares - this is a project for fun, not for general usage. On the PLUS side you get productions in the language itself, and parsing is almost like magic.
And in scaladoc you even get a nice code snippet. What more can a geek ask for.

Well... an AST would be nice. Fortunately scala also provides action combinators that allow you to pattern match the parsers themselves and map them into an AST. And it even isn't that complicated.

I recreated the grammar from the scaladoc example and added the action combinators. The code block is kinda long, but I promise there is a short explanation at the bottom.
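Reconstructed to match the parsers as they appear in part 2, it went roughly like this:
import scala.util.parsing.combinator.RegexParsers

// Number, Add, Subtract, Multiply and Divide are the case classes from part 2
object ExpressionParser extends RegexParsers {
  private def expr: Parser[Expression] = term ~ rep(("+" | "-") ~ term) ^^ {
    case head ~ tail => {
      var tree: Expression = head
      tail.foreach {
        case "+" ~ e => tree = Add(tree, e)
        case "-" ~ e => tree = Subtract(tree, e)
      }
      tree
    }
  }

  private def term: Parser[Expression] = factor ~ rep(("*" | "/") ~ factor) ^^ {
    case head ~ tail => {
      var tree: Expression = head
      tail.foreach {
        case "*" ~ e => tree = Multiply(tree, e)
        case "/" ~ e => tree = Divide(tree, e)
      }
      tree
    }
  }

  private def factor: Parser[Expression] = number | ("(" ~> expr <~ ")")

  private def number: Parser[Expression] = """\d+\.?\d*""".r ^^ { s => Number(s.toDouble) }
}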


Notice the ^^ symbol. It sends the output from the parser combinator into my mapping function (anonymous). And I just create the case class.
There is also an evaluator at the end... but that's in part 2.

next -> Part 2: something that kinda works

Enhanced by Zemanta