LITHINOS - Intro to clj-peg

Intro to clj-peg

09.07.03

This page is out of date by at least several releases. You should refer to the latest User Manual found on the clj-peg project page.

The clj-peg library is meant to facilitate the use of Parsing Expression Grammars in Clojure. This post will introduce the one main macro and several helper functions defined in the core library as well as walk through some sample code.

Setup

Extracting the clj-peg release, you'll notice there are two JAR files inside (clj-peg.jar and clj-peg-samples.jar). To begin with, when you compile the sample code, you'll want to make sure both JAR files are on your classpath along with the Clojure JAR. I don't think I'm on the 1.0 release of Clojure, but the difference shouldn't matter.
For this intro, I've made a folder inside the extracted clj-peg release. I've called that folder 'test-intro'. Let's go through the basic setup and I'll show you the few files that I've put in the test-intro folder.
First up is the Ant build file. Place the following code in a file named 'build.xml' and put that file in the test-intro folder. (Ofcourse you might need to adjust some of the paths to the JAR's. Just change the property declarations to suit.)

<project name="default" default="default">

	<target name="default">
		<antcall target="compile" />
		<antcall target="run" />
	</target>

	<property name="clj_libs" value="/home/user/opt/clojure/" />
	<property name="clojure_jar" value="${clj_libs}/clojure.jar" />
	<property name="clj_peg_jar" value="../clj-peg.jar" />
	<property name="clj_peg_samples_jar" value="../clj-peg-samples.jar" />
	<property name="classes" value="classes" />
	<property name="app_file" value="test/main.clj" />
	<property name="app" value="test.main" />
	<property name="app_jar" value="test.jar" />
	<property name="vendor" value="LITHINOS" />
	<property name="title" value="Test" />
	<property name="version" value="0.1" />

	<target name="compile">
		<fail message="App file not found"><condition><not><available file="${app_file}" /></not></condition></fail>
		<echo message="Compiling ${app}" />
		<mkdir dir="${classes}" />
		<java classname="clojure.lang.Compile" fork="true" failonerror="true">
			<classpath>
				<pathelement location="." />
				<pathelement location="${classes}" />
				<pathelement location="${clojure_jar}" />
				<pathelement location="${clj_peg_jar}" />
				<pathelement location="${clj_peg_samples_jar}" />
			</classpath>
			<sysproperty key="clojure.compile.path" value="${classes}" />
			<arg value="${app}" />
		</java>
		<jar destfile="${app_jar}" basedir="${classes}" index="true">
			<zipfileset src="${clojure_jar}" includes="**/*.class" />
			<zipfileset src="${clj_peg_jar}" includes="**/*.class" />
			<zipfileset src="${clj_peg_samples_jar}" includes="**/*.class" />
			<manifest>
				<attribute name="Implementation-Vendor" value="${vendor}" />
				<attribute name="Implementation-Title" value="${title}" />
				<attribute name="Implementation-Version" value="${version}" />
				<attribute name="Main-Class" value="${app}" />
				<attribute name="Class-Path" value="." />
			</manifest>
		</jar>
		<delete dir="${classes}" />
	</target>

	<target name="run">
		<fail message="App jar not found"><condition><not><available file="${app_jar}" /></not></condition></fail>
		<echo message="Running ${app}" />
		<java jar="${app_jar}" fork="true" failonerror="true" />
	</target>

</project>

Next you'll want to create a single folder inside test-intro let's call the new folder 'test'. For now, let's have the contents of the test folder be a single file called 'main.clj'. This file should contain the following:

(ns test.main
    (:gen-class)
    (:use (com.lithinos.clj-peg core))
    (:use [com.lithinos.clj-peg.samples	:only (create-string-input)]))
 
(defn -main [& args]
    (println "Works"))

At this point, if you run the 'ant' command inside the test-intro folder, you should see:

run:
     [echo] Running test.main
     [java] Works

Now that we know we have our environment setup correctly, let's get to the meat of this post.

Motivation

Recently there were a few emails posted to the Clojure Google Groups mailing list about verifying balanced brackets/parens. While there are certainly more idiomatic approaches, let's use that challenge as the basis for our intro.
The basics of the original posting is that valid expressions only have the following four characters: (, ), [, and ]. All valid expressions will also be composed of balanced matching pairs. So, the expressions "()[]" and "([])" are valid, while "([)]" and "([)" are not.

Begining Use

Let's start using clj-peg by adding the following code above our main function declaration:

(eval '(make-parser {
    :supress-header true
    :main           balanced?
    :main-doc       "Returns true for balanced expressions composed only of matching pairs of the characters '(', ')', '[', and ']'."
    :main-ast       (fn [ast] (try ast true (catch Error e false)))
    :create-input   create-string-input
    :rules (
        Expr <- #".+"
    )}))

This is simply a call to the make-parser macro. Some of the keys are required and others are optional. If you really want to be overloaded with debug-like information, you could also set the :debug key to true. The :supress-header key is a form of debug, that doesn't serve much purpose now, but should serve more in the future. The :main key is a symbol that will become the main entry point for what is generated from the grammar. You'll use this to have the generated functions process some input. The :main-doc key is tied through meta-data to the main function, so that calls to (doc balanced?) return a useful description.
The :main-ast key points to a function that processes the final result of all other AST processing functions. AST processing functions can be attached anywhere in the grammar using the =ast function. While this post won't contain any calls to =ast, the concept is simple enough that it doesn't need much explanation, and might be clarified in a future post.
The :create-input key is where the real magic is. While the function that the :create-input key points to must return a valid input struct, what that struct does is completely up to the developer. This part of the clj-peg library will most certainly be the topic of a future post.
The :rules key takes a list that is eventually broken down into triplets. The first element in any triplet is a symbol that becomes a function representing that branch of the grammar. This element is traditionally referred to as a non-terminal. The second element is completely ignored. I chose to use the characters <- since it matched most pseudo-grammars that I've seen. You could probably use any sequence of characters that you wanted, but Clojure reader macros might cause problems. The third element is the definition of the body of the non-terminal. This body can contain pretty much anything, and I'm only going to show how to use the functions provided in the clj-peg core. There will probably be another post detailing the requirements of this third element. Let's get back to our sample code.
The code should still compileand you shouldn't see anything different. We'll need to call one of the generated functions to use our parser, and for now, since we haven't defined the grammar, it will return 'true'. So let's modify the main function so that it looks like the following (and run it to verify that it still returns true):

(defn -main [& args]
	(println (balanced? "123")))

Now that we've got the basics of calling the make-parser macro in place we can finish filling in the grammar. Let's go through some of the other functions from the clj-peg core package that we'll be using.

Functions Provided in Core

Use a call to =o anytime you want to give a short-circuited list of options. For instance, if you want to say, "Either A or B or C, in that order, whichever matches first", you'd type the expression, (=o A B C). If A is a valid match at that point in the grammar, then B and C will never be evaluated.
Use a call to =s anytime you want to give a sequence. For instance, if you want to say, "Now match A then B", you'd type the expression, (=s A B). Each part must match for the whole to match. If A matches then B is tested next and must match the input that remains after A for the sequence to match.
The expressions =*, =+, and =? act like their regex counterparts. If you want zero or more of something, type the expression, (=* something). If you want one or more of something, type the expression, (=+ something). If you want zero or one of something, type the expression, (=? something).
If you want to say that there shouldn't be any more input, you'd type the expression, (=e). Don't forget the parenthesis, =e is a function.
As for the leafs of your grammar, when you use the create-string-input function provided in clj-peg-samples, you can have a Clojure regex or string. If you provide your own create-input function you're free to do anything you'd like.

Wrapping Up

With those introductions out of the way, here's the body of the rules:

Expr         <- (=s BalancedExpr (=e))
BalancedExpr <- (=o BalancedSeq Empty)
BalancedSeq  <- (=+ (=o ParenExpr BracketExpr))
ParenExpr    <- (=s LParen BalancedExpr RParen)
LParen       <- "("
RParen       <- ")"
BracketExpr  <- (=s LBracket BalancedExpr RBracket)
LBracket     <- "["
RBracket     <- "]"
Empty        <- ""

Let's spice up our main function a bit by changing it to the following:

(defn -main [& args]
    (doseq [expr [
                    "" "[]" "()" "[()]" "([])" "[]()" "()[]" "([][])" "[()()]"
                    "[]()[()]([])[]()()[]([][])[()()]"
                    "([" ")]" "[(" ")]"
                    "asd" "123"
                    "[]asd" "()asd" "[()]asd" "([])asd" "[]()asd" "()[]asd" "([][])asd" "[()()]asd"
                    "asd[]" "asd()" "asd[()]" "asd([])" "asd[]()" "asd()[]" "asd([][])" "asd[()()]"]]
        (printf "%40s %s %s%n" expr (if (balanced? expr) "is" "is not") "balanced")
        (flush)))

Here's the whole program:

(ns test.main
    (:gen-class)
    (:use (com.lithinos.clj-peg core))
    (:use [com.lithinos.clj-peg.samples	:only (create-string-input)]))
 
(eval '(make-parser {
    :supress-header true
    :main           balanced?
    :main-doc       "Returns true for balanced expressions composed only of matching pairs of the characters '(', ')', '[', and ']'."
    :main-ast       (fn [ast] (try ast true (catch Error e false)))
    :create-input   create-string-input
    :rules (
        Expr            <- (=s BalancedExpr (=e))
        BalancedExpr    <- (=o BalancedSeq Empty)
        BalancedSeq     <- (=+ (=o ParenExpr BracketExpr))
        ParenExpr       <- (=s LParen BalancedExpr RParen)
        LParen          <- "("
        RParen          <- ")"
        BracketExpr     <- (=s LBracket BalancedExpr RBracket)
        LBracket        <- "["
        RBracket        <- "]"
        Empty           <- "")}))
 
(defn -main [& args]
    (doseq [expr [
                    "" "[]" "()" "[()]" "([])" "[]()" "()[]" "([][])" "[()()]"
                    "[]()[()]([])[]()()[]([][])[()()]"
                    "([" ")]" "[(" ")]"
                    "asd" "123"
                    "[]asd" "()asd" "[()]asd" "([])asd" "[]()asd" "()[]asd" "([][])asd" "[()()]asd"
                    "asd[]" "asd()" "asd[()]" "asd([])" "asd[]()" "asd()[]" "asd([][])" "asd[()()]"]]
        (printf "%40s %s %s%n" expr (if (balanced? expr) "is" "is not") "balanced")
        (flush)))

Final Words

As an almost last word, I'm not sure this library is very usable yet. The runtime is likely exponential, since I haven't added the traditional memoize part of linear runtime PEGs. Along with the modifications to bring the runtime down from exponential, there are other plans. One of the items on my todo list is to keep playing with the syntax, since I still feel like there's too many characters, and it doesn't flow very well.

Both the exponential runtime and the dislike of the syntax have been addressed in the 0.4 release. You'll want to look at the relevant post for more information.

As the actual last word, I'm releasing clj-peg under a modified BSD license. You'll want to make sure you read it (it's short), since you can't use clj-peg for commercial gain. Most of the reason it's under a non-open license is so that I can finish it before accepting others contributions. I should have time in the next few months to get to that point.