How to customize the syntax of OCaml, using Camlp5
Everything you always wanted to know, but were afraid to ask

Copyright © 2005, 2010 Martin Jambon. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the file fdl.txt. The source code of this document is the file extend-ocaml-syntax.html.mlx.

This tutorial is an individual initiative to provide additional documentation for Camlp5, a sophisticated tool for Objective Caml programmers.

2010 revision: This document was updated to reflect the name change of the legacy Camlp4, now called Camlp5. The following table clarifies name issues:

Period	Name of "legacy Camlp4"	Name of "new Camlp4"
before 2007	Camlp4	-
from 2007 (OCaml 3.10)	Camlp5	Camlp4

The examples of this tutorial will not work with the new Camlp4 starting with OCaml 3.10.

Contents
1. What is it about?
2. Is Camlp5 what I need?
3. Intended audience
4. Getting started
5. Which language does Camlp5 speak?
6. What goes in and what comes out
7. Dissection of a syntax expander
8. Variations around the same example
8.1 Using only one quotation
8.2 Removing the hidden reserved identifier
8.3 Adding more constructs
9. Local syntax extensions, transformations of the AST
10. Inserting some code at the beginning of the file
11. Inserting toplevel expressions
12. Inserting hidden expressions which are evaluated once
13. Mastering priorities
14. Local use of external parsers
14.1 Parsing raw blocks of text
14.2 Parsing the stream of tokens
15. Producing useful error messages
16. Suggestions for a better interaction between multiple syntax extensions
16.1 Avoiding strong incompatibilities
16.2 Avoiding name clashes
17. Things you can do
17.1 Catching exceptions only where needed: let try name = expr1 in expr2 with exception-handler
17.2 Read 1/2 as 1. /. 2., but only locally
17.3 Default values for record fields
17.4 Anonymous recursive functions
18. Things you cannot do and workarounds
18.1 Inserting anything, anywhere
18.2 Adding end of line comments
18.3 Adding a notation for raw strings
18.4 Adding Haskell's "infixing" backquotes such as f `map` list
18.5 Adding SML's #n notation to extract field number n of any tuple
19. Common problems
19.1 I can't build a list with something like <:expr< [ $list:my_list$ ] >>
19.2 I can't build a function declaration with something like <:expr< let f $list:args$ = $e$ >>
19.3 Incorrect locations in error reports
19.4 Unbound value _loc (or loc)
19.5 What's wrong with labels: <:expr< f ~$lid:labelname$ >> doesn't work
19.6 Not_found is raised during the preprocessing

1. What is it about?

We are talking about truly modifying the syntax of OCaml. Yes, in theory anyone could modify the syntax of this programming language, without rewriting a whole dedicated parser. Camlp5 is the tool that lets you do this. And many syntax enhancements can be performed in relatively few lines of code.

However, there are quite a few things to know before starting to write your own syntax extension which will implement exactly what you want. This tutorial is meant to address the common difficulties that people encounter when they start using Camlp5 for this purpose. It is essentially based on my recent experience in integrating a dedicated syntax for regular expressions into OCaml and mixing this form of pattern matching with the traditional pattern matching of OCaml.

2. Is Camlp5 what I need?

Camlp5 lets you do amazing things that have no equivalent in most other programming languages. You might want to define a domain-specific language (DSL) without wasting your time in developing one more interpreter of poor quality which is not reusable at all. With Camlp5, you may create syntactic shortcuts for the most common operations that your DSL requires and at the same time benefit from all the qualities of OCaml: automatic type inference, static typing, early detection of errors, precise location of mistakes in your code, most of the advantages of text-editor modes for OCaml, interface with other languages, an interactive interpreter and of course the generation of high performance native code.

Camlp5 is an excellent solution if you want to add a syntax which is a shortcut for something that is obviously, mechanically expandable into standard OCaml without having to resort to the type information. Camlp5 lets you work on the abstract syntax tree (AST), which does not contain information on the actual type of the object being manipulated.

If the syntax extension you want to provide requires the knowledge of the type of each object, you can still design a dedicated embedded language that will be compiled into OCaml. Camlp5 provides a convenient mechanism for producing OCaml code, therefore compiling any given language to OCaml is an excellent choice in many cases even if the parsing facilities of Camlp5 are not used.

This tutorial is about learning how to develop your own syntax extensions but you might want to have a look at existing syntax extensions that are available on the web. The main sources for finding Camlp5 extensions are: the Caml hump on the official Caml site, the OCaml Link Database (under "OCaml syntax extension") and P4ck.

3. Intended audience

If you are interested in modifying the syntax of OCaml and you are a bit confused by the official tutorial and manual for Camlp5, then I hope this document will help you get started.

In order to start discovering Camlp5, you need to be fluent in OCaml since it will be our main language (with variants) for doing everything. You need to be familiar with the higher order functions (HOF) of OCaml such as List.map, List.fold_right and List.fold_left since we will use them a lot for manipulating syntax trees. So if you are not familiar with these and functional programming in general, practice a little first.

4. Getting started

You will need a standard installation of OCaml, which should include the OCaml compilers (ocaml, ocamlc, ocamlopt) and the Camlp5 system (camlp5, camlp5o).

As usual for editing and testing OCaml code, you will need a good editing mode for OCaml in your favorite text editor, but we assume you know all about this.

You also need some way to compile your code automatically. I use make with OCamlMakefile. The good thing about make is that if you don't want to be too subtle, you can just write one target and the hardcoded sequence of commands that recompiles everything. This might still be the best choice in many cases. Anyway, don't waste your time with such things.

All the files that are given as examples throughout this document can be browsed from http://martin.jambon.free.fr/camlp5 or downloaded as a compressed archive.

5. Which language does Camlp5 speak?

One of the main sources of confusion for beginners is the presence of multiple languages that are used for different things but which all look more or less like OCaml. We will see later the details about all of this, but just keep in mind that we will use the following languages:

regular OCaml
modified OCaml using EXTEND ... END constructs for defining syntax extensions (we use a predefined syntax extension in order to define our own syntax extensions)
quotations: these look like <:ident< ... >> or just << ... >>. Quotations are a generic syntax extension which is supported in any OCaml code which is preprocessed by Camlp5. The contents of the quotation will be expanded in place according to ident, which has to be known by the preprocessor.
revised OCaml syntax. This syntax is different from the standard OCaml syntax. We will use it inside of quotations for building OCaml syntax trees, in combination with
antiquotations which look like $this$ and are arbitrary OCaml code written in the regular syntax which lets you insert automatically generated nodes into the syntax tree being defined by the current quotation.

Now that you are completely confused, you understand why some Camlp5 syntax extensions that may seem natural, simple and readable may be very discouraging when you are a beginner who is trying to use existing code as a template.

6. What goes in and what comes out

We have two kinds of files: those which define a syntax or modify an existing syntax, and the files that are written in this syntax.

Warning: the syntax of a file is never defined within the file itself, but in a separate file.

An OCaml file written in the standard syntax does not need to be preprocessed. It is directly compiled into bytecode or native code by one of the OCaml compilers (ocamlc, ocamlopt or ocaml).

Extending the syntax of OCaml simply means that we will use a modified syntax for writing our programs and this non-standard syntax is not understood by the OCaml compilers. Therefore our programs need to be preprocessed by a converter from our exotic syntax into plain old OCaml. Camlp5 provides us with tools for writing our specific preprocessor.

The command-line tool which will serve as a base for building our specific preprocessor is camlp5. camlp5 alone does nothing very interesting for us: we need to feed it with our definition of how to convert our syntax into regular OCaml. This is done by passing object files (.cmo) to camlp5. As a convention, we will name these files according to their role:

pa_ (as in parsing) is used as a prefix for the files that define or modify a grammar, i.e. how the input file should be converted into an abstract syntax tree (AST)
q_ (as in quotation) is used as a prefix for the files that define how to expand the contents of quotations. Quotations are single, predefined tokens in the OCaml syntax but are meant to be expanded into some normal OCaml expression or pattern using arbitrary lexing and parsing rules. These files are not named with the pa_ prefix since technically they do not add rules in the general grammar that we may extend.
pr_ (as in printing) is used as a prefix for the files that define how to export the AST.

Camlp5 provides a file named pa_o.cmo which parses the standard syntax of OCaml (with only one addition, the quotations: see later). It provides a file named pr_o.cmo which converts an OCaml AST into the concrete syntax of OCaml, i.e. a valid source file for the OCaml compilers. Thus the command camlp5 pa_o.cmo pr_o.cmo should read a standard OCaml source file and reprint an equivalent program from the point of view of the compilers:

$ cat hello.ml
print_endline "Hello World!";;
$ camlp5 pa_o.cmo pr_o.cmo hello.ml
let _ = print_endline "Hello World!"

Another useful printing file is pr_dump.cmo. If you try it instead of pr_o.cmo, you will get an unreadable output. This is a binary representation of the AST which can be read back efficiently by the OCaml compilers and more importantly without losing trace of the location of the original tokens in the source file. We will therefore reserve the usage of pr_o.cmo for reviewing the generated OCaml code but not compile it further. camlp5o is a predefined shortcut for camlp5 pa_o.cmo pa_op.cmo pr_dump.cmo: it parses the regular syntax of OCaml and outputs a compiler-friendly representation of the AST. The additional file pa_op.cmo is a predefined syntax extension of OCaml. It actually implements an experimental syntax for streams and parsers which was used in earlier versions of OCaml. The interesting thing to notice here is that we load two files starting with pa_: pa_o.cmo defines the grammar of OCaml from scratch while pa_op.cmo only adds syntax rules to this grammar. This is really a syntax extension: we load different files which will successively create and modify the concrete syntax that is understood by the preprocessor.

Since we are interested in parsing a syntax which is a modified OCaml and converting it into an OCaml AST, we will always use camlp5o, and tell it to load our pa_*.cmo files.

Warning: don't get confused: the files that define syntax extensions themselves use a modified syntax of OCaml and therefore have to be preprocessed with camlp5o loaded with the adequate files (usually pa_extend.cmo and q_MLast.cmo).

The most important point to remember for now is that the center of everything is the abstract syntax tree. The type of the nodes of this tree is fixed and is the only one which can be understood by the OCaml compilers.

The next step is to see how to add new syntactic constructs to the grammar of OCaml and how to expand them into the intended AST.

7. Dissection of a syntax expander

For testing our example, we are going to use a Makefile which is merely a sequence of commands. We are going to write two programs: pa_tryfinally.ml defines our syntax extension and prog.ml is a simple test program. Here is the Makefile:

NAME = tryfinally
all:
	camlp5o pa_extend.cmo q_MLast.cmo pr_o.cmo pa_$(NAME).ml \
		-o pa_$(NAME).ppo -loc loc
	camlp5o pa_extend.cmo q_MLast.cmo pa_$(NAME).ml \
		-o pa_$(NAME).ast -loc loc
	ocamlc -c -I +camlp5 -pp 'camlp5o pa_extend.cmo q_MLast.cmo -loc loc' \
		-dtypes pa_$(NAME).ml
	camlp5o -I . pr_o.cmo pa_$(NAME).cmo prog.ml -o prog.ppo
	camlp5o -I . pr_r.cmo pa_$(NAME).cmo prog.ml -o prog.ppr
	ocamlopt -dtypes -o prog -pp 'camlp5o -I . pa_$(NAME).cmo' prog.ml
	caml2html -t -ln pa_$(NAME).ml
	caml2html -t -ln prog.ml

clean:
	rm -f prog *.ppo *.ppr *.cmo *.cmi *.o *.cmx *.ast *~ *.ml.html

We want to add a try ... finally construct whose behavior is illustrated by this example:

File prog.ml:

let _ =
  try
    failwith "this is not an error"
  finally
    print_endline "OK"

should be converted into the following program written in the standard syntax of OCaml:

File expected.ml:

let _ =
  let __finally1 =
    try
      failwith "this is not an error";
      None
    with exn ->
      Some exn in
  print_endline "OK";
  match __finally1 with
      None -> ()
    | Some exn -> raise exn

This new syntactic construct is formed by two keywords (yellow regions) and two expressions (grey regions): try is already a keyword in OCaml and finally is a new keyword. The two expressions are preserved during the conversion to standard OCaml and some auxiliary code is added around them in order to achieve the desired effect. The desired effect consists in evaluating an expression e1 first and then evaluating an expression e2, even if the evaluation of e1 raised an exception, in which case this exception is re-raised after the evaluation of e2.

In real programs this is often useful for closing an open file at the end of its manipulation even if an error occurred. Please note that a more useful version of this syntax extension exists, but it's not the point here.
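The run-time behavior described above can be modeled without any syntax extension, as an ordinary higher-order function. The following is only a sketch in plain OCaml: try_finally, events and note are names chosen for this illustration, and the two thunks stand for the expressions e1 and e2.

```ocaml
(* Plain-OCaml model of the behavior of "try e1 finally e2".
   [body] and [cleanup] are thunks standing for e1 and e2;
   these names are hypothetical and not part of Camlp5. *)
let try_finally (body : unit -> unit) (cleanup : unit -> unit) : unit =
  (* Run the body, remembering any exception instead of propagating it. *)
  let outcome =
    try body (); None
    with exn -> Some exn
  in
  (* The cleanup code always runs, whether or not the body failed. *)
  cleanup ();
  (* Re-raise the exception of the body, if any. *)
  match outcome with
  | None -> ()
  | Some exn -> raise exn

(* Record the order of events so that the behavior can be observed. *)
let events : string list ref = ref []
let note s = events := s :: !events

(* [raised] tells whether the exception of the body reached the caller. *)
let raised =
  try
    try_finally
      (fun () -> note "body"; failwith "this is not an error")
      (fun () -> note "cleanup");
    false
  with Failure _ -> true
```

Here the cleanup thunk is executed after the body and the Failure exception still reaches the caller. The syntax extension provides the same behavior without asking the user to wrap e1 and e2 in functions.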

During the transformation of our program, we introduced new identifiers at three different places (pink). One of them, __finally1 must have a name that is unlikely to interfere with existing names. We had to decide that identifiers starting with __finally are reserved for the syntax expander and should not be used directly. The two other identifiers are named exn at two different places and are not visible in the user-defined code (grey). Therefore it is perfectly safe to use canonical names such as exn, x, s or whatever we like.

Now we will see how to implement this transformation. Here is a solution:

File pa_tryfinally.ml:

(* The function that returns unique identifiers *)
let new_id = 
  let counter = ref 0 in
  fun () ->
    incr counter;
    "__finally" ^ string_of_int !counter

(* The function that converts our syntax into a single OCaml expression,
   i.e. an "expr" node of the syntax tree *)
let expand loc e1 e2 =
  let id = new_id () in
  let id_patt = <:patt< $lid:id$ >> in
  let id_expr = <:expr< $lid:id$ >> in
  <:expr<
  let $id_patt$ =
    try do { $e1$; None } 
    with [ exn -> Some exn ] in
  do { $e2$;
       match $id_expr$ with
           [ None -> ()
           | Some exn -> raise exn ] }
  >>

(* The statement that extends the default grammar, 
   i.e. the regular syntax of OCaml if we use camlp5o 
   or the revised syntax if we use camlp5r *)
EXTEND
  Pcaml.expr: LEVEL "expr1" [
    [ "try"; e1 = Pcaml.expr; "finally"; e2 = Pcaml.expr -> expand loc e1 e2 ]
  ];
END;;

This program is written with a strange syntax: it uses three quotations (grey areas) which start with something of the form <:name< where name is the name of a predefined quotation expander and are terminated by >>. Here we use two different quotation expanders: expr and patt. These quotation expanders are loaded from the file q_MLast.cmo (q_ means quotations, and the rest means ML AST = OCaml abstract syntax tree). The contents of these quotations look very much like OCaml code but not exactly: they are actually expanded into a representation of the AST using concrete types. Have a look at the program after preprocessing, pa_tryfinally.ppo, in order to see the effect of the quotation expanders.

Warning: The quotations which serve as shortcuts for building nodes of the OCaml AST do not use the usual syntax of OCaml, but must be written in the revised syntax. Unfortunately you will have to learn this new syntax. One way is to read the reference manual of Camlp5. Another way is to convert your own programs to this syntax with camlp5o pr_r.cmo and compare the output with the input.

Warning: The contents of these quotations are written in the revised syntax of OCaml, with the exception of the pieces which appear between dollars ( $...$ ). They are called antiquotations and are a way to insert nodes of the syntax tree which have been defined previously.

In the example, id_patt and id_expr are two simple nodes of the AST which are respectively of the types MLast.patt and MLast.expr. They both stand for a lowercase identifier, but once in a pattern and once in an expression. We just said that antiquotations are a pair of dollars containing an OCaml expression which stands for a predefined node of the AST. Actually, in addition we can use labels such as lid: in this portion of our example:

let id_patt = <:patt< $lid:id$ >> in
...

It means that the actual contents of the antiquotation (yellow) is a string which represents a lowercase identifier (lid). Here id has to represent a valid lowercase identifier, which is the case (id = "__finally1"). Using labels in antiquotations is required to convert one basic type to a node of the syntax tree. It is also important for disambiguation since a string can represent a lowercase identifier, but also an uppercase identifier, an escaped string literal or an escaped character literal. They all have a corresponding label (see the reference manual for details).

An important feature is to keep track in the AST of the location of the original source code. Therefore, a location is associated with each node of the AST. When manipulating the AST with quotations, the quotation expander uses a predefined name for the locations. This name is by default loc in the versions of Camlp5 up to 3.08.2 and _loc in the following versions. The best way to avoid trouble is to pass the -loc loc option to camlp5o and use loc. So a location must be available under the name loc when building the AST with quotations. In return, when destructuring the AST with pattern-matching using quotations, a variable named loc is automatically defined. The same thing happens in grammar rules of EXTEND statements, which explains the availability of a loc object which seems to come from nowhere in our EXTEND statement.

Now you should really take the time to understand completely the system of quotations. You must realize that they are used for building nodes of the AST whose types are defined in the MLast module of the Camlp5 library. These quotations can also be used in pattern matching if you need to extract some information from an existing AST or if you want to substitute it.

Let's now have a look at the EXTEND statement. What we extend is the default grammar. The default grammar has been set by pa_o.cmo which is implicitly loaded by camlp5o. This is the grammar of the regular syntax of OCaml. We will not extend the set of possible tokens or how they are recognized, but only their meaning according to their sequential arrangement. An EXTEND statement contains a list of grammar entries that will be extended. Each grammar entry consists of a collection of rules. The entries can be predefined or newly defined. They can be made visible and therefore extensible by other syntax extensions or not. Here we just extend the Pcaml.expr entry which defines the syntax of an OCaml expression as its name indicates. A rule within a given entry is made of a pattern (yellow block) which is associated with an OCaml expression which defines a syntax node (grey block):

EXTEND
  Pcaml.expr: LEVEL "expr1" [
    [ "try"; e1 = Pcaml.expr; "finally"; e2 = Pcaml.expr -> expand loc e1 e2 ]
  ];
END;;

The patterns are matched according to precedence levels. Here we know that a level named expr1 exists, and we know its meaning and relative priority with respect to existing syntactic constructs. We know this from the file pa_o.ml of the Camlp5 library. So we insert a rule exactly in this level; no new level is created, which would have been the case if we had not used a LEVEL annotation.

The rest is self-explanatory: try and finally are implicitly made reserved keywords of the language if not already, and we extract e1 and e2 which are two expression nodes (Pcaml.expr is a grammar entry which returns objects of type MLast.expr). Our new rule itself must return a node of type MLast.expr. This is the role of our expand function.

After compilation of the syntax extension, we use it to rewrite our program in the regular OCaml syntax:

File prog.ppo:

let _ =
  let __finally1 =
    try failwith "this is not an error"; None with exn -> Some exn
  in
  print_endline "OK";
  match __finally1 with
    None -> ()
  | Some exn -> raise exn

and in the revised syntax, which is closer to what we wrote in the quotations of our file pa_tryfinally.ml:

File prog.ppr:

do {
  let __finally1 =
    try do { failwith "this is not an error"; None } with exn → Some exn
  in
  print_endline "OK";
  match __finally1 with
  [ None → ()
  | Some exn → raise exn ]
};

And the program prog runs as expected:

$ ./prog
OK
Fatal error: exception Failure("this is not an error")

8. Variations around the same example

We will rearrange the source code of our try ... finally syntax extension in order to see better which element is responsible for which effect and learn more about Camlp5.

8.1 Using only one quotation

First, we might use only one quotation to represent the node of the AST which is returned by the try ... finally rule. We are talking of the expand function, which was defined like this:

let expand loc e1 e2 =
  let id = new_id () in
  let id_patt = <:patt< $lid:id$ >> in
  let id_expr = <:expr< $lid:id$ >> in
  <:expr<
  let $id_patt$ =
    try do { $e1$; None } 
    with [ exn -> Some exn ] in
  do { $e2$;
       match $id_expr$ with
           [ None -> ()
           | Some exn -> raise exn ] }
  >>

So, we can inline the definitions of id_patt and id_expr, which here simplifies our source code:

let expand loc e1 e2 =
  let id = new_id () in
  <:expr<
  let $lid:id$ =
    try do { $e1$; None } 
    with [ exn -> Some exn ] in
  do { $e2$;
       match $lid:id$ with
           [ None -> ()
           | Some exn -> raise exn ] }
  >>

The first occurrence in the quotation of our newly created identifier id is a pattern according to the Camlp5 terminology, and the second occurrence is an expression. This is inferred simply by the context: let patt = in the first case and match expr with in the second case.

Here is some dummy OCaml code where some patterns and expressions have been highlighted in yellow (patterns) and pink (expressions):

let x = "abc"
let _ =
  let z2 =
    let z = 5 * 3 in z * z in
  print_int z2;
  match x, z2, Some true with
      "a", _, None -> ()
    | ("abc" | "ab"), 0, Some false -> print_endline "something"
    | _ when z2 < 10 -> ()
    | _ -> ()

"Toplevel expressions" such as let x = 2 or type t = A | B of string are actually not expressions, but declarations which are elements of the implementation of the current module. In Camlp5, these are called str_item (str is reminiscent of the struct keyword which introduces submodule implementations). There is a quotation expander for str_items, as well as for other families of syntactic elements that we did not encounter yet.

8.2 Removing the hidden reserved identifier

Let's now remove the unnecessary identifier which has a reserved prefix. This spares the users of our syntax extension from having to remember that __finally is a forbidden prefix. And there is unfortunately no way of generating identifiers in a reserved Camlp5 namespace.

We completely rewrite the quotation so that all the identifiers we introduce are not accessible by user-defined expressions. Here is one solution:

let expand loc e1 e2 =
  <:expr<
  let f1 () = $e1$
  and f2 () = $e2$ in
  do { (try f1 ()
        with exn -> do { f2 (); raise exn });
       f2 () }
  >>

There are several reasons why we have to write such twisted code:

the closures f1 and f2 play their role well since they both record the environment before any new binding is added (and they don't see each other);
we want to avoid exponential growth in size of the generated code, which could happen if we duplicate the whole contents of the expression e2 (e2 itself might contain try ... finally statements).

And we hope that the compiler handles the closures efficiently.
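To see why the generated names f1 and f2 cannot interfere with user code, here is a sketch of what the expansion looks like when written by hand in standard OCaml. The user-level binding f1, the trace buffer and the function run are assumptions made for this illustration only.

```ocaml
(* A user-level binding whose name happens to be the same as one of the
   names introduced by the expansion. *)
let f1 = "user-level f1"

(* Buffer used to observe the order of evaluation (illustration only). *)
let trace = Buffer.create 64

(* Hand-written expansion of:  try <e1> finally <e2>
   where e1 fails and e2 mentions the user's own f1. *)
let run () =
  let f1 () : unit =
    (Buffer.add_string trace "e1;"; failwith "boom")
  and f2 () : unit =
    (* "let ... and ..." is not recursive: inside this body, f1 still
       refers to the user's string, not to the closure defined above. *)
    Buffer.add_string trace ("e2 sees " ^ f1 ^ ";")
  in
  (try f1 () with exn -> f2 (); raise exn);
  f2 ()

(* [failed] tells whether the exception raised by e1 reached the caller. *)
let failed = try run (); false with Failure _ -> true
```

In this sketch e2 is evaluated exactly once on the failing path, and the expression standing for e2 still sees the user's f1. After the in keyword, f1 and f2 are visible, but only to the code generated by the expander.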

In general, a good approach would be to implement the initial solution which is more natural, and choose our automatically-generated identifiers so that there is no clash with the user-defined identifiers. There is no simple generic solution for doing this (yet) since it requires a lexical analysis of whole subtrees with a lot of different kinds of nodes. We will see an example in which we actually do something like this later.

8.3 Adding more constructs

We want to add another syntax for expressing the same thing as try ... finally. We want the following:

try e1 finally e2

could as well be written as:

before e2 try e1

We will insert a rule for this syntax in the same priority level as we did previously for try ... finally. For now, just notice that we place a vertical bar directly between the rules within the innermost pair of square brackets which represent the extension of the same level:

EXTEND
  Pcaml.expr: LEVEL "expr1" [
    [ "try"; e1 = Pcaml.expr; "finally"; e2 = Pcaml.expr -> expand loc e1 e2
    | "before"; e2 = Pcaml.expr; "try"; e1 = Pcaml.expr -> expand loc e1 e2 ]
  ];
END;;

Understanding the system of levels is the object of a dedicated section of this tutorial.

9. Local syntax extensions, transformations of the AST

Using the EXTEND statement, we are able to add or replace grammar rules. We have seen that a rule consists in building a syntax node for the OCaml AST. Earlier we defined a rule like this:

   "try"; e1 = Pcaml.expr; "finally"; e2 = Pcaml.expr -> expand loc e1 e2

e1 and e2 are two expressions, i.e. two nodes of type MLast.expr. From these expressions, we build a syntax node which is itself an expression. This is the role of our expand function.

In that case, we don't have to transform e1 or e2. However, things are not always so simple. Let's consider the following problem: we want to create a syntax which has the following properties:

locally we can switch to a special syntax,
this special syntax is a slight modification of the OCaml syntax.

Here is a concrete instance: in order to make the code for numeric computations easier to read, we want to read ints as floats and their operators (+ - * /) as the equivalent operators over floats (+. -. *. /.). However we don't want this to be applied everywhere in the file, but only in expressions that are introduced by a new FLOAT keyword, since it makes it nearly impossible to use ints within the new syntax:

let x = FLOAT 3/2 - sqrt (1/3)
let i = 1 + 2 + 3

would be converted into:

let x = 3. /. 2. -. sqrt (1. /. 3.)
let i = 1 + 2 + 3

which is less pleasant for the eye.

Using an EXTEND statement, we could relatively easily add rules that interpret int constants as their float equivalent, interpret + as +. and so on (if this is not obvious to you, implement it as an exercise using the knowledge introduced in the previous sections and the file pa_o.ml of the distribution). This would however interpret any occurrence of 2 as the float 2.0 for instance, which is not satisfactory.

On the other hand, we want to benefit from the full OCaml syntax within our "FLOAT" expressions (which by the way do not have to be of type float).

One solution to this problem is to define a quotation expander which uses a globally-modified OCaml syntax. In other words, our example would look like this:

let x = <:float< 3/2 - sqrt (1/3) >>
let i = 1 + 2 + 3

But this is not exactly what we want and I don't know how to do this. Moreover it might not be so simple since we would have to manipulate two variants of OCaml grammars at the same time, not only the default one (maybe a look at the implementation of HereDoc could help).

The solution we will adopt is extremely inelegant, but works after all and does not require as much effort as it seems at first sight.

For testing our syntax extension, we are going to use this Makefile. We will perform the tests over the following program:

File prog.ml:

let f x =
  FLOAT 
   let pi = acos (-1) in
   x/(2*pi) - x**(2/3)

let _ = 
  let x = 2.5 in
  Printf.printf "%g -> %g\n" x (f x)

And it should be transformed into this:

File expected.ml:

let f x =
  let pi = acos (-1.) in
  x /. (2. *. pi) -. x ** (2. /. 3.)

let _ = 
  let x = 2.5 in
  Printf.printf "%g -> %g\n" x (f x)

Here comes our syntax extension. We use predefined quotations for recursively destructuring the syntax tree, as well as for reconstructing it. Only the yellow regions are actually specific, the rest is very repetitive and can be reused in other programs that need to rewrite expr nodes.

File pa_float.ml:

(* The following function takes an expr syntax node and replaces 
   all occurrences of int constants and operators by their float equivalent.

   The code is directly derived from the section on the quotations 
   for manipulating OCaml syntax trees in the reference manual.

   This code can be easily reused by copy-pasting.
*)
let rec subst_float expr =
  let loc = MLast.loc_of_expr expr in
  let se = subst_float in
  let sel = List.map subst_float in
  let spwel = List.map (fun (p, w, e) -> (p, w, se e)) in
  match expr with
      <:expr< $e1$ . $e2$ >> ->          <:expr< $se e1$ . $se e2$ >>
    | <:expr< $anti:e$ >> ->             <:expr< $anti:se e$ >>
    | <:expr< $e1$ $e2$ >> ->            <:expr< $se e1$ $se e2$ >>
    | <:expr< $e1$ .( $e2$ ) >> ->       <:expr< $se e1$ .( $se e2$ ) >>
    | <:expr< [| $list:el$ |] >> ->      <:expr< [| $list:sel el$ |] >>
    | <:expr< $e1$ := $e2$ >> ->         <:expr< $se e1$ := $se e2$ >>
    | <:expr< $chr:c$ >> ->              expr
    | <:expr< ($e$ :> $t$) >> ->         <:expr< ($se e$ :> $t$) >>
    | <:expr< ($e$ : $t1$ :> $t2$) >> -> <:expr< ($se e$ : $t1$ :> $t2$) >>
    | <:expr< $flo:s$ >> ->              expr
    | <:expr< for $s$ = $e1$ $to:b$ $e2$ do { $list:el$ } >> -> 
          <:expr< for $s$ = $se e1$ $to:b$ $se e2$ do { $list:sel el$ } >>
    | <:expr< fun [ $list:pwel$ ] >> ->  <:expr< fun [ $list:spwel pwel$ ] >>
    | <:expr< if $e1$ then $e2$ else $e3$ >> -> 
        <:expr< if $se e1$ then $se e2$ else $se e3$ >>

    | <:expr< $int:s$ >> -> (* we change the int constants into floats *)
        let x = string_of_float (float (int_of_string s)) in
        <:expr< $flo:x$ >>

    | <:expr< ~ $i$ : $e$ >> ->          <:expr< ~ $i$ : $se e$ >>
    | <:expr< lazy $e$ >> ->             <:expr< lazy $se e$ >>
    | <:expr< let $opt:b$ $list:pel$ in $e$ >> -> 
        let pel' = List.map (fun (p, e) -> (p, se e)) pel in
        <:expr< let $opt:b$ $list:pel'$ in $se e$ >>

    | <:expr< $lid:s$ >> -> (* we override the basic operators + - * / *)
        (match s with
             "+" | "-" | "*" | "/" -> <:expr< $lid: s ^ "."$ >>
           | _ -> expr)

    | <:expr< match $e$ with [ $list:pwel$ ] >> ->
        <:expr< match $se e$ with [ $list:spwel pwel$ ] >> 
    | <:expr< { $list:pel$ } >> -> 
        let pel' = List.map (fun (p, e) -> (p, se e)) pel in
        <:expr< { $list:pel'$ } >>
    | <:expr< do { $list:el$ } >> ->     <:expr< do { $list:sel el$ } >>
    | <:expr< $e1$ .[ $e2$ ] >> ->       <:expr< $se e1$ .[ $se e2$ ] >>
    | <:expr< $str:s$ >> -> expr
    | <:expr< try $e$ with [ $list:pwel$ ] >> -> 
        <:expr< try $se e$ with [ $list:spwel pwel$ ] >>
    | <:expr< ( $list:el$ ) >> ->        <:expr< ( $list:sel el$ ) >>
    | <:expr< ( $e$ : $t$ ) >> ->        <:expr< ( $se e$ : $t$ ) >>
    | <:expr< $uid:s$ >> ->              expr
    | <:expr< while $e$ do { $list:el$ } >> -> 
        <:expr< while $se e$ do { $list:sel el$ } >>

    | _ -> 
        Stdpp.raise_with_loc loc 
          (Failure 
             "syntax not supported due to the \
              lack of Camlp5 documentation")

EXTEND
  Pcaml.expr: LEVEL "expr1" [
    [ "FLOAT"; e = Pcaml.expr -> subst_float e ]
  ];
END;;

And the program prog runs nicely:

$ ./prog
2.5 -> -1.44413

You can check the result of the preprocessing of our test program in the standard syntax (prog.ppo) or in the revised syntax (prog.ppr).

Nicer solutions to this kind of problem exist in theory, such as generic tree-traversal functions that could be defined automatically from type definitions. But they have yet to be written.
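To give an idea of what such a generic traversal buys us, here is a minimal sketch in plain OCaml. It does not use Camlp5 at all: the expr type below is a hypothetical toy standing in for MLast.expr, and map_children is written by hand, whereas the idea above is that it could be derived from the type definition.

```ocaml
(* Toy expression type standing in for MLast.expr (hypothetical). *)
type expr =
  | Int of int
  | Float of float
  | Var of string
  | App of expr * expr
  | If of expr * expr * expr
  | Tuple of expr list

(* Apply f to every direct child of a node, leaving the node kind intact.
   This is the only function that has to know about every constructor. *)
let map_children f = function
  | (Int _ | Float _ | Var _) as e -> e
  | App (e1, e2) -> App (f e1, f e2)
  | If (c, t, e) -> If (f c, f t, f e)
  | Tuple el -> Tuple (List.map f el)

(* The transformation itself only lists the cases it cares about. *)
let rec subst_float e =
  match e with
  | Int n -> Float (float_of_int n)
  | Var ("+" | "-" | "*" | "/" as op) -> Var (op ^ ".")
  | _ -> map_children subst_float e

(* (1 + x), written as nested applications of the operator *)
let before = App (App (Var "+", Int 1), Var "x")
let after = subst_float before
```

With this split, the specific part shrinks to two cases and the repetitive part is written once per syntax tree type.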

10. Inserting some code at the beginning of the file

It may be useful to insert some open directives or a few definitions that are used by our runtime system. One solution consists in changing the global function which parses the stream of characters and returns the list of str_items (.ml files) or sig_items (.mli files). This parsing function can be interrupted and reloaded because of directives that might modify the syntax. This is why we must check that the insertion of initial code is made only once.

File prog.ml:

let _ =
  Printf.printf "Version: %s\n" version

File pa_bof.ml:

let insert_this () =
  let loc = Token.dummy_loc in
  (<:str_item< value version = "1.2.3" >>, loc)

let _ =
  let first = ref true in
  let parse strm =
    let (l, stopped) = Grammar.Entry.parse Pcaml.implem strm in
    let l' = 
      if !first then begin
        (* insert the initial code only on the first call *)
        first := false;
        insert_this () :: l
      end
      else l in
    (l', stopped) in
  Pcaml.parse_implem := parse

It seems that the pretty-printer is confused by this hack, and the output looks strange:

File prog.ppo:

let version = "1.2.3"let _ = Printf.printf "Version: %s\n" version

Nevertheless, the AST in binary format is correct since the program is correctly compiled and executed when pr_dump.cmo is used (always loaded implicitly by camlp5o) instead of pr_o.cmo:

$ ./prog
Version: 1.2.3

You can also get the Makefile for this example.
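The "only once" requirement can be illustrated without Camlp5. The following is a minimal sketch in plain OCaml, under the assumption that a parsing function returns a list of items together with a flag telling whether it stopped early; wrap_once and the toy parser are hypothetical names, not part of Camlp5.

```ocaml
(* Wrap a parsing function so that [header] is prepended to the result of
   the first call only. Later calls, for instance after the parser was
   interrupted by a directive, return their result unchanged. *)
let wrap_once ~header parse =
  let first = ref true in
  fun input ->
    let (items, stopped) = parse input in
    if !first then begin
      first := false;
      (header :: items, stopped)
    end else
      (items, stopped)

(* Toy parser: splits on spaces and never stops early. *)
let toy_parse s = (String.split_on_char ' ' s, false)

let parse' = wrap_once ~header:"HEADER" toy_parse
let first_call = parse' "a b"    (* header inserted *)
let second_call = parse' "c"     (* header not inserted again *)
```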

11. Inserting toplevel expressions

When expanding the str_item grammar entry with a new rule, we often want to insert several str_item nodes of the OCaml abstract syntax tree, or sometimes none at all. However, we have to return exactly one node.

In this case, we use the declare ... end construct of the revised syntax to group an arbitrary number of str_items:

<:str_item< 
  declare 
    $x$;
    $y$;
  end
>>

Or:

<:str_item< 
  declare 
    $list: list_of_str_items$
  end
>>

See section 17.3 (Default values for record fields) for a meaningful example.

12. Inserting hidden expressions which are evaluated once

The problem is the following: a syntax extension needs to use some data, such as a cache, that has to be used repeatedly but is initialized only once. Moreover we don't want to expose this definition in the module interface since it will be used transparently and locally.

For instance we can create a count keyword which counts how many times the execution of the program goes through this point, and displays the result when the program terminates:

let f l = 
  print_string "That's a nice list of items:\n";
  List.iter (fun x -> count; print_endline x) l

That could be expanded into something like this:

let f =
  let __count1 = ref 0 in
  at_exit 
    (fun () -> 
       Printf.printf 
         "File \"test_count.ml\", line 3, characters 22-26:\n\
          count = %i\n" 
         !__count1);
  fun l ->
    print_string "That's a nice list of items:\n";
    List.iter
      (fun x ->
         incr __count1;
         print_endline x)
      l

Although there is no built-in functionality for doing this, you can use Yutaka Oiwa's Declare_once library which is included in the distribution of his regexp-pp package. Once compiled, the Declare_once module can be used as follows:

let create_some_ast_node some_param =
  ...
  let expr = ... in
  let name_for_my_expr = ... in
  Declare_once.declare 
    ~package:"my_package" 
    name_for_my_expr
    (Declare_once.Expr expr);
  ...

It works by adding a pair (name, expr) to a list of pending declarations. When the value of the current str_item is computed, this list of declarations is inserted in a way which is similar to our example, so that these declarations are not visible in the module interface but are computed only once.
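The idea of a list of pending declarations can be sketched in a few lines of plain OCaml. This is only a toy model under loud assumptions: code fragments are represented as strings rather than AST nodes, and declare and flush are hypothetical names that do not reflect the actual Declare_once interface.

```ocaml
(* Hypothetical miniature of the mechanism described above; this is not
   the Declare_once API. Code fragments are plain strings here. *)
let pending : (string * string) list ref = ref []

(* Called by the syntax extension while it builds an expression. *)
let declare name code =
  pending := (name, code) :: !pending

(* Called once the enclosing definition is complete: wraps its body in the
   pending declarations (oldest outermost) and empties the list. *)
let flush body =
  let decls = List.rev !pending in
  pending := [];
  List.fold_right
    (fun (name, code) acc -> Printf.sprintf "let %s = %s in %s" name code acc)
    decls body
```

Because the declarations are wrapped around the body of the definition instead of being emitted as separate toplevel items, they stay invisible in the module interface, which is the property we were after.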

13. Mastering priorities

This section is best illustrated with the pa_o.ml file of the Camlp5 distribution. It is time for you to retrieve it and keep a copy somewhere, if you haven't already done so.

First, look at the expr entry of the grammar. The first occurrence of expr: in the file defines what is commented as the "core expressions". It defines many rules, and these rules are grouped into different precedence levels, many of which are named explicitly: "top", "expr1", ":=", "||", "&&", "apply", "simple", etc. Later in the EXTEND statement, the expr entry is extended further with other rules. Some of these additional rules can be inserted in already existing levels. This extends the "expr1" level of the expr entry with an additional rule:

  expr: LEVEL "expr1"
    [ [ "fun"; p = labeled_patt; e = fun_def -> <:expr< fun $p$ -> $e$ >> ] ]
  ;

The innermost brackets define a level or, as here, an extension of an existing level. A level may contain several rules, separated by vertical bars. Lists of levels also use the vertical bar as a separator, so be careful not to confuse the two. Please do not do this:

(* 2 levels, 2 rules, 1 level to extend: 
   Which level is extended?
   Which level is inserted? Where? *)
  entry: LEVEL "some level"
    [ [ some rule ]
    | [ some other rule ] ];

which is different from:

(* extending 1 level with 2 rules: this is clear *)
  entry: LEVEL "some level"
    [ [ some rule
      | some other rule ] ];

Levels have this property: when the parser is looking for a given syntax entry, it starts at a given level (by default the first one) and looks for rules that can be satisfied. If no rule can be satisfied in the current level, it goes to the next level, and repeats the same process. The practical consequences are that:

different levels can be used to define different levels of precedence, such as addition vs. multiplication: the level of addition comes first, and multiplication comes in the next level (see file pa_o.ml). Viewed like this, addition has a higher priority than multiplication in the sense that the parser tries it first; as a result, multiplication binds more tightly.
Within rules, the level where the parser must start to match a given entry may be specified, so that the rules contained in the preceding levels of the entry are not available (for an example, see the definition of record types with default values in section 17.3).

As stated in the reference manual, only the LEVEL instruction can be used to extend an existing level. Other instructions that specify where a given level must be inserted are available: FIRST, LAST, AFTER some level, BEFORE some level. These positions refer to the order in which they are written, which is the order in which the parser tries to match the rules.
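The effect of the order of levels can be reproduced with a tiny hand-written parser. The sketch below is plain OCaml and only an analogy: it is not how Camlp5 implements grammar entries, and the token and expr types are hypothetical. Each function plays the role of one level and takes its operands from the next level.

```ocaml
type token = INT of int | PLUS | STAR

type expr = Num of int | Add of expr * expr | Mul of expr * expr

(* Last level, "simple": integer constants only. *)
let parse_simple = function
  | INT n :: rest -> (Num n, rest)
  | _ -> failwith "integer expected"

(* Second level, "*": operands come from the next level. Left-associative. *)
let parse_mul tokens =
  let rec loop left = function
    | STAR :: rest ->
        let (right, rest) = parse_simple rest in
        loop (Mul (left, right)) rest
    | rest -> (left, rest)
  in
  let (left, rest) = parse_simple tokens in
  loop left rest

(* First level, "+": this is where parsing starts by default. *)
let parse_add tokens =
  let rec loop left = function
    | PLUS :: rest ->
        let (right, rest) = parse_mul rest in
        loop (Add (left, right)) rest
    | rest -> (left, rest)
  in
  let (left, rest) = parse_mul tokens in
  loop left rest

(* 1 + 2 * 3 is read as 1 + (2 * 3) *)
let example = fst (parse_add [INT 1; PLUS; INT 2; STAR; INT 3])
```

The level that comes first ends up at the root of the tree, and the level that comes later groups its operands more tightly.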

Suggested exercise: implement and test a syntax extension which supports a where construct. For instance,

a + b where a = 1 and b = 2

means

let a = 1 and b = 2 in a + b

We decide that

let a = 1 in a where a = 2

should be read as

let a = 1 in (a where a = 2) (* returns 2 *)

and not

(let a = 1 in a) where a = 2 (* returns 1 *)

Also, the where construct is right-associative:

x + y where x = y where y = 1

means

x + y where x = (y where y = 1) (* depends on an external y *)

and not

(x + y where x = y) where y = 1 (* returns 2 *)

You are encouraged to reuse the let_binding grammar entry (Pcaml.let_binding). Right-associativity must be specified with RIGHTA since the default is left-associativity (LEFTA); you can find examples of these specifications in the pa_o.ml file. After completion of this exercise, you should be able to:

extend a grammar entry with new syntax rules;
give these rules the correct associativity, before even testing.
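Before writing the grammar rules, it may help to check the intended readings by hand. The following plain OCaml sketch spells out each reading with let ... in, as a where construct would presumably be expanded; the value 10 chosen for the external y is an arbitrary assumption made for the sake of the example.

```ocaml
[@@@warning "-26"] (* the shadowed outer bindings are unused on purpose *)

(* let a = 1 in (a where a = 2) *)
let chosen = let a = 1 in (let a = 2 in a)               (* 2 *)

(* (let a = 1 in a) where a = 2 *)
let rejected = let a = 2 in (let a = 1 in a)             (* 1 *)

let y = 10  (* the "external y" *)

(* x + y where x = (y where y = 1) *)
let chosen_assoc = let x = (let y = 1 in y) in x + y     (* 11 *)

(* (x + y where x = y) where y = 1 *)
let rejected_assoc = let y = 1 in (let x = y in x + y)   (* 2 *)
```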

14. Local use of external parsers

This section gives hints on how to parse some blocks using a custom parser. We will not give too much detail here, since the recommended way of doing this is by using quotations. Make sure you understand the rest of this document before reading this.

14.1 Parsing raw blocks of text

When the language extension that must be parsed locally cannot be parsed using the Camlp5 grammar system, we would normally use quotations. Consider the following example where a graph is represented using ASCII art:

"Node 1"---B---D
   |        \ /
   +---------C

The graph should be expanded into the following type definitions:

type node_1 = [ `B of b | `C of c ]
and b = [ `Node_1 of node_1 | `C of c | `D of d ]
and c = [ `Node_1 of node_1 | `B of b | `D of d ]
and d = [ `B of b | `C of c ]

Actually, this graph should be included in an OCaml program, so we would create a quotation expander named graph, and our piece of program should be written like this:

<:graph<
"Node 1"---B---D
   |        \ /
   +---------C
>>

However, one limitation of quotations is that they must be expanded into either an expr or a patt, but not into a type definition, which is a str_item. So this will not be accepted as-is by the parser.

Solution 1: instead of using a quotation, just create a GRAPH keyword which will be followed by a string literal. This can be expanded into a str_item without specific difficulties, given a function which will parse the string. The problem here is that double-quotes must be protected by backslashes, which may be inconvenient. The program would look like this, which is hard to read unless we avoid double-quoted labels:

GRAPH "
\"Node 1\"---B---D
   |        \ /
   +---------C
"

Solution 2: same as solution 1, but in addition we define a quotation expander named string which just lets us write a string literal using the quotation syntax. In this case, only the >> sequences would have to be protected by backslashes. The example becomes:

GRAPH <:string<
"Node 1"---B---D
   |        \ /
   +---------C
>>

14.2 Parsing the stream of tokens

Now, if the token stream returned by the lexer is satisfactory, but your grammar needs to scan the stream first without consuming it, you can do it. You can actually hook any external parser at this point. It will operate on the token stream, with its limitations (whitespace is discarded, tokens may not be recognized the way you want in your sublanguage, ...).

15. Producing useful error messages

The easiest way of generating error messages that indicate a location in the source file is the following:

Stdpp.raise_with_loc _loc (Failure "this is an error message")

It displays the location by indicating file, line number and character offsets, as usual in OCaml. Under Emacs with tuareg-mode, this makes it possible to jump directly to this location in the source file. However, this raises an exception, which is not always wanted.

A similar error message can be produced using the following function:

open Printf
open Lexing

(* works only if done immediately, since the file name can change when a
   #line or #use directive is encountered *)

let string_of_loc _loc =
  let start, stop = _loc in
  let char1 = start.pos_cnum - start.pos_bol in
  let char2 = char1 + stop.pos_cnum - start.pos_cnum - 1 in
  sprintf "File %S, line %i, characters %i-%i:\n"
    !Pcaml.input_file (* should be: start.pos_fname*) 
    start.pos_lnum
    char1 char2

Beware that there has been a bug which caused the pos_fname record field to not be set correctly (bug report 3886). This is why we don't use it, although it should be a better solution since it does not depend on any external state.
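For reference, here is a sketch of the variant that relies on pos_fname instead of the external state. It is plain OCaml and assumes, like the function above, that a location is a pair of Lexing.position values; string_of_positions is a hypothetical name, and the positions in the example are made up.

```ocaml
open Lexing

(* Same message as string_of_loc, but the file name is taken from the
   location itself rather than from a global reference. *)
let string_of_positions (start, stop) =
  let char1 = start.pos_cnum - start.pos_bol in
  let char2 = char1 + stop.pos_cnum - start.pos_cnum - 1 in
  Printf.sprintf "File %S, line %i, characters %i-%i:\n"
    start.pos_fname start.pos_lnum char1 char2

(* A made-up location: 5 characters starting at column 22 of line 3. *)
let start =
  { pos_fname = "test_count.ml"; pos_lnum = 3; pos_bol = 100; pos_cnum = 122 }
let stop = { start with pos_cnum = 127 }

let message = string_of_positions (start, stop)
```

This version only gives the right file name if pos_fname was set correctly by the tool that produced the location.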

Of course, the user of the syntax extension must not load pr_o.cmo (conversion to OCaml source file in standard syntax) when preprocessing a source file with camlp5o, since it does not preserve the original location indicators. The default output format should be used. It is provided by the pr_dump.cmo file which is preloaded in camlp5o or camlp5r. This format is a binary representation of the abstract syntax tree, with locations that match the source code.

See also loc vs. _loc and why you should always use the -loc option when preprocessing a syntax extension.

16. Suggestions for a better interaction between multiple syntax extensions

These are guidelines which should make it easier for programmers to actually use the syntax extensions that you may have written.

16.1 Avoiding strong incompatibilities

If possible, do not override existing rules: this might be fine if only your extension is being used, but if another extension does the same, only one of these extensions can be used at a time. Sometimes, deleting a rule and rewriting an extended version of it is the only way to "extend" existing syntax constructs, but using other keywords instead is always possible.

EXTEND statements are expressions and they can be parametrized by some runtime parameters. It is a good idea to provide an option which allows the user to specify a given keyword instead of the default one. For instance, instead of this:

(* file pa_eval.ml *)
...

EXTEND
  Pcaml.expr: [
    [ "eval"; e = Pcaml.expr -> ... ]
  ]
END

we would write the following:

(* file pa_eval.ml *)
...

let extend opt =
  let kw = !opt in
  EXTEND
    Pcaml.expr: [
      [ $kw$; e = Pcaml.expr -> ... ]
    ]
  END

let _ =
  let eval = ref "eval" in
  Pcaml.add_option "-eval-kw" 
    (Arg.Set_string eval)
    "<kw>  use another keyword than \"eval\"";

Now the users of the syntax extension can load it with camlp5o pa_eval.cmo -eval-kw EVAL if they want the new keyword to be EVAL instead of eval.
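The option handling itself relies on the standard Arg module and can be tried outside of Camlp5. The sketch below is plain OCaml: it simulates the command line with Arg.parse_argv instead of registering the option with Pcaml.add_option, and the names eval_kw and speclist are only illustrative.

```ocaml
(* The reference holding the keyword, with its default value. *)
let eval_kw = ref "eval"

(* Same triple (option name, action, documentation) as passed to
   Pcaml.add_option in the example above. *)
let speclist =
  [ ("-eval-kw", Arg.Set_string eval_kw,
     "<kw>  use another keyword than \"eval\"") ]

let () =
  (* Simulated command line; ~current:(ref 0) keeps the global parsing
     position of the Arg module untouched. *)
  let argv = [| "camlp5o"; "-eval-kw"; "EVAL" |] in
  Arg.parse_argv ~current:(ref 0) argv speclist
    (fun _anonymous -> ()) "usage: camlp5o [options]"
```

Note that the reference only holds the user's keyword once the command line has been parsed, so the keyword must be read after that point.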

16.2 Avoiding name clashes

In order to minimize conflicts between existing syntax extensions that could be used simultaneously, the following rules are suggested:

the name of a library which extends the syntax should start with "pa_" (e.g. pa_eval.cmo)
the name of a library which defines a pretty-printer should start with "pr_" (e.g. pr_eval.cmo)
the name of a library which defines quotation expanders should start with "q_" (e.g. q_eval.cmo)
check that the name of the library that you intend to publish is not already taken, with or without the "pa_", "pr_" or "q_" prefix, unless you are specifically writing syntactic sugar for this library.
if your syntax extension accepts options, beware that other syntax extensions might use the same option names, which won't work when used simultaneously. You can assume the exclusivity of option names that start with the same name as your syntax extension (e.g. -eval-kw).
hidden identifiers that are introduced into the OCaml AST should start with two underscores followed by the name of the extension file, including the "pa_" or "pr_" prefix (e.g. __pa_eval1234).

Please note that many existing extensions do not respect all of these (new, unofficial) guidelines, but if you follow them it means less trouble for you in the future.

17. Things you can do

17.1 Catching exceptions only where needed: `let try name = expr1 in expr2 with exception-handler`

Sometimes, it is useful to restrict the scope of an exception handler. The regular try ... with lets us do this:

let rec cat () =
  try 
    let c = input_char stdin in
    print_char c;
    cat ()
  with End_of_file -> ()

let _ = cat ()

but it catches exceptions that might be raised not only during the call to input_char but also print_char and cat itself. That is problematic for several reasons that we don't want to discuss here.

In order to catch the exceptions that are raised during the call to input_char, it can be quite difficult to keep the code simple and readable. Here is one solution which is relatively natural:

let rec cat () =
  match
    try Some (input_char stdin)
    with End_of_file -> None
  with
      Some c -> 
        print_char c;
        cat ()
    | None -> ()

let _ = cat ()

Another solution, which is hard to read but simple to implement mechanically is the following:

let rec cat () =
  (try 
     let c = input_char stdin in
     fun () ->
       print_char c;
       cat ()
   with End_of_file -> fun () -> ()) ()

let _ = cat ()
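The difference between the two scopes can be observed on a small example that does not read from stdin. This is a sketch in plain OCaml; broad, narrow and the table are hypothetical names, and the continuation k stands for "the rest of the computation".

```ocaml
(* Broad handler: also catches Not_found raised by the continuation k. *)
let broad tbl key k =
  try k (Hashtbl.find tbl key)
  with Not_found -> "missing key"

(* Narrow handler, using the closure trick: only Hashtbl.find is protected.
   The continuation runs after we have left the try ... with block. *)
let narrow tbl key k =
  (try
     let v = Hashtbl.find tbl key in
     fun () -> k v
   with Not_found -> fun () -> "missing key") ()

let tbl : (string, int) Hashtbl.t = Hashtbl.create 16
let () = Hashtbl.add tbl "a" 1

(* A continuation which fails for its own reasons. *)
let failing_k _ = raise Not_found

(* The key exists, yet the broad version reports it as missing. *)
let misleading = broad tbl "a" failing_k
```

With narrow, the same call lets the exception of the continuation escape, so the failure is not mistaken for a missing key.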

This is the solution we choose here to implement a new let-try-in-with construct which was suggested by Don Syme in a message to caml-list. It looks like this:

File prog.ml:

let rec cat () =
  let try c = input_char stdin in
  print_char c;
  cat ()
  with End_of_file -> ()

let _ = cat ()

Note that we just inverted the let and try keywords with respect to the original program.

The syntax extension is pretty straightforward and reuses some entries of the grammar of OCaml: Pcaml.let_binding, Pcaml.expr and Pcaml.patt. You can notice that these entries are defined in the Pcaml module, not in Pa_o (file pa_o.ml). The reason is that the grammar for the revised syntax of OCaml (file pa_r.ml) shares the same public entries. This leaves the possibility of writing syntax extensions of the regular syntax (as we do in this tutorial) which also work to extend the revised syntax.

Unfortunately, many entries found in pa_o.ml that we would like to modify are not visible from outside.

In this example, we create a new entry lettry_case which is very similar to the match_case entry found in pa_o.ml:

File pa_lettry.ml:

EXTEND
  GLOBAL: Pcaml.expr;

  Pcaml.expr: LEVEL "expr1" [
    [ "let"; "try"; o = OPT "rec"; l = LIST1 Pcaml.let_binding SEP "and"; 
      "in"; e = Pcaml.expr;
      "with"; pwel = LIST1 lettry_case SEP "|" ->
        <:expr< (try 
                   let $opt: o <> None$ $list:l$ in 
                   fun () -> $e$ 
                 with 
                   [ $list:pwel$ ]) () >>  ]
  ];

  lettry_case: [ 
    [ p = Pcaml.patt; 
      w = OPT [ "when"; e = Pcaml.expr -> e ]; "->"; 
      e = Pcaml.expr -> (p, Ploc.VaVal w, <:expr< fun () -> $e$ >>) ]
  ];
END;;

When a GLOBAL statement is present, it means that any new entry will be created automatically and will not be visible outside of the EXTEND block. To make the lettry_case visible, we would proceed as follows:

let lettry_case = Grammar.Entry.create Pcaml.gram "lettry_case";;

EXTEND
  (* no GLOBAL statement *)
  Pcaml.expr: ... ;
  lettry_case: ... ;
END;;

Our program in the new syntax is successfully transformed into this one:

File prog.ppo:

let rec cat () =
  (try let c = input_char stdin in fun () -> print_char c; cat () with
     End_of_file -> fun () -> ())
    ()

let _ = cat ()

The program prints on stdout the characters read from stdin:

$ echo Hello | ./prog
Hello

Warning: we also should extend the Pcaml.str_item entry, using the same code as for Pcaml.expr, just like for the standard let-in construct found in pa_o.ml.

Alternate syntax: we might prefer a syntax where the with is internal. It makes it easier to realize that the recursive call to our cat function is a tail-call. This was suggested by Daniel de Rauglaudre. It goes like this:

let rec cat () =
  let try c = input_char stdin 
      with End_of_file -> ()
  in
  print_char c;
  cat ()

let _ = cat ()

Implementing this is left as an exercise for the reader.

17.2 Read `1/2` as `1. /. 2.`, but only locally

A full solution to this problem is given earlier, in section 9 (Local syntax extensions, transformations of the AST).

17.3 Default values for record fields

Although it is not very easy to extend the existing syntax for type definitions, we can easily add alternative syntaxes.

Here we will create a record keyword that we will use for the definition of records where some fields are defined with default values. A function with labeled arguments will be generated automatically and should be used by the user for creating these records.

This is our test program:

File prog.ml:

record bob = { foo : string = "Hello";
               bar : string;
               mutable n : int = 1 }

record weird = { x : weird option = (Some (create_weird ~x:None ())) }

let _ =
  let x = create_bob ~bar:"World" () in
  x.n <- x.n + 1;
  Printf.printf "%s %s %i\n" x.foo x.bar x.n

There is no big difficulty since we chose not to extend the type syntax for type definitions but to create a new one, just for records.

Note that the expressions that are given as default values for record fields are parsed from the "simple" precedence level (see Pcaml.expr LEVEL "simple" in the field_decl entry below). It means that unless parentheses are placed around the expression, the semicolon will be interpreted as a separator between two record fields, not between two expressions.

File pa_records.ml:

let make_record_expr loc l =
  let fields =
    List.map (fun ((loc, name, mut, t), default) -> 
                (<:patt< $lid:name$ >>, <:expr< $lid:name$ >>)) l in
  <:expr< { $list:fields$ } >>

let expand_record loc type_name l =
  let type_def = 
    let fields = List.map fst l in
    <:str_item< type $lid:type_name$ = { $list:fields$ } >> in
  let expr_def =
    let record_expr = make_record_expr loc l in
    let f =
      List.fold_right
        (fun ((loc, name, mut, t), default) e ->
           match default with
               None ->
                 <:expr< fun ~ $Ploc.VaVal name$ -> $e$ >>
             | Some x ->
                 <:expr< fun ? ($lid:name$ = $x$) -> $e$ >>)
        l
        <:expr< fun () -> $record_expr$ >> in
    <:str_item< value rec $lid: "create_" ^ type_name$ = $f$ >> in
  <:str_item< declare $type_def$; $expr_def$; end >>

EXTEND
  GLOBAL: Pcaml.str_item;

  Pcaml.str_item: LEVEL "top" [
    [ "record"; type_name = LIDENT; "="; 
      "{"; l = LIST1 field_decl SEP ";"; "}" -> expand_record loc type_name l ]
  ];

  field_decl: [
    [ mut = OPT "mutable";
      name = LIDENT; ":"; t = Pcaml.ctyp; 
      default = OPT [ "="; e = Pcaml.expr LEVEL "simple" -> e ] -> 
        ((loc, name, (mut <> None), t), default) ]
  ];
END;;

Our program prog.ml has been converted into prog.ppo and works as expected:

$ ./prog
Hello World 2

You can download the Makefile.
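To make the result more concrete, here is roughly what we expect the bob definition to be expanded into. This is written by hand in plain OCaml from a reading of expand_record, not copied from the preprocessor output; in particular the extension emits a recursive definition (value rec), which is omitted here because bob's default values do not need it.

```ocaml
(* The record type itself, without the default values. *)
type bob = { foo : string; bar : string; mutable n : int }

(* One function layer per field: optional with a default value when one was
   given, labeled otherwise, and a final unit argument. *)
let create_bob =
  fun ?(foo = "Hello") -> fun ~bar -> fun ?(n = 1) -> fun () ->
    { foo = foo; bar = bar; n = n }

(* Same usage as in prog.ml. *)
let x = create_bob ~bar:"World" ()
let () = x.n <- x.n + 1
let summary = Printf.sprintf "%s %s %i" x.foo x.bar x.n
```

The final unit argument is what allows the optional arguments to be omitted in a complete application.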

17.4 Anonymous recursive functions

This is left as an exercise for the reader: we decide that the rec keyword preceding a function makes this function available under the name self throughout its definition. For instance, the following:

List.map 
  (rec function 0 -> 1 | n -> n * self (n - 1))
  [1;2;3;4;5]

would be transcribed into:

List.map 
  (let rec self = function 0 -> 1 | n -> n * self (n - 1) in self)
  [1;2;3;4;5]

Hint: some expressions other than functions can be defined recursively. How would you define the following list in our new syntax?

(* This is a circular list *)
let rec circ = 1 :: 2 :: circ

18. Things you cannot do and workarounds

18.1 Inserting anything, anywhere

Extending the syntax of OCaml consists in adding or replacing rules in the grammar. However the terminal rules, i.e. the tokens returned by the lexer such as LIDENT, STRING or INT, cannot be extended.

Consider the following syntax extension where we create a one keyword which is simply replaced by 1 in expressions and in patterns:

EXTEND
  Pcaml.expr: LEVEL "simple" [ 
    [ "one" -> <:expr< 1 >> ]
  ];
  Pcaml.patt: LEVEL "simple" [ 
    [ "one" -> <:patt< 1 >> ]
  ];
END;;

This will not replace every occurrence of one by 1, but only where one appears as a lowercase identifier as defined by the lexer, as an expression or a pattern. So one_apple and "one + 2" will remain unchanged.

If you need for instance to parametrize the name of an identifier by adding a suffix such as a version number, you can't do it by defining grammar rules. In that case, one solution is to use a simple preprocessor which simply ignores the context, or to define your own quotation expander.

Quotations behave as one single token, which will be expanded into a node of the OCaml syntax tree which is either an expression (expr) or a pattern (patt). Quotations are a good way to introduce a syntax which is radically different from OCaml. All you have to do is define a syntax expander, i.e. a function which builds an expression or a pattern from a raw string. For this you can use any technique you like such as Camlp5 (lexer + grammar), Ocamllex + Ocamlyacc, regular expressions, etc. See the Camlp5 manual for the details on how to define a quotation expander.

18.2 Adding end of line comments

End of lines that separate tokens and comments are eliminated by the lexer. This is why nothing can be done to solve this problem with extensible grammars, although it should be relatively easy to adapt the lexer for this task.

18.3 Adding a notation for raw strings

Adding customized delimiters for string literals cannot be done by extending the grammar.

One alternative is to define a quotation expander whose job is to transform the contents of the quotation into a valid OCaml string. In this case, instead of escaping the double-quotes ("), we would have to escape the end-of-quotation delimiters (>>). The code which would be compiled and loaded by the preprocessor should look like this (not tested):

let _ =
  (* we define a very simple quotation expander *)
  let expander is_expr quotation_contents =

    (* addition of double-quotes around the string 
       and backslashes where necessary *)
    let s = Printf.sprintf "%S" quotation_contents in

    (* the result is plain-text OCaml code (concrete syntax) *)
    Quotation.ExStr s in

  Quotation.add "string" expander;
  (* we decide that `string' will be the default quotation expander *)

  Quotation.default := "string"

Now, in a program which is preprocessed with this, the three following notations are equivalent:

  <:string< I don't want to escape this: """""""""" >>
  << I don't want to escape this: """""""""" >>
  " I don't want to escape this: \"\"\"\"\"\"\"\"\"\" "

The syntax expander can also return a node of the AST, but it is more complicated to implement and we lose the location of the quotation, which can make debugging quite unpleasant (again, not tested):

let _ =
  (* we define a very simple quotation expander *)
  let quote_string s = 
    (* no double-quotes around the strings in AST nodes! *)
    String.escaped s in

  let loc = Token.dummy_loc (* avoid doing this whenever you can *) in

  (* here the result is a pair of functions that
     return the appropriate node of the syntax tree (abstract syntax) *)
  let expand_expr quotation_contents =
    let s = quote_string quotation_contents in
    <:expr< $str:s$ >> 

  and expand_patt quotation_contents =
    let s = quote_string quotation_contents in
    <:patt< $str:s$ >> in
    
  let expander = Quotation.ExAst (expand_expr, expand_patt) in
  Quotation.add "string" expander;
  Quotation.default := "string"
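The two expanders above differ in how they quote the contents: the first one produces concrete syntax and needs the surrounding double-quotes, the second one fills an AST node and must not have them. The difference between Printf.sprintf "%S" and String.escaped can be checked in plain OCaml, independently of Camlp5; the variable names below are only illustrative.

```ocaml
(* Raw contents of a quotation: contains double-quotes and a newline. *)
let raw = "say \"hi\"\n"

(* Concrete syntax: escaped and surrounded by double-quotes. *)
let as_source = Printf.sprintf "%S" raw

(* Payload for an AST node: escaped, but without surrounding quotes. *)
let as_payload = String.escaped raw

let same_up_to_quotes = (as_source = "\"" ^ as_payload ^ "\"")
```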

18.4 Adding Haskell's "infixing" backquotes such as f `map` list

The backquote symbol (`) is already in use as a prefix operator for constructors of polymorphic variants and in the Camlp5 extension for stream parsers.

Other notations could be used though. Maybe using & is not possible due to priority issues, but we would have something like this:

let add a b = (2 * a) + b
let c = 1 &add 2

which means:

let c = add 1 2

and not:

let c = (add 2) 1

That makes a good exercise for the reader! I don't know if there is an acceptable solution, so let me know if you find one.

Hint: we have to define an infix operator which is accepted by Camlp5 and available (or that can be overridden), and has a stronger precedence than function application ("apply" level) just like . or #.

18.5 Adding SML's #n notation to extract field number n of any tuple

This problem is: how to define a function which returns the nth element of a tuple of any size?

Unfortunately, Camlp5 cannot help much here since it doesn't know the type of the expressions it manipulates.

But if we accept to specify how many fields the tuple has, it becomes feasible. We would have to define a syntax close to this:

let x = (1, "abc", None)
let third_field = x.3|3

which would mean:

let x = (1, "abc", None)
let third_field = (match x with (_, _, field) -> field)

As often, the difficulty is to find a nice syntax which does not create ambiguities and is accepted by Camlp5.

19. Common problems

19.1 I can't build a list with something like `<:expr< [ $list:my_list$ ] >>`

In the syntax tree, there is a node for each cell of a list, and there is no predefined function that will create all these AST nodes automatically.

Let's say we want to create a notation for lists without semicolons between the elements. A program using this notation would look like this:

File prog.ml:

let _ = 
  let a = [| 123; 456 |] in
  List.iter 
    (fun i -> print_int i; 
       print_newline ()) 
    (LIST 1 2 3 a.(1))

The syntax extension is rather short, and easy if you understand the system of levels:

File pa_lists.ml:

let expr_list loc l =
  List.fold_right 
    (fun head tail -> <:expr< [ $head$ :: $tail$ ] >>)
    l
    <:expr< [] >>


EXTEND
  Pcaml.expr: [
    [ "LIST"; l = LIST0 Pcaml.expr LEVEL "." -> expr_list loc l ]
  ];
END;;

As announced, we need to build the nodes of the AST that represent the nodes of the list. This is the purpose of the expr_list function.

The output of the program is the following:

$ ./prog
1
2
3
456

You can also download the following files for this example: the Makefile and the program after conversion to regular OCaml prog.ppo.
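The folding done by expr_list can be tried without Camlp5 on a toy syntax tree. In this plain OCaml sketch, the expr type with its Nil and Cons constructors is a hypothetical stand-in for the real AST nodes that the quotations build.

```ocaml
(* Toy AST: a list literal is a chain of Cons nodes ending with Nil. *)
type expr = Int of int | Nil | Cons of expr * expr

(* Same shape as expr_list above: start from the empty list and add one
   Cons node per element, from the last element to the first. *)
let expr_list l =
  List.fold_right (fun head tail -> Cons (head, tail)) l Nil

let example = expr_list [Int 1; Int 2; Int 3]
```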

19.2 I can't build a function declaration with something like `<:expr< let f $list:args$ = $e$ >>`

Functions as represented in the AST only take one argument. So this:

let f x y z = x + y + z

is represented in the AST as:

let f =
  (fun x ->
     (fun y ->
        (fun z -> x + y + z)))

Such definitions have to be built using higher-order functions such as List.fold_right or List.fold_left (see previous section).
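As a minimal illustration in plain OCaml, here is how both folds can be used on a toy syntax tree. The expr type and the functions make_fun and make_app are hypothetical; with Camlp5 the Fun and App nodes would be built with the corresponding quotations.

```ocaml
(* Toy AST with one-argument functions and one-argument applications. *)
type expr =
  | Var of string
  | Fun of string * expr
  | App of expr * expr

(* fun x -> fun y -> fun z -> body: the first argument is outermost,
   so we fold from the right. *)
let make_fun args body =
  List.fold_right (fun arg e -> Fun (arg, e)) args body

(* ((f a) b) c: the first argument is innermost, so we fold from the left. *)
let make_app f args =
  List.fold_left (fun acc arg -> App (acc, arg)) f args

let def = make_fun ["x"; "y"; "z"] (Var "body")
let call = make_app (Var "f") [Var "a"; Var "b"]
```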

19.3 Incorrect locations in error reports

It happens that Camlp5 returns incorrect locations in error messages under some circumstances. Camlp5 3.08.1 was particularly difficult to use for this reason, so if you are using OCaml 3.08.1, you should upgrade your OCaml system.

19.4 Unbound value `_loc` (or `loc`)

Between the release of OCaml 3.08.2 and 3.08.3, the default identifier for locations used in syntax extensions silently changed from loc to _loc.

For compatibility reasons, pass the -loc _loc option (or -loc loc) to camlp5o as we did in the Makefiles of this tutorial.

19.5 What's wrong with labels: `<:expr< f ~$lid:labelname$ >>` doesn't work

Labels of function arguments are a special kind of node of the syntax tree which is simply represented using the string type and only include lowercase identifiers. Instead of writing this:

let label = "x" in
<:expr< f ~$lid:label$ >>

one should simply write that:

let label = "x" in
<:expr< f ~$label$ >>

19.6 `Not_found` is raised during the preprocessing

If the Not_found exception is raised during the preprocessing phase (typically while running camlp5o or starting a custom toplevel), the reason may be that a DELETE_RULE statement tries to delete a rule which does not exist. Some rules may be slightly changed from one version of Camlp5 to another or they might move to other grammar entries.

For the sake of compatibility, it is good practice to catch and ignore any Not_found exception that might be raised by a DELETE_RULE statement. This is possible because a DELETE_RULE statement is simply an expression.

For instance, this will fail with some older versions of Camlp5:

DELETE_RULE Pcaml.patt: LIDENT END

The following should be a much better compromise:

(try DELETE_RULE Pcaml.patt: LIDENT END
 with Not_found -> ())
