Micmatch - Design Notes

This is a collection of notes concerning debatable aspects of the design of the Micmatch system. Use the wiki if you want to share your views.

Contents
1. Syntax
1.1 General syntax of regexps
1.2 Uppercase keywords
1.2.1 "match" patterns
1.2.2 Name of the macros
1.3 Type of named subgroups
2. Choice of regexp libraries
2.1 Diversity
2.2 PCRE
3. Compatibility issues
3.1 Versions
3.2 Cohabitation with other syntax extensions

1. Syntax

1.1 General syntax of regexps

The general syntax for regular expressions is based on what is already in use in ocamllex. Regular expressions are often not so simple, and representing them as a compact string which is full of backslashes (\) is definitely not user-friendly. With our syntax, the different tokens of the regexp can be highlighted by the text editor, and there is no need to know a list of "special characters". Special characters are simply not mixed with characters that we want to match.
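
To make the backslash problem concrete, here is a minimal plain-OCaml sketch (no Micmatch required) of "digits, a literal dot, digits" written as a string for the Str library. The pattern and the helper count_backslashes are illustrative only and are not part of Micmatch. In Micmatch syntax the same regexp would look roughly like digit+ "." digit+, assuming a predefined digit class analogous to the graph and space classes used elsewhere in these notes.

```ocaml
(* "One or more digits, a literal dot, one or more digits", written as a
   string for the Str library. Str uses \( \) for groups, and every
   backslash must be doubled inside an OCaml string literal. *)
let str_pattern = "\\([0-9]+\\)\\.\\([0-9]+\\)"

(* Illustrative helper: count the backslashes that reach the regexp engine. *)
let count_backslashes s =
  let n = ref 0 in
  String.iter (fun c -> if c = '\\' then incr n) s;
  !n

let () =
  Printf.printf "%s contains %d backslashes\n"
    str_pattern (count_backslashes str_pattern)
```

The source text needs two backslashes for each one counted here, which is the readability cost that the ocamllex-style syntax avoids.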

1.2 Uppercase keywords

I would like to keep the current uppercase convention for all the newly created alphabetic keywords because it makes it clear that they are part of a syntax extension, not regular identifiers of the core OCaml language. This should make things easier for people who don't know by heart all the keywords of either OCaml or Micmatch.

1.2.1 "match" patterns

match "abc def" with
    RE graph+ as x -> x
  | _ -> "???"

In the previous example, it is easy to notice that RE is a new keyword which introduces a section of code with a special syntax. The keyword as is lowercase because it already exists as a keyword in regular OCaml patterns and in ocamllex regexps, with an analogous meaning.

However, the sandwich form for regexp patterns sometimes looks nicer than the RE notation. It has been supported since version 0.689 and is strictly equivalent:

match "abc def" with
    / graph+ as x / -> x
  | _ -> "???"

Another syntax in the same style as stream parsers could have been chosen:

match "abc def" with
    [/ graph+ as x /] -> x
  | _ -> "???"

Advantages of / ... / : easier to type; lighter for short regexps.
Advantages of [/ ... /] : easier to read, especially with a text-editor which detects matching brackets (but we can always add parentheses to make things clear); avoids the use of parentheses in some rare cases; more consistent with the existing OCaml syntax and other syntax extensions.

Question: what do people prefer? If you have a preference, please let me know. We might switch to the [/ ... /] notation but we can't keep both because putting a space between the [ and the / would have a different meaning, which would be disastrous for code readability.

1.2.2 Name of the macros

The various "macros" available in the PCRE variant of Micmatch all use uppercase characters. This may not look very elegant for syntactic constructs that are much more subtle than C macros, but once again it avoids confusing new users and is unlikely to interfere with other library functions:

# let split_list = SPLIT space* "," space* (* SPLIT keyword *) ;;
val split_list : ?full:bool -> ?pos:int -> string -> string list = <fun>
# split_list "a, b, cde";;
- : string list = ["a"; "b"; "cde"]
# Pcre.split (* good, lowercase split is not a keyword *);;
- : ?iflags:Pcre.irflag ->
    ?flags:Pcre.rflag list ->
    ?rex:Pcre.regexp ->
    ?pat:string ->
    ?pos:int -> ?max:int -> ?callout:Pcre.callout -> string -> string
    list
= <fun>

Finally, the Lazy and Possessive annotations that can be found in regexps play the role of keywords within regexps, but technically they are not keywords since they can be used normally outside of regexps. The Lazy module of the OCaml standard library can therefore still be used without any problem.
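
As a reminder of what ordinary use of the Lazy module looks like, here is a small plain-OCaml sketch. It does not involve Micmatch, and the names counter, v, a and b are chosen purely for illustration; the point is only that Lazy remains a regular module name outside of regexps.

```ocaml
(* Plain OCaml: the standard Lazy module, used as a normal identifier. *)
let counter = ref 0

(* The suspended computation runs at most once, on first force. *)
let v = lazy (incr counter; 6 * 7)

let a = Lazy.force v   (* evaluates the body and increments counter *)
let b = Lazy.force v   (* reuses the cached result *)
```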

1.3 Type of named subgroups

In ocamllex patterns, if a binding does not appear on each side of an alternative (|), then the identifier is associated with a value of type string option. In the following, the identifier greeting can have 3 possible values: Some "Hello", Some "hello" or None:

['H''h']"ello" as greeting | ""    { greeting }

In Micmatch, this is an invalid pattern since each binding must occur on both sides of an alternative. This follows the behavior of the regular pattern-matching of OCaml. It is not a real problem since in this case the empty string "" is equivalent to None (matching the empty string always succeeds):

match read_line () with
    RE ['H''h']"ello" as greeting | ("" as greeting) -> greeting

This would preferably be written as:

match read_line () with
    RE (['H''h']"ello" | "") as greeting -> greeting

Or just:

match read_line () with
    RE (['H''h']"ello")? as greeting -> greeting
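
The rule that each binding must occur on both sides of an alternative mirrors ordinary OCaml or-patterns. The following plain-OCaml sketch (no Micmatch involved) illustrates that rule, together with a hypothetical helper, greeting_option, that recovers the ocamllex-style string option from the Micmatch-style string by treating the empty string as None.

```ocaml
(* Plain OCaml: both sides of an or-pattern must bind the same variables. *)
let first_some pair =
  match pair with
  | (Some x, _) | (None, Some x) -> x   (* x is bound on both sides *)
  | (None, None) -> ""

(* An alternative such as  (Some x, _) | (None, _)  is rejected by the
   compiler because x is missing on the right-hand side. *)

(* Hypothetical helper: the empty string stands for "no match". *)
let greeting_option s = if s = "" then None else Some s
```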

2. Choice of regexp libraries

2.1 Diversity

Currently, support is provided for 2 regexp libraries: Str and PCRE. First, this shows that other libraries could be supported in the future with minimal effort, since a large part of the implementation of Micmatch is already shared between the 2 variants. Second, it shows that Micmatch is just a layer over existing regexp libraries.
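
The idea of one shared front end over several back ends can be sketched as follows. This is not Micmatch's actual implementation: the type re and the functions quote, to_str and to_pcre are all hypothetical. The sketch only assumes the well-known differences between the two string syntaxes (Str writes groups as \( \) and alternation as \|, while PCRE writes (?: ) and |).

```ocaml
(* Hypothetical sketch: a tiny regexp AST rendered for two back ends. *)
type re =
  | Lit of string        (* literal text, matched verbatim *)
  | Seq of re list       (* concatenation *)
  | Alt of re * re       (* alternation *)
  | Star of re           (* zero or more repetitions *)

(* Put a backslash before every character of [s] listed in [specials]. *)
let quote specials s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
       if String.contains specials c then Buffer.add_char b '\\';
       Buffer.add_char b c)
    s;
  Buffer.contents b

(* Str syntax: groups are \( \) and alternation is \| *)
let rec to_str = function
  | Lit s -> quote "$^.*+?[]\\" s
  | Seq l -> String.concat "" (List.map to_str l)
  | Alt (a, b) -> "\\(" ^ to_str a ^ "\\|" ^ to_str b ^ "\\)"
  | Star r -> "\\(" ^ to_str r ^ "\\)*"

(* PCRE syntax: non-capturing groups are (?: ) and alternation is | *)
let rec to_pcre = function
  | Lit s -> quote "\\^$.|?*+()[]{}" s
  | Seq l -> String.concat "" (List.map to_pcre l)
  | Alt (a, b) -> "(?:" ^ to_pcre a ^ "|" ^ to_pcre b ^ ")"
  | Star r -> "(?:" ^ to_pcre r ^ ")*"

(* One regexp value, two renderings. *)
let example = Alt (Lit "a.b", Star (Lit "c"))
```

Only the two rendering functions differ between back ends; the AST, and anything that builds or analyses it, would be shared.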

2.2 PCRE

PCRE is a popular library which provides many useful and documented features. An OCaml interface (PCRE-OCaml) already existed, which is why a variant of Micmatch that uses PCRE has been implemented and is now preferred over the Str variant.

3. Compatibility issues

3.1 Versions

Micmatch has been used with great satisfaction (at least) by its author, for reading various data files in bioinformatics (writing parsers for these files does not take more time than understanding their format).

Version 1.0 of Micmatch should therefore be released soon and should not differ much from the current version. Subsequent versions (1.x) will remain backward compatible with the specifications of the reference manual 1.0.

3.2 Cohabitation with other syntax extensions

Micmatch is implemented as a syntax extension of Objective Caml using Camlp4. It can be used with the regular syntax of OCaml and, in theory, also with the revised syntax, although this has not been tested.

In theory it should play well with other syntax extensions as long as they don't redefine the match ... with, try ... with and function ... -> constructs, since Micmatch redefines them itself.

In the worst case, if Micmatch turns out to be incompatible with another syntax extension, it is still possible to isolate the Micmatch-dependent code in a separate file and preprocess only this file with Micmatch.