Mikmatch Version 1.0.5
Reference Manual

Martin Jambon

October 21, 2011

This manual is available online as a single HTML file at
http://mjambon.com/mikmatch-manual.html
and as a PDF document at
http://mjambon.com/mikmatch-manual.pdf.
The home page of Mikmatch is:
http://mjambon.com/micmatch.html

Contents

1  Introduction

Mikmatch is an extension of the syntax of the Objective Caml programming language (OCaml). Its purpose it to make the use of regular expressions easier and more generally to provide a set of tools for using OCaml as a powerful scripting language. Mikmatch believes that regular expressions are just like any other program and deserve better than a cryptic sequence of symbols placed in a string of a master program.

Mikmatch currently supports two different libraries that implement regular expressions: Str which comes with the original distribution of OCaml and PCRE-OCaml which is an interface to PCRE (Perl Compatible Regular Expressions) for OCaml. These two flavors will be referred as Mikmatch_str and Mikmatch_pcre. They share a large number of syntaxic features, but Mikmatch_pcre provides several macros that cannot be implemented safely in Mikmatch_str. Therefore, it is recommended to use Mikmatch_pcre.

2  Language

2.1  Regular expressions

2.1.1  Grammar of the regular expressions

Regular expressions support the syntax of Ocamllex regular expressions as of version 3.08.1 of the Objective Caml system (http://caml.inria.fr/pub/docs/manual-ocaml/), and several additional features. A regular expression (regexp) is defined by the grammar that follows. The associativity rules are given by priority levels. 0 is the strongest priority.

2.1.2  Named regular expressions

Naming regular expressions is possible using the following toplevel construct:
RE ident = regexp
where ident is a lowercase identifier. Regular expressions share their own namespace.

For instance, we can define a phone number as a sequence of 3 digits followed by a dash and followed by 4 digits:

RE digit = ['0'-'9']
RE phone = digit{3} '-' digit{4}

2.1.3  Predefined sets of characters

The POSIX character classes (sets of characters) are available as predefined regular expressions of length 1. Their definition is given in table 1.


Table 1: POSIX character classes and their definition in the Mikmatch syntax
RE lower = ['a'-'z']
RE upper = ['A'-'Z']
RE alpha = lower | upper
RE digit = ['0'-'9']
RE alnum = alpha | digit
RE punct = ["!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"]
RE graph = alnum | punct
RE print = graph | ’ ’
RE blank = ' ' | '\t'
RE cntrl = ['\x00'-'\x1F' '\x7F']
RE xdigit = [digit 'a'-'f' 'A'-'F']
RE space = [blank "\n\x0B\x0C\r"]

2.1.4  More predefined patterns

Some named regexps are predefined and available in every implementation of Mikmatch. These are the following:


Table 2: Predefined regexps in Mikmatch
RE int = ["-+"]? ( "0" ( ["xX"] xdigit+
                       | ["oO"] ['0'-'7']+
                       | ["bB"] ["01"]+ )
                 | digit+ )

RE float = 
  ["-+"]? 
     ( ( digit+ ("." digit* )? | "." digit+ ) (["eE"] ["+-"]? digit+ )?
       | "nan"~ 
       | "inf"~ )

2.2  General pattern matching

2.2.1  Regexps and match/function/try constructs

In Mikmatch, regular expressions can be used to match strings instead of the regular patterns. In this case, the regular expression must be preceded by the RE keyword, or placed between slashes (//). Both notations are equivalent.

Only the following constructs support patterns that contain regular expressions:

Examples:

let is_num = function RE ['0'-'9']+ -> true | _ -> false

let get_option () =
  match Sys.argv with
      [| _ |] -> None 
    | [| _; RE (['a'-'z']+ as key) "=" (_* as data) |] -> Some (key, data)
    | _ -> failwith "Usage: myprog [key=value]"

let option =
  try get_option ()
  with Failure RE "usage"~ -> None

If alternatives are used in a pattern, then both alternatives must define the same set of identifiers. In the following example, the string code can either come from the normal pattern matching or be a fresh substring which was extracted using the regular expression:

match option, s with
    Some code, _
  | None, RE _* "=" (['A'-'Z']['0'-'9'] as code) -> print_endline code

  | _ -> ()

In the general case, it is not possible to check in advance if the pattern-matching cases are complete if at least one of the patterns is a regular expression. In this case, no warnings against missing cases are displayed, thus it is safer to either add a catch-all case like in the previous examples or to catch the Match_failure exception that can be raised unexpectedly.

2.2.2  Views (experimental feature)

Views are a general form of symbolic patterns other than those authorized by the concrete structure of data. For example, Positive could be a view for positive ints. View patterns can also bind variables and a useful example in OCaml is pattern-matching over lazy values.

Here we propose simple views, as suggested by Simon Peyton Jones for Haskell:
http://hackage.haskell.org/trac/ghc/wiki/ViewPatterns. We propose a different syntax, but note that the syntax that we have chosen here is experimental and may change slightly in future releases.

2.2.2.1  View patterns

A view pattern has one of these two forms:

  1. % view-name: a view without an argument. It is a simple check over the subject data.
  2. % view-name pattern: a view with an argument, the pattern. If the view function matches successfully, its result is matched against the given pattern.

where a view-name is a capitalized alphanumeric identifier, possibly preceded by a module path specification, e.g. Name or Module.Name.

2.2.2.2  Definition of a view

Views without arguments are defined as functions of type ’a -> bool, while views with arguments are defined as functions of type ’a -> ’b option.

The syntax for defining a view is:

Using the syntax above is however not strictly needed, since it just defines a function named after the name of the view, and prefixed by view_. For instance let view X = f can be written as let view_X = f in regular OCaml. Therefore, some library modules can export view definitions without using any syntax extension themselves.

2.2.2.3  Example
(* The type of lazy lists *)
type 'a lazy_list = Nil | Cons of ('a * 'a lazy_list lazy_t)

(* Definition of a view without argument for the empty list *)
let view Nil = 
    fun l ->
      try Lazy.force l = Nil
      with _ -> false

(* Independent definition of a view with an argument, 
   the head and tail of the list *)
let view Cons = 
    fun l -> 
      try
        match Lazy.force l with 
            Cons x -> Some x 
          | Nil -> None
      with _ -> None


(* Test *)
let _ =
  let l = lazy (Cons (1, lazy (Cons (2, lazy Nil)))) in
  match l with
      %Nil
    | %Cons (_, %Nil) -> assert false
    | %Cons (x1, %Cons (x2, %Nil)) ->
        assert (x1 = 1);
        assert (x2 = 2);
        Printf.printf "Passed view test\n%!"
    | _ -> assert false
2.2.2.4  Limitations

Each time a value is tested against a view pattern, the corresponding function is called. There is no optimization that would avoid calling the view function twice on the same argument.

Redundant or missing cases cannot be checked, just like when there is a regexp in a pattern. This is due both to our definition of views and to the implementation that we get using Camlp4.

2.3  Shortcut for one-case regexp matching

A shortcut notation can be used to extract substrings from a string that match a pattern which is known in advance:
let /regexp/ = expr in expr

Global declarations also support this shortcut:
let /regexp/ = expr
Example [toplevel support not available in Camlp4 3.10]:

# Sys.ocaml_version;;
- : string = "3.08.3"
# RE int = digit+;;
# let /(int as major : int) "." (int as minor : int) 
       ("." (int as patchlevel) | ("" as patchlevel))
       ("+" (_* as additional_info) | ("" as additional_info))/ = 
    Sys.ocaml_version
;;
val additional_info : string = ""
val major : int = 3
val minor : int = 8
val patchlevel : string = "3"

The notation does not allow simultaneous definitions using the and keyword nor recursive definitions using rec.

As usual, the Match_failure exception is raised if the string fails to match the pattern. The let-try-in-with construct described in the next section also supports regexp patterns, with the same restrictions.

2.4  The let-try-in-with construct

A general notation for catching exceptions that are raised during the definition of bindings is provided:
let try [rec] let-binding {and let-binding} in
expr
with pattern-matching

It has the same meaning as:
try let [rec] let-binding {and let-binding} in
expr
with pattern-matching
except that in the former case only the exceptions raised by the let-bindings are handled by the exception handler introduced by with.

2.5  Implementation-dependent features

These features depend on which library is actually used internally for manipulating regular expressions. Currently two libraries are supported: the Str library from the official OCaml distribution and the PCRE-OCaml library. Support for other libraries might be added in the future.

2.5.1  Backreferences

Previously matched substrings can be matched again using backreferences. !ident is a backreference to the named group ident that is defined previously in the sequence. During the matching process, it is not possible that a backreference refers to a named group which is not matched. In the following example, we extract the repeated pattern abc from abcabc [toplevel support not available in Camlp4 3.10]:

# match "abcabc" with RE _* as x !x -> x;;
- : string = "abc"

2.5.2  Specificities of Mikmatch_str

Backreferences as described previously (section 2.5.1) are supported.

In addition to the POSIX character classes, a set of predefined patterns is available:

2.5.3  Specificities of Mikmatch_pcre

This is currently the version which is used by the mikmatch command.

2.5.3.1  Matching order

Alternatives (regexp1|regexp2) are tried from left to right.

The quantifiers (*, +, ? and {}) are greedy except if specified otherwise (see next paragraph). The regular expressions are matched from left to right, and the repeated patterns are matched as many times as possible before trying to match the rest of the regular expression and either succeed or give up one repetition before retrying (backtracking).

2.5.3.2  Greediness and laziness

Normally, quantifiers (*, +, ? and {}) are greedy, i.e. they perform the longest match in terms of number of repetitions before matching the rest of the regular expression or backtracking. The opposite behavior is laziness: in that case, the number of repetitions is made minimal before trying to match the rest of the regular expression and either succeed or continue with one more repetition.

The lazy behavior is turned on by placing the keyword Lazy after the quantifier. This is the equivalent of Perl’s quantifiers *?, +?, ?? and {}?. For instance, compare the following behaviors [toplevel support not available in Camlp4 3.10]:

# match "<hello><world>" with RE "<" (_* as contents) ">" -> contents;;
- : string = "hello><world"
# match "<hello><world>" with RE "<" (_* Lazy as contents) ">" -> contents;;
- : string = "hello"
2.5.3.3  Possessiveness or atomic grouping

Sometimes it can be useful to prevent backtracking. This is achieved by placing the Possessive keyword after a given group. For instance, compare the following [toplevel support not available in Camlp4 3.10]:

# match "abc" with RE _* _ -> true | _ -> false;;  
- : bool = true
# match "abc" with RE _* Possessive _ -> true | _ -> false;;
- : bool = false

This operator has the strongest associativity priority (0), just like the quantifiers.

2.5.3.4  Backreferences

Backreferences are supported as described in section 2.5.1.

2.5.3.5  Predefined patterns

The following predefined patterns are available in addition to the POSIX character classes:

2.5.3.6  Lookaround assertions

A lookaround assertion is a pattern that has to be matched but doesn’t consume characters in the string being matched.

Lookahead assertions are checked after the current position in the string, and lookbehind assertions are matched before the current point. The general syntax for an assertion is the following:
< lookbehind . lookahead >
< lookahead >
The central dot symbolizes the current position. The lookbehind assertion is a test over the characters at the left of the current point, while the lookahead is a test over the characters at the right of the current point in the string.

lookbehind or lookahead are either empty or a regular expression, optionally preceded by Not. An assertion starting with Not is called negative and means that the given regular expression can not match here.

There are no restrictions on the contents of lookahead regular expressions. Lookbehind regular expressions are restricted to those that match substrings of length that can be predetermined. Besides this, backreferences are not supported in lookbehind expressions.

2.5.3.7  Macros

This implementation provides a set of macros that follow this syntax:
MACRO-NAME regexp -> expr
where expr is the expression that will be computed every time the pattern given by regexp is matched.

Only the SPLIT and FILTER macros follows a simplified syntax:
MACRO-NAME regexp

These constructs build a function which accepts some optional arguments and the string to match. For instance,
(REPLACE "," -> ";") "a,b,c"
returns "a;b;c" whereas
(REPLACE "," -> ";") ~pos:2 "a,b,c"
returns "a,b;c"

The possible options are the following:

MATCH regexp -> expr
tries to match the pattern regexp at the beginning of the string or at the given position pos and returns expr or raise Not_found. Options: pos (0), share (false). When pos and share are not specified, it is equivalent to:

function 
    RE regexp -> expr
  | _ -> raise Not_found

REPLACE regexp -> expr
returns a string in which every occurrence of the pattern is replaced by expr. Options: pos (0).

REPLACE_FIRST regexp -> expr
returns a string in which the first occurrence of the pattern is replaced by expr. A copy of the input string is returned if the pattern is not found. Options: pos (0).

SEARCH regexp -> expr
simply evaluates expr every time the pattern is matched. Options: pos (0).

SEARCH_FIRST regexp -> expr
simply evaluates expr the first time the pattern is matched and returns the result. Exception Not_Found is raised if the pattern is not matched. Options: pos (0), share (false).

COLLECT regexp -> expr
evaluates expr every time the pattern is matched and puts the result into a list. Options: pos (0).

COLLECTOBJ regexp
like COLLECT, but the elements of the returned list are automatically objects with methods that correspond to the subgroups captured with as. Options: pos (0).

SPLIT regexp
splits the given string using regexp as a delimiter. Options: pos (0), full (false).

FILTER regexp
creates a predicate that returns true is the given string matches regexp or false otherwise. Options: pos (0), share (false).

CAPTURE regexp
returns Some o where o is an object with methods that correspond to the captured subgroups, or None if the subject string doesn’t match regexp. Options: pos (0), share (false).

MAP regexp -> expr
splits the given string into fragments: the fragments that do not match the pattern are returned as ‘Text s where s is a string. Fragments that match the pattern are replaced by the result of expr, which has to be a polymorphic variant. Options: pos (0), full (true). For instance,
(MAP ',' -> `Sep) "a,b,c,"
returns the list
[`Text "a"; `Sep; `Text "b"; `Sep; `Text "c"; `Sep; `Text ""]
whereas
(MAP ',' -> `Sep) ~full:false "a,b,c,"
returns only
[`Text "a"; `Sep; `Text "b"; `Sep; `Text "c"; `Sep]

3  Tools

3.1  Micmatch, Mikmatch, old Camlp4, new Camlp4, Camlp5

Camlp4/Camlp5 is the set of tools that allows to build and use syntax extensions of OCaml. We distinguish 3 major variants of Camlp4:

Micmatch is the name of the original implementation of Mikmatch for the old Camlp4:

3.2  The toplevel

[toplevel support not available in Camlp4 3.10]

3.3  The libraries for the preprocessor

3.3.1  Mikmatch_str

The preprocessing library pa_mikmatch_str.cma must be loaded by the preprocessor (camlp4o or camlp4r).

It is safe to use Mikmatch_str in multithreaded programs without locks only if the patterns do not contain the @ keyword because it uses a shared cache of compiled regexps.

3.3.2  Mikmatch_pcre

The preprocessing library pa_mikmatch_pcre.cma must be loaded by the preprocessor (camlp4o or camlp4r).

It is safe to use Mikmatch_str in multithreaded programs without locks only if the patterns do not contain the @ keyword because it uses a shared cache of compiled regexps.

3.4  The runtime libraries

Both variants depend on portable features of the Unix library. The executables must therefore be linked against unix.cma (bytecode) or unix.cmxa (native code) in addition to the specific libraries mentioned below.

3.4.1  Mikmatch_str

In addition to the backend for the regular expressions engine (str.cma for bytecode or str.cmxa for native code), the OCaml code which is produced by the preprocessor needs to be linked against either run_mikmatch_str.cma (bytecode), run_mikmatch_str.cmxa (native code), run_mikmatch_str_mt.cma (bytecode, threads) or run_mikmatch_str_mt.cmxa (native code, threads).

3.4.2  Mikmatch_pcre

In addition to the backend for the regular expressions engine (pcre.cma for bytecode or pcre.cmxa for native code), the OCaml code which is produced by the preprocessor needs to be linked against either run_mikmatch_pcre.cma (bytecode), run_mikmatch_pcre.cmxa (native code). Multithreaded programs are supported as well and do not require a specific library.

4  A small text-oriented library

Module Mikmatch


This document was translated from LATEX by HEVEA.