This is a collection of notes concerning debatable aspects of the design of the Micmatch system. Use the wiki if you want to share your views.
The general syntax for regular expressions is based on what is already in use in ocamllex. Regular expressions are often not so simple, and representing them as a compact string which is full of backslashes (\) is definitely not user-friendly. With our syntax, the different tokens of the regexp can be highlighted by the text editor, and there is no need to know a list of "special characters". Special characters are simply not mixed with characters that we want to match.
I would like to keep the current uppercase convention for all the newly created alphabetic keywords because it makes it clear that they are part of a syntax extension, not regular identifiers of the core OCaml language. This should make things easier for people who don't know by heart all the keywords of either OCaml or Micmatch.
match "abc def" with
    RE graph+ as x -> x
  | _ -> "???"
In the previous example, it is easy to notice that RE is a new keyword which introduces a section of code with a special syntax. The keyword as uses lowercase characters since it already exists as a keyword in regular OCaml patterns and in ocamllex regexps with an analogous meaning.
However, the sandwich form for regexp patterns sometimes looks nicer than the RE notation. It has been supported since version 0.689 and is strictly equivalent:
match "abc def" with
    / graph+ as x / -> x
  | _ -> "???"
Another syntax in the same style as stream parsers could have been chosen:
match "abc def" with
    [/ graph+ as x /] -> x
  | _ -> "???"
Advantages of / ... / : easier to type; lighter for short regexps.
Advantages of [/ ... /] : easier to read, especially with a text editor which detects matching brackets (but we can always add parentheses to make things clear); avoids the use of parentheses in some rare cases; more consistent with the existing OCaml syntax and other syntax extensions.
Question: what do people prefer? If you have a preference, please let me know. We might switch to the [/ ... /] notation but we can't keep both because putting a space between the [ and the / would have a different meaning, which would be disastrous for code readability.
The various "macros" which are available in the PCRE variant of Micmatch all use uppercase characters. This may not look very elegant for syntactic constructs which are much more subtle than C macros, but once again it avoids confusion for new users and is unlikely to interfere with other library functions:
# let split_list = SPLIT space* "," space* (* SPLIT keyword *) ;;
val split_list : ?full:bool -> ?pos:int -> string -> string list = <fun>
# split_list "a, b, cde";;
- : string list = ["a"; "b"; "cde"]
# Pcre.split (* good, lowercase split is not a keyword *);;
- : ?iflags:Pcre.irflag ->
    ?flags:Pcre.rflag list ->
    ?rex:Pcre.regexp ->
    ?pat:string ->
    ?pos:int -> ?max:int -> ?callout:Pcre.callout -> string -> string list
= <fun>
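To give a rough idea of what the split_list function above computes, here is a minimal sketch in plain OCaml, without Micmatch or PCRE. The name split_list_approx is hypothetical, and the function is only an approximation: it agrees with the session above on the input "a, b, cde", but it may differ on edge cases (for example on empty fields, or on whitespace at the very beginning or end of the string), and it takes none of the optional arguments.

```ocaml
(* Plain-OCaml approximation of SPLIT space* "," space*.
   split_list_approx is a hypothetical name, not part of Micmatch. *)
let split_list_approx (s : string) : string list =
  (* Cut the string at every comma... *)
  String.split_on_char ',' s
  (* ...then drop the whitespace surrounding each field. *)
  |> List.map String.trim

let () =
  (* Prints a, b and cde, one per line. *)
  List.iter print_endline (split_list_approx "a, b, cde")
```

The Micmatch version remains preferable when the separator is a real regular expression, since the separator is then written once, in the regexp syntax, rather than encoded in string-handling code.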
Finally, the Lazy and Possessive annotations that can be found in regexps play the role of keywords within regexps, but technically they are not keywords since you can use them normally outside of regexps. So the Lazy module of the OCaml standard library can still be used without any problem.
In ocamllex patterns, if a binding does not appear on each side of an alternative (|), then the identifier is associated with a value of type string option. In the following, the identifier greeting can have 3 possible values: Some "Hello", Some "hello" or None:
['H''h']"ello" as greeting | "" { greeting }
In Micmatch, this is an invalid pattern since each binding must occur on both sides of an alternative. This follows the behavior of the regular pattern-matching of OCaml. It is not a real problem since in this case the empty string "" is equivalent to None (matching the empty string always succeeds):
match read_line () with
    RE ['H''h']"ello" as greeting | ("" as greeting) -> greeting
This would preferably be written as:
match read_line () with
    RE (['H''h']"ello" | "") as greeting -> greeting
Or just:
match read_line () with
    RE (['H''h']"ello")? as greeting -> greeting
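The claim that "" carries the same information as None can be checked with a small sketch in plain OCaml. All names below are hypothetical, and the matching is hand-written rather than done by ocamllex or Micmatch; it assumes that the pattern is tried at the beginning of the input only.

```ocaml
(* Hypothetical helper: does s start with "Hello" or "hello"? *)
let starts_with_greeting (s : string) : bool =
  String.length s >= 5
  && (s.[0] = 'H' || s.[0] = 'h')
  && String.sub s 1 4 = "ello"

(* ocamllex-style result: the binding has type string option. *)
let greeting_option (s : string) : string option =
  if starts_with_greeting s then Some (String.sub s 0 5) else None

(* Micmatch-style result: the binding is always a string,
   and "" stands for the absence of a greeting. *)
let greeting_string (s : string) : string =
  match greeting_option s with
  | Some g -> g
  | None -> ""
```

Since the non-empty branch can never match the empty string, going from one representation to the other loses nothing: greeting_string returns "" exactly when greeting_option returns None.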
Currently, support is provided for 2 regexp libraries: Str and PCRE. First, this shows that other libraries could be supported in the future with minimal effort, since a large part of the implementation of Micmatch is already shared between the 2 variants. Second, it shows that Micmatch is just a layer over existing regexp libraries.
PCRE is a popular library which provides many useful and documented features. There was an existing interface for OCaml (PCRE-OCaml), which is why a variant of Micmatch which uses PCRE has been implemented and is now preferred over the other one.
Micmatch has been used with great satisfaction (at least) by its author, for reading various data files in bioinformatics (writing parsers for these files does not take more time than understanding their format).
Version 1.0 of Micmatch should therefore be released soon, and should not be much different from the current version. Subsequent versions (1.x) will remain backward compatible with the specifications of the reference manual 1.0.
Micmatch is implemented as a syntax extension of Objective Caml using the Camlp4 technology. It can be used with the regular syntax of OCaml and in theory also with the revised syntax although it has not been tested.
In theory it should play well with other syntax extensions as long as they don't redefine the match ... with, try ... with and function ... -> constructs since they are overwritten by Micmatch.
In the worst case, if Micmatch turns out to be incompatible with another syntax extension, it is still possible to isolate the Micmatch-dependent code in a separate file and preprocess only this file with Micmatch.
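The isolation strategy can be sketched as follows. The nested module stands in for the separate file, and its name (Regexp_part), its function (first_word) and its body are hypothetical. The body is written in plain OCaml so that the sketch compiles without any preprocessor; in a real project it would contain the match ... with RE ... expressions. The point is only that the interface mentions ordinary OCaml types, so the rest of the program never sees the extended syntax.

```ocaml
(* Stand-in for the one file that would be preprocessed with Micmatch.
   Its signature exposes plain OCaml types only. *)
module Regexp_part : sig
  val first_word : string -> string
end = struct
  (* Plain-OCaml placeholder for code that would use regexp patterns:
     return everything before the first space, or the whole string. *)
  let first_word s =
    match String.index_opt s ' ' with
    | Some i -> String.sub s 0 i
    | None -> s
end

(* Stand-in for the rest of the program, which could be compiled without
   Micmatch, possibly with another syntax extension. *)
let () = print_endline (Regexp_part.first_word "abc def")
```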