Ocamllex Tutorial
The patterns in the input are written using regular expressions in the style of lex, with a more Caml-like syntax. These are:
'c'
match the character 'c'. Character constants use the same syntax as Objective Caml character constants.
_
(underscore) match any character.
eof
match an end-of-file.
"foo"
match the literal string "foo". The syntax is the same as for Objective Caml string constants.
['x' 'y' 'z']
character set; in this case, the pattern matches either an 'x', a 'y', or a 'z'.
['a' 'b' 'j'-'o' 'Z']
character set with a range in it. A range 'c1'-'c2' denotes all characters between c1 and c2, inclusive; in this case, the pattern matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'.
[^ 'A'-'Z']
a "negated character set", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.
[^ 'A'-'Z' '\n']
any character EXCEPT an uppercase letter or a newline.
r*
zero or more r's, where r is any regular expression
r+
one or more r's, where r is any regular expression
r?
zero or one r's, where r is any regular expression (that is, "an optional r")
ident
the expansion of the "ident" defined by an earlier let ident = regexp definition.
(r)
match an r; parentheses are used to override precedence (see below)
rs
the regular expression r followed by the regular expression s; called "concatenation"
r|s
either an r or an s
r#s
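match the difference of the two specified character sets, that is, any character that is in r but not in s. Both r and s must be character sets (or single characters).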
r as ident
bind the string matched by r to identifier ident.
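As a rough illustration of how several of these patterns combine, here is a minimal sketch of a lexer specification. The token type, the rule name token, and the definitions digit, nonzero, and letter are invented for this example and are not part of the tutorial.

```ocaml
(* tokens.mll: hypothetical lexer specification, to be processed by ocamllex *)
{
  (* Token type invented for this sketch. *)
  type token =
    | INT of int
    | IDENT of string
    | EOF
}

(* Named regular expressions, usable by name in later patterns. *)
let digit   = ['0'-'9']
(* Character-set difference: every digit other than zero. *)
let nonzero = ['0'-'9'] # '0'
let letter  = ['a'-'z' 'A'-'Z']

rule token = parse
  (* One or more blanks: skip them and keep lexing. *)
  | [' ' '\t' '\n']+                      { token lexbuf }
  (* Optional minus sign, then a number without leading zeros. *)
  | '-'? ('0' | nonzero digit*) as n      { INT (int_of_string n) }
  (* A letter followed by any number of letters, digits or underscores. *)
  | letter (letter | digit | '_')* as id  { IDENT id }
  | eof                                   { EOF }
```

Because as has the lowest precedence, n and id are bound to the whole text matched by their patterns. Input that matches none of the cases makes the generated lexer raise an exception; a real lexer would normally add a catch-all case using _.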
The regular expressions listed above are grouped roughly according to precedence, from highest at the top to lowest at the bottom, with the exception of '#', which binds most tightly. '#' has the highest precedence, followed by '*', '+' and '?', then concatenation, then '|', and then 'as'. For example,
"foo" | "bar"*
is the same as
("foo")|("bar"*)
since the '*' operator has higher precedence than alternation ('|'). This pattern therefore matches either the string "foo" or zero or more repetitions of the string "bar".
To match zero or more repetitions of either "foo" or "bar", parenthesize the alternation:
("foo"|"bar")*
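The difference between the two groupings can be observed with a small experiment. The following is only a sketch; the rule names foo_or_bars and foos_and_bars and the helper show are invented for the example.

```ocaml
(* precedence.mll: hypothetical example, to be processed by ocamllex *)

(* Parsed as ("foo") | ("bar"*): one "foo", or a run of "bar". *)
rule foo_or_bars = parse
  "foo" | "bar"* as s      { s }

(* Parenthesized: any sequence mixing "foo" and "bar". *)
and foos_and_bars = parse
  ("foo" | "bar")* as s    { s }

{
  let () =
    (* Run a rule on a fresh lexbuf and print the text it matched. *)
    let show rule input = print_endline (rule (Lexing.from_string input)) in
    show foo_or_bars "foobar";     (* matches only the leading "foo" *)
    show foos_and_bars "foobar"    (* matches all of "foobar" *)
}
```

On the input foobar, the first rule can only take the "foo" alternative (the "bar"* alternative matches the empty string there), while the second rule consumes both words.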
A negated character set such as the example "[^ 'A'-'Z']" above will match a newline unless '\n' (or an equivalent escape sequence) is one of the characters explicitly present in the negated character set (e.g., "[^ 'A'-'Z' '\n']"). This is unlike how many other regular expression tools treat negated character sets, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like [^ '"']* runs across line boundaries up to the next quote, and can match the entire rest of the input if there is no other quote.
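A quoted-string rule makes the consequence concrete. This is a minimal sketch under the assumption that strings have no escape sequences; the rule names quoted and one_line are invented for the example.

```ocaml
(* strings.mll: hypothetical example, to be processed by ocamllex *)

rule quoted = parse
  (* The negated set also matches newlines, so the body may span lines. *)
  | '"' ([^ '"']* as body) '"'        { Some body }
  | _                                 { None }
  | eof                               { None }

and one_line = parse
  (* Excluding the newline explicitly keeps the literal on one line. *)
  | '"' ([^ '"' '\n']* as body) '"'   { Some body }
  | _                                 { None }
  | eof                               { None }

{
  let () =
    (* A quoted literal with a newline in the middle. *)
    let input = "\"one\ntwo\"" in
    assert (quoted (Lexing.from_string input) = Some "one\ntwo");
    assert (one_line (Lexing.from_string input) = None)
}
```

The first rule accepts the two-line literal, while the second rejects it because the newline is excluded from the negated set.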