Wildcards and Regular Expressions

Wildcards and regular expressions provide flexible ways to search for text, because the search string itself can contain special characters that control what it matches.

 

Terminology

To explain these strings, a few terms must be defined.

Term Meaning
Control String The string from the file or RAM that will be tested for a match.
Test String The string entered by the user that will be used in the search to find all matches. The wildcard and regular-expression strings are test strings.
Identifier Used in regular expressions, these are simply whole words that are composed only of alpha characters A to Z and a to z.

 

Wildcards

Wildcard strings are similar to normal strings but with two characters that have special meaning.

  • ? — This causes any single character from the control string to be accepted. For example, the wildcard string B?b will cause matches with Bob, Brb, and Bbb, but not Bbab, because only one character is used to match with the ?.
  • * — Matches any number of characters until the next character in the wildcard string is found in the control string. For example, the wildcard string B*b will cause matches with Bob, Brb, Blab, Blaster-nab, etc. If the * character is used at the end of the wildcard string, the number of characters matched depends on the settings for wildcard match lengths.
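
The behavior of ? and * can be tried outside MHS with a minimal Python sketch. This is only an approximation: it uses Python's standard fnmatch module rather than the MHS wildcard matcher, it tests the whole control string instead of searching inside a larger buffer, and the helper name wildcard_match is hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_match(control: str, pattern: str) -> bool:
    """Return True if the entire control string matches the wildcard pattern.

    Assumption: Python's fnmatch is used as a stand-in for the MHS matcher.
    fnmatchcase is case-sensitive and anchors the pattern to the whole string.
    """
    return fnmatchcase(control, pattern)


# "?" accepts exactly one character.
for control in ["Bob", "Brb", "Bbb", "Bbab"]:
    print(control, wildcard_match(control, "B?b"))

# "*" accepts any number of characters.
for control in ["Bob", "Blab", "Blaster-nab", "Bla"]:
    print(control, wildcard_match(control, "B*b"))
```

Note that fnmatch also gives brackets a special meaning, which the wildcard strings described here do not, so the sketch should only be used with ? and *.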

 

Regular Expressions

Regular expressions are a widely recognized way of describing string patterns. MHS parses regular expressions using a deterministic finite automaton. This means the regular expressions are not parsed on the fly, as they are with the Windows® API, but are instead compiled to an internal form that makes scanning exceptionally fast. To make regular expressions more intuitive, their elements are explained first, followed by examples that use those elements.

Elements of regular expressions are sets of characters with special meaning. Regular-expression strings are strings made of regular-expression elements. The following chart explains the most basic elements of regular expressions.

Element Description Example
Individual Characters

Individual characters can be elements of a regular expression.

Some individual characters have special meaning to the regular-expression parser, so to use these characters without using their special meanings, put a \ in front of them.

The characters with special meanings are:

  • *
  • +
  • ?
  • .
  • |
  • [
  • ]
  • (
  • )
  • -
  • $
  • ^
  • (space)

To use the space as an individual character, use “\ ”.

To use non-printable characters, either use the correct escape sequence from C (\n, \t, \r, or \v), or use \xhh where h is a hexadecimal digit.

  • “H” — In the string “Here I am!”, the H at the beginning of the string would be matched.
  • “you” — Three characters. In the string “Where are you going?”, the you would be matched.
  • “\$50\.00” — $ and . are special characters, so to use them as individual characters we have placed a \ in front of them. This translates into $50.00 during the scan, which would find a match in the string “You want to donate $50.00 to L. Spiro for his software, don’t you?”
  • To find the name “L. Spiro”, we would use the regular expression “L\.\ Spiro”. The period and space both have a \ in front of them.
  • We could also find “L. Spiro” by using the regular expression “L\.\x20Spiro”, using the \x20 to specify a space.
Character Sets in Brackets

Characters inside brackets are used to indicate that any of the characters in the brackets can cause a match at that location in the test string.

Using the hyphen, you can indicate a range of characters that can be valid matches.

Using the ^ special character, you can indicate that the inverse set of characters is matched, instead of the characters in the brackets.

  • “[_A9]” — Indicates that either an _, an A, or a 9 will cause a match.
  • “[A-Z]” — Indicates that any character from A to Z (inclusive and case-sensitive) will cause a match.
  • “[_A-Za-z]” — Indicates that an _, any character from A to Z, or any character from a to z will cause a match.
  • “[^A-Za-z]” — Any character except characters from A to Z and from a to z will cause a match. The ^ special character must be used immediately after the [.
  • “[\^A-Z]” — The \ causes the ^ to be treated as an individual character rather than a character with special meaning. So the match will occur on ^ and on any character from A to Z.
Dot (.) Matches any character. Same meaning as ? in wildcard strings.
  • “B.b” — Matches Bbb, Bob, etc.
Identifiers Prefixed With $ Indicates a regular-expression macro defined by the user or the system. MHS allows you to define macros for regular expressions so that you do not have to type the entire regular expression each time you want to use it.
Parentheses As in math, parentheses group their contents so that the contents are treated as a single element.
  • “(ui)” — Two individual characters enclosed in parentheses. Same as just typing “ui” in this case.
  • Any of the examples above could be enclosed in parentheses.
  • These are usually used with operators below to indicate the operator works on the entire set of elements inside the parentheses.
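
The elements in the chart above can be tried with a short Python sketch. It uses Python's re module, not the MHS parser, so treat it as an approximation: the escaping, bracket sets, dot, and parentheses shown here behave the same way in Python, but one difference worth noting is that an unescaped space is an ordinary character in Python rather than a special one. All variable names are hypothetical.

```python
import re

text = "You want to donate $50.00 to L. Spiro for his software, don't you?"

# Escaped special characters are matched literally.
price = re.search(r"\$50\.00", text)
name_escaped_space = re.search(r"L\.\ Spiro", text)
name_hex_space = re.search(r"L\.\x20Spiro", text)

# Character sets, inverted sets, and an escaped ^ inside a set.
any_of_three = re.findall(r"[_A9]", "a_A9z")
non_letters = re.findall(r"[^A-Za-z]", "ab1_C")
caret_or_upper = re.findall(r"[\^A-Z]", "a^Bc")

# The dot matches any single character (in Python, any except a newline).
dot_matches = re.findall(r"B.b", "Bbb Bob Bab")

# Parentheses group elements; "(ui)" matches the same text as "ui".
grouped = re.fullmatch(r"(ui)", "ui")

print(price.group(), name_escaped_space.group(), name_hex_space.group())
print(any_of_three, non_letters, caret_or_upper, dot_matches)
print(grouped is not None)
```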

Regular-expression operators are used to indicate an action to be used for matching.

Operator Description Example
* Repetition 0 or more times.
  • “[0-9]*” — Matches 9, 90, 7897938796, and even an empty string.
+ Repetition 1 or more times.
  • “[0-9]+” — Matches 9, 90, and 7897938796, but not an empty string.
? At most one occurrence, or 0 or 1 repetitions.
  • “Number[0-9]?” — Matches Number0, Number8, and Number, but not Number78, etc.
\i Ignore case.
  • “(L\.\ Spiro)\i” — Matches L. Spiro in a case-insensitive way.
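
The operators above can be checked with a small Python sketch. It assumes Python's re module as a stand-in for the MHS parser. The \i suffix is not part of Python's syntax, so the re.IGNORECASE flag is used in its place, and the helper name whole is hypothetical.

```python
import re

digits_star = re.compile(r"[0-9]*")        # 0 or more digits
digits_plus = re.compile(r"[0-9]+")        # 1 or more digits
optional_digit = re.compile(r"Number[0-9]?")  # at most one trailing digit
# Python has no \i suffix; the IGNORECASE flag plays the same role here.
name_any_case = re.compile(r"L\.\ Spiro", re.IGNORECASE)


def whole(pattern: re.Pattern, control: str) -> bool:
    """Return True if the pattern matches the entire control string."""
    return pattern.fullmatch(control) is not None


print(whole(digits_star, ""), whole(digits_plus, ""))
print(whole(optional_digit, "Number8"), whole(optional_digit, "Number78"))
print(whole(name_any_case, "l. spiro"))
```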

All of the above can be combined to form full regular expression strings.

  • [_A-Za-z][_A-Za-z0-9]* Matches a valid C/C++/Java identifier. The first character can be any alpha character or an underscore, and can be followed by 0 or more alpha-numeric characters or underscores.
  • 0[0-7]* Matches an octal number.
  • 0[xX][0-9A-Fa-f]+ Matches a hexadecimal number, including the 0x (or 0X) prefix.

Finally, on the highest level, regular expressions can be separated by a |. The expressions on the left and right sides of the | are alternatives—either of them can cause a match in the control string.

  • (0[0-7]*)|(0[xX][0-9A-Fa-f]+) Matches either an octal number or a hexadecimal number.
  • ((L\.\ Spiro)\i)|((I\.\ Omega)\i) Matches either L. Spiro or I. Omega, both in case-insensitive ways.
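
As a rough illustration of combining elements and alternatives, the following Python sketch reuses the expressions above with Python's re module. It is not the MHS implementation: the function name classify is hypothetical, and re.IGNORECASE again replaces the \i operator.

```python
import re

IDENTIFIER = r"[_A-Za-z][_A-Za-z0-9]*"
OCTAL = r"0[0-7]*"
HEX = r"0[xX][0-9A-Fa-f]+"

# Alternatives separated by |, each wrapped in parentheses as in the text.
OCTAL_OR_HEX = re.compile(f"({OCTAL})|({HEX})")
EITHER_NAME = re.compile(r"(L\.\ Spiro)|(I\.\ Omega)", re.IGNORECASE)


def classify(token: str) -> str:
    """Name the first pattern that matches the whole token."""
    if re.fullmatch(HEX, token):
        return "hexadecimal"
    if re.fullmatch(OCTAL, token):
        return "octal"
    if re.fullmatch(IDENTIFIER, token):
        return "identifier"
    return "no match"


for token in ["0x1F", "017", "_count1", "19"]:
    print(token, classify(token), OCTAL_OR_HEX.fullmatch(token) is not None)

print(EITHER_NAME.search("written by i. omega").group())
```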

 

Macros

Macros make life easier by shortening what you have to type. We can associate the regular expression “[_A-Za-z][_A-Za-z0-9]*” with the macro “$Identifier”, and from then on we only need to type “$Identifier” instead of “[_A-Za-z][_A-Za-z0-9]*”. Macros can also be used to make long regular expressions easier to understand. Consider the following macros.

  • $Int = [0-9]+
  • $Frac = \.[0-9]+
  • $Exp = ([Ee](\+|-)?[0-9]+)

Now we can make a regular expression for a floating-point number by using the following:

  • ($Int? $Frac $Exp? | $Int \. $Exp? | $Int $Exp)[fFlL]?

Simple, huh?

Well, just start small and work your way up as you understand more of how regular expressions work.
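
The macro definitions and the floating-point expression above can be approximated in Python with a small sketch. Several assumptions are made here: the MACROS table and the expand function are hypothetical, each macro is substituted as a single grouped unit (how MHS expands macros internally is not stated), and Python's re.VERBOSE flag is used so that unescaped spaces in the expression are ignored.

```python
import re

# Hypothetical macro table mirroring $Int, $Frac, and $Exp.
MACROS = {
    "Int": r"[0-9]+",
    "Frac": r"\.[0-9]+",
    "Exp": r"([Ee](\+|-)?[0-9]+)",
}


def expand(pattern: str) -> str:
    """Replace each unescaped $Name with its definition as one group."""
    return re.sub(
        r"(?<!\\)\$([A-Za-z]+)",
        lambda m: "(?:" + MACROS[m.group(1)] + ")",
        pattern,
    )


# re.VERBOSE makes Python ignore the unescaped spaces in the expression.
FLOAT = re.compile(
    expand(r"($Int? $Frac $Exp? | $Int \. $Exp? | $Int $Exp)[fFlL]?"),
    re.VERBOSE,
)

for token in ["3.14", ".5f", "10.", "1e10", "1.5e-3L", "42"]:
    print(token, FLOAT.fullmatch(token) is not None)
```

In this sketch a plain integer such as 42 is not accepted, because every alternative requires a fraction, a decimal point, or an exponent.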

The regular-expression parser was written by Martin Holzherr.
Copyright © 2006 Shawn (L. Spiro) Wilcoxen