Regular Expressions Find and Replace in the Scripting Editor

The following topic is based on the public domain document located at http://www.scintilla.org/SciTERegEx.html

Purpose

Regular Expressions (RegEx) can be used for searching for patterns rather than literals.

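For example, variables in property files that look like $(name.subname) can be found with the following Regular Expression, written in the old Unix style (non-posix mode), where unescaped parentheses are literal characters:

\$([a-z.]+)

In posix mode, the parentheses must be escaped to remain literal:

\$\([a-z.]+\)

This Regular Expression can be read like this: Match a dollar sign, an opening parenthesis, one or more lowercase letters or dots, and a closing parenthesis.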
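As a rough illustration outside the Editor, the same search can be tried with Python's re module. This is only a sketch: Python's engine is not the Editor's engine, its syntax treats parentheses the way posix mode does (so the literal parentheses are escaped), and the sample property text is made up.

```python
import re

# Hypothetical property-file text containing $(name.subname) variables.
text = "command.go=$(py.exe) $(FileNameExt) in $(file.dir)"

# \$ , \( and \) are literals; [a-z.]+ is one or more lowercase letters or dots.
variables = re.findall(r"\$\([a-z.]+\)", text)
print(variables)
```

$(FileNameExt) is not reported because the set [a-z.] contains no uppercase letters.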

Replacement with regular expressions allows for complex transformations with the use of tagged expressions.

For example, pairs of numbers separated by a comma could be reordered by replacing the posix-mode regular expression

([0-9]+),([0-9]+)

with:

\2,\1

The first RegEx means: Match one or more digits between 0 and 9 before the comma and one or more digits between 0 and 9 after the comma.

In non-posix mode, the same would be written as

\([0-9]+\),\([0-9]+\)

The replacement expression \2,\1 means: Take the second tagged match and put it before the comma, then place the first tagged match after the comma.

The first posix-mode RegEx could also be written as

(\d+),(\d+)

which likewise means one or more digits before the comma and one or more digits after the comma.
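The reordering of number pairs can be sketched with Python's re.sub, whose group and \1, \2 syntax corresponds to the posix-mode form. The input line is an invented example, and Python's engine stands in for the Editor's only as an approximation.

```python
import re

# Invented sample line with two comma-separated pairs of numbers.
line = "10,20 and 3,4"

# Group 1 is the number before the comma, group 2 the number after it.
swapped = re.sub(r"([0-9]+),([0-9]+)", r"\2,\1", line)

# The \d form of the pattern gives the same result for this input.
swapped_d = re.sub(r"(\d+),(\d+)", r"\2,\1", line)

print(swapped)
```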

Syntax Styles - Posix vs. Old Unix Style

Regular expression syntax depends on the parameter

find.replace.regexp.posix

If set to 0 (default), the Regular Expression syntax uses the old Unix style where \( and \) mark capturing sections while ( and ) represent themselves.

If set to 1, the RegEx syntax uses the more common style where opening and closing parentheses ( and ) mark capturing sections while \( and \) are plain parentheses.
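The difference between the two styles is only which form of parenthesis carries the backslash. A minimal sketch of a converter follows; the function name swap_paren_style is hypothetical and not part of the Editor, and the sketch does not treat parentheses inside [ ] sets specially.

```python
def swap_paren_style(pattern: str) -> str:
    """Turn \\( \\) into ( ) and ( ) into \\( \\), leaving all else unchanged.

    Converts old Unix style to posix style and back again (hypothetical helper).
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # An escaped parenthesis loses its backslash; other escapes are kept.
            out.append(nxt if nxt in "()" else ch + nxt)
            i += 2
        elif ch in "()":
            # A bare parenthesis gains a backslash.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(swap_paren_style(r"\([0-9]+\),\([0-9]+\)"))
```

Because the swap is symmetric, applying the function twice returns the original pattern.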

Syntax Rules

[1] char

A character matches itself, unless it is a special character (metachar): . \ [ ] * + ^ $ and, in posix mode, ( ).

[2] .

matches any character.

[3] \

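matches the character following it, except when the backslash introduces one of the special forms described below:

when followed by a left or right round bracket, if not in posix mode (see [7]);
when followed by a digit 1 to 9 (see [8]);
when followed by a left or right angle bracket (see [9]);
when followed by d, D, s, S, w or W (see [10]);
when followed by x and two hexadecimal digits (see [11]).

Backslash is used as an escape character for all other meta-characters, and itself.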

[4] [set]

matches one of the characters in the set.

If the first character in the set is ^, it matches the characters NOT in the set, i.e. complements the set.

A shorthand S-E (start dash end) is used to specify a set of characters S up to E, inclusive.

The special characters ] and - have no special meaning if they appear as the first chars in the set. To include both, put - first: [-]A-Z], or add a backslash before each of them: [A-Z\]\-]

example     match
[-]|]       the three chars -, ] and |
[]-|]       any char from ] to |
[a-z]       any lowercase alpha
[^-]]       any char except - and ]
[^A-Z]      any char except uppercase alpha
[a-zA-Z]    any alpha
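A few of these sets can be tried in Python as a sketch. Only forms that Python's re module reads the same way are used; Python parses an unescaped ] after a leading - differently, so sets such as [-]|] and [^-]] are left out and the backslash form is shown instead. The sample string is invented.

```python
import re

sample = "a-B]c3"

lower = re.findall(r"[a-z]", sample)        # any lowercase alpha
not_upper = re.findall(r"[^A-Z]", sample)   # any char except uppercase alpha
alpha = re.findall(r"[a-zA-Z]", sample)     # any alpha
escaped = re.findall(r"[A-Z\]\-]", sample)  # uppercase alpha, ] and -

print(lower, not_upper, alpha, escaped)
```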

[5] *

any regular expression of form [1] to [4] (except the [7], [8] and [9] forms of [3]), followed by the closure char (*), matches zero or more matches of that form.

For example, [a-z]* means "Zero or more occurrences of a lower case alpha character".

[6] +

same as [5], except it matches one or more.

Both [5] and [6] are greedy: they match as many characters as possible, until a mismatch is encountered.
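The behaviour of the two closures can be sketched in Python, whose * and + are also greedy. The sample strings are invented, and Python's engine is used only as a stand-in for the Editor's.

```python
import re

# * accepts zero occurrences, so it matches the empty string at position 0.
star = re.match(r"[a-z]*", "123abc").group()

# + needs at least one occurrence, so the search moves on to "abc".
plus = re.search(r"[a-z]+", "123abc").group()

# Greedy: .+ runs to the last > it can reach, not the first one.
greedy = re.search(r"<.+>", "<a><b>").group()

print(repr(star), plus, greedy)
```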

[7]

a regular expression in the form [1] to [12], enclosed as \(form\) (or (form) with posix flag), matches what form matches. The enclosure creates a set of tags, used for [8] and for pattern substitution. The tagged forms are numbered starting from 1.

[8]

a \ followed by a digit 1 to 9 matches whatever a previously tagged regular expression ([7]) matched.
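A tagged expression followed by a reference to it can be sketched in Python, using the posix-style parentheses. The inputs are invented examples.

```python
import re

# ([a-z])\1 matches a lowercase letter followed by the same letter again.
# findall returns what tag 1 matched for each hit.
doubled = re.findall(r"([a-z])\1", "letter book")

# ([0-9]+),\1 matches a number, a comma, and the same number repeated.
same_pair = re.search(r"([0-9]+),\1", "7,7 8,9").group()

print(doubled, same_pair)
```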

[9] \< \>

a regular expression starting with a \< construct and/or ending with a \> construct restricts the pattern matching to the beginning of a word and/or the end of a word. A word is defined to be a character string beginning and/or ending with the characters A-Z a-z 0-9 and _. The Editor extends this definition by user setting. The word must also be preceded and/or followed by any character outside those mentioned.
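Python's re module has no \< and \> constructs, so the following sketch substitutes \b, which matches at either edge of a word. It is meant only to show the effect of restricting a match to a word start or word end; the sample text is invented.

```python
import re

text = "cat concat catalog"

# Assumption: \b stands in for \< (word start) and \> (word end).
word_start = [m.start() for m in re.finditer(r"\bcat", text)]
word_end = [m.start() for m in re.finditer(r"cat\b", text)]
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]

print(word_start, word_end, whole_word)
```

The pattern anchored at the word start finds "cat" and the beginning of "catalog", the one anchored at the word end finds "cat" and the tail of "concat", and only the standalone "cat" satisfies both.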

[10] \l

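a backslash followed by d, D, s, S, w or W becomes a character class (both inside and outside [] sets):

d: decimal digits
D: any char except decimal digits
s: whitespace (space, \t \n \r \f \v)
S: any char except whitespace
w: alphanumeric and underscore
W: any char except alphanumeric and underscore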

[11] \xHH

a backslash followed by x and two hexadecimal digits becomes the character whose ASCII code is equal to these digits. If not followed by two hexadecimal digits, it represents the character 'x' itself.
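The character classes of [10] and the hexadecimal escape of [11] exist in Python's re module under the same names, so they can be sketched as follows. The sample text is invented, and Python's \w and \s are Unicode-aware by default, which may differ from the Editor for non-ASCII text.

```python
import re

# Invented sample; the "\t" in this ordinary string is a real tab character.
text = "id_7 = 0x41;\tA"

digits = re.findall(r"\d+", text)   # runs of decimal digits
words = re.findall(r"\w+", text)    # runs of alphanumerics and underscore
spaces = re.findall(r"\s", text)    # single whitespace characters
hex_a = re.findall(r"\x41", text)   # the character with code 0x41, i.e. A

print(digits, words, spaces, hex_a)
```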

[12]

a composite regular expression xy, where x and y are in the form [1] to [11], matches the longest match of x followed by a match for y.

[13] ^ $

a regular expression starting with a ^ character and/or ending with a $ character restricts the pattern matching to the beginning of the line and/or the end of the line. Elsewhere in the pattern, ^ and $ are treated as ordinary characters.
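Line anchoring can be sketched in Python. The re.MULTILINE flag is used here on the assumption that it approximates a search applied line by line; the two-line sample text is invented. Python does not treat ^ and $ as ordinary characters elsewhere in a pattern, so that part of the rule is not shown.

```python
import re

text = "alpha beta\nbeta alpha"

# ^alpha only matches "alpha" at the start of a line (first line here).
at_start = [m.start() for m in re.finditer(r"^alpha", text, flags=re.MULTILINE)]

# alpha$ only matches "alpha" at the end of a line (second line here).
at_end = [m.start() for m in re.finditer(r"alpha$", text, flags=re.MULTILINE)]

print(at_start, at_end)
```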

Acknowledgments

Most of this documentation was originally written by Ozan S. Yigit. Additions by Neil Hodgson and Philippe Lhoste. All of this document is in the public domain.