The n-th attempt to learn how regular expression works.
Based on Linux Command Line and Shell Scripting Bible.
What
The regular expression engine reads the expression. There are 2 kinds of popular engines:
POSIX basic regular expression, BRE
POSIX extended regular expression, ERE
Most of Linux tools support at least BRE. sed supports only BRE, while awk supports ERE.
BRE
RE is case sensitive.
special character: .*[]^${}+?|() / use \. instead.
^ denotes the head of the line. If it not appeared at the beginning of the expression, it is a normal character
$ denotes the end of the line.
. denotes there must be a character. e.g. .at does not match at.
character class []. any character appears in [] is ok. use to exclusive some characters.
- * denotes this character appears 0 or any time. use ‘bla.*bla’ to skip some words. it also works for character class.
ERE
- ? denotes the character can appear 0 or 1 time.
- + denotes the character must appear at least once.
- {} denotes the interval, which means the character must appears given m times(e.g. a{1}), or at least m times and at most n times(e.g. a{1,2}). Note in gawk ‘—re-interval’ must be specified.
- | denotes any expression matched is ok.
- () denotes the words in the () is seemed as one character.
Other RE pattern
简写字符集
简写 | 描述 |
---|---|
. | 匹配除换行符以外的任意字符 |
\w | 匹配所有字母和数字的字符:[a-zA-Z0-9_] |
\W | 匹配非字母和数字的字符:[^\w] |
\d | 匹配数字:[0-9] |
\D | 匹配非数字:[^\d] |
\s | 匹配空格符:[\t\n\f\r\p{Z}] |
\S | 匹配非空格符:[^\s] |
断言
断言保证我们所匹配的字符串满足断言所描述的模式。(但是该断言描述的模式并不属于匹配的字符串)
符号 | 描述 |
---|---|
?= | 正向先行断言 |
?! | 负向先行断言 |
?<= | 正向后行断言 |
?<! | 负向后行断言 |
以正向后行断言举例。(?<=(T|t)he\s)(fat|mat)
匹配The或the后面的fat或mat
标记
标记 | 描述 |
---|---|
i | 不区分大小写:将匹配设置为不区分大小写。 |
g | 全局搜索:搜索整个输入字符串中的所有匹配。 |
m | 多行匹配:会匹配输入字符串每一行。 |
如正则表达式 /at(.)?$/gm
,表示:小写字母 a
,后跟小写字母 t
,匹配除了换行符以外任意字符零次或一次。而且因为 m
标记,现在正则表达式引擎匹配字符串中每一行的末尾。