learning regular expression

The n-th attempt to learn how regular expression works.

Based on Linux Command Line and Shell Scripting Bible.

What

The regular expression engine reads the expression. There are 2 kinds of popular engines:

  • POSIX basic regular expression, BRE

  • POSIX extended regular expression, ERE

Most of Linux tools support at least BRE. sed supports only BRE, while awk supports ERE.

BRE

RE is case sensitive.

special character: .*[]^${}+?|() / use \. instead.

  • ^ denotes the head of the line. If it not appeared at the beginning of the expression, it is a normal character

  • $ denotes the end of the line.

  • . denotes there must be a character. e.g. .at does not match at.

  • character class []. any character appears in [] is ok. use to exclusive some characters.

special character class
  • * denotes this character appears 0 or any time. use ‘bla.*bla’ to skip some words. it also works for character class.

ERE

  • ? denotes the character can appear 0 or 1 time.
  • + denotes the character must appear at least once.
  • {} denotes the interval, which means the character must appears given m times(e.g. a{1}), or at least m times and at most n times(e.g. a{1,2}). Note in gawk ‘—re-interval’ must be specified.
  • | denotes any expression matched is ok.
  • () denotes the words in the () is seemed as one character.

Other RE pattern

简写字符集

简写 描述
. 匹配除换行符以外的任意字符
\w 匹配所有字母和数字的字符:[a-zA-Z0-9_]
\W 匹配非字母和数字的字符:[^\w]
\d 匹配数字:[0-9]
\D 匹配非数字:[^\d]
\s 匹配空格符:[\t\n\f\r\p{Z}]
\S 匹配非空格符:[^\s]

断言

断言保证我们所匹配的字符串满足断言所描述的模式。(但是该断言描述的模式并不属于匹配的字符串)

符号 描述
?= 正向先行断言
?! 负向先行断言
?<= 正向后行断言
?<! 负向后行断言

以正向后行断言举例。(?<=(T|t)he\s)(fat|mat) 匹配The或the后面的fat或mat

标记

标记 描述
i 不区分大小写:将匹配设置为不区分大小写。
g 全局搜索:搜索整个输入字符串中的所有匹配。
m 多行匹配:会匹配输入字符串每一行。

如正则表达式 /at(.)?$/gm,表示:小写字母 a,后跟小写字母 t,匹配除了换行符以外任意字符零次或一次。而且因为 m 标记,现在正则表达式引擎匹配字符串中每一行的末尾。