Regular Expressions: From "Getting Started" to "Getting Started"

Author: 编程我只用CPP
Published: October 16, 2017
Category: Programming Languages

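1. Overview

A regular expression (in Chinese 正则表达式, also rendered as 正规表示式, 正规表示法, 正规表达式, 规则表达式, or 常规表示法; in code often abbreviated as regex, regexp, or RE) is a concept from computer science. A regular expression uses a single string to describe and match a whole set of strings that follow some syntactic rule. In many text editors, regular expressions are used to search for and replace text that matches a pattern.

The "Regular" in Regular Expression is usually translated into Chinese as "正则", "正规", or "常规". Here "Regular" means "rule" or "regularity", so a Regular Expression is "an expression that describes some rule".

This post introduces the basic syntax of regular expressions. All code is written in Python, in a Python 2.7 + re module environment. For how to work with regular expressions in Python, see: How to Use Regular Expressions in Python (http://www.dyxmq.cn/other/regex-rumen.html)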

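2. Syntax Rules

2.1 Metacharacters

2.1.1 Rules

\s: matches a whitespace character, including the space, \t, \n, and so on.
\d: matches a digit, 0-9.
\w: matches a letter, digit, or underscore.
\b: matches a boundary, that is, the edge of a word, including a word at the very start or end of the string.
.: matches any character except a newline.
^: matches the start of the string.
$: matches the end of the string.
[]: matches any one of the characters listed inside the brackets; for example, [abc] matches a, b, or c.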
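
The examples in the next section use \d, \w, and character sets, so here is a minimal sketch of the remaining metacharacters in the list: \b, ^, $, and the dot. The sample text and variable names are made up for illustration, and the code is written to run under both Python 2.7 and Python 3.

```python
import re

text = "cat concat cat\ncatalog"

# \b keeps "cat" from matching inside "concat" or "catalog".
whole_words = re.findall(r"\bcat\b", text)

# ^ and $ anchor to the start and end of the whole string
# (re.MULTILINE is not used here).
starts_with_cat = re.match(r"^cat", text) is not None
ends_with_log = re.search(r"log$", text) is not None

# . matches anything except a newline, so ".+" stops at the first "\n".
first_line = re.match(r".+", text).group()

# A character set matches exactly one of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")

print(whole_words)      # ['cat', 'cat']
print(starts_with_cat)  # True
print(ends_with_log)    # True
print(first_line)       # cat concat cat
print(vowels)           # ['e', 'e']
```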

2.1.2 Examples

Match "Hello" in "HelloWorld"

import re

s = "HelloWorld"
p = r"Hello"
rs = re.match(p, s)  # re.match only matches at the start of the string
print(rs.group())  # Hello

Match a mobile phone number

import re

s = "13977889988"
# In a raw string a single backslash is enough: \d is one digit
p = r"1\d\d\d\d\d\d\d\d\d\d"
rs = re.match(p, s)
if rs is not None:
    print(rs.group())  # 13977889988
else:
    print("no matched")

Match two-digit numbers that do not start with 0

import re

s = "02 33 45 87 09"
p = r"[1-9]\d"
print(re.findall(p, s))  # ['33', '45', '87']

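Match strings that start with a lowercase letter

import re

s = "adf Bdc A45 e87 c09"
# \b pins the match to the start of a word; \w* takes the rest of the word
p = r"\b[a-z]\w*"
print(re.findall(p, s))  # ['adf', 'e87', 'c09']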

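2.2 Quantifiers

Quantifiers specify how many times the preceding character may occur.

2.2.1 Rules

*: repeat zero or more times.
+: repeat one or more times.
?: repeat zero times or once.
{n}: repeat exactly n times.
{m,n}: repeat between m and n times.
{n,}: repeat n or more times.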
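
To make the differences between these quantifiers concrete, here is a small sketch that checks each one against short strings. matches_whole is a hypothetical helper written just for this demo (it is not part of the re module); it appends $ so that the pattern has to cover the entire string. The last two lines show why the braces are written without spaces: with a space inside them, Python's re module treats the braces as ordinary text rather than as a quantifier.

```python
import re

def matches_whole(pattern, s):
    # Hypothetical helper: True only if the pattern covers the entire string.
    return re.match(r"(?:" + pattern + r")$", s) is not None

print(matches_whole(r"ab*c", "ac"))      # True: * allows zero b's
print(matches_whole(r"ab+c", "ac"))      # False: + needs at least one b
print(matches_whole(r"ab?c", "abbc"))    # False: ? allows at most one b
print(matches_whole(r"a{3}", "aaa"))     # True: exactly three a's
print(matches_whole(r"a{2,3}", "aaaa"))  # False: more than three a's
print(matches_whole(r"a{2,}", "aaaaa"))  # True: two or more a's

# With a space inside the braces, the braces are matched literally.
print(matches_whole(r"a{2, 3}", "aa"))       # False
print(matches_whole(r"a{2, 3}", "a{2, 3}"))  # True
```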

2.2.2 Examples

Above, checking a mobile number took ten \d in a row. With a quantifier it can be written as:

import re

s = "13977889988"
p = r"1\d{10}"
rs = re.match(p, s)
if rs is not None:
    print(rs.group())  # 13977889988
else:
    print("no matched")
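
One thing worth knowing about the pattern above: re.match only requires the pattern to match at the start of the string, so a 12-digit string also passes. The sketch below adds $ to turn the check into a whole-string validation. MOBILE and is_mobile are hypothetical names chosen for illustration, and "11 digits starting with 1" is simply the rule this post uses, not a complete phone-number specification.

```python
import re

# Hypothetical validator: an 11-digit number starting with 1, and nothing after it.
MOBILE = re.compile(r"1\d{10}$")

def is_mobile(s):
    return MOBILE.match(s) is not None

# Without the $ anchor, a longer string still matches on its first 11 digits.
prefix_only = re.match(r"1\d{10}", "139778899881") is not None

print(is_mobile("13977889988"))   # True
print(is_mobile("139778899881"))  # False: 12 digits
print(is_mobile("23977889988"))   # False: does not start with 1
print(prefix_only)                # True
```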

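Find all three-digit numbers in a string

import re

s = "100 88 9 112 9998 197 9876 77"
# \b on both sides keeps the match from landing inside 9998 or 9876
p = r"\b\d{3}\b"
print(re.findall(p, s))  # ['100', '112', '197']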
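
A quantifier on its own does not know where a number starts or ends. As a quick sketch on the same input string, compare a bare \d{3} with a version wrapped in \b word boundaries (the variable names loose and strict are just for illustration):

```python
import re

s = "100 88 9 112 9998 197 9876 77"

# A bare \d{3} is happy to take the first three digits of a longer number.
loose = re.findall(r"\d{3}", s)

# \b on both sides requires the three digits to be a whole "word".
strict = re.findall(r"\b\d{3}\b", s)

print(loose)   # ['100', '112', '999', '197', '987']
print(strict)  # ['100', '112', '197']
```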

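Find all words that are 4 to 5 characters long

import re

s = "abc hello defg world higkli"
# Without \b, the first five letters of "higkli" would match as well
p = r"\b\w{4,5}\b"
print(re.findall(p, s))  # ['hello', 'defg', 'world']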

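2.3 Negation

2.3.1 Rules

\W: matches any character that is not a letter, digit, or underscore.
\D: matches a non-digit.
\S: matches a non-whitespace character.
\B: matches a position that is not a word boundary.
[^x]: matches any character except x.
[^abc]: matches any character except a, b, and c.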
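
Here is a short sketch of the negated classes in everyday use. The sample strings and variable names are invented for illustration: \D strips everything that is not a digit, \W+ splits on runs of punctuation and spaces, [^abc] filters out listed characters, and \B finds "cat" only where it is not at the start of a word.

```python
import re

# \D: drop every non-digit to normalize a phone number.
digits_only = re.sub(r"\D", "", "tel: 139-7788 9988")

# \W+: split on runs of characters that are not letters, digits, or underscore.
words = re.split(r"\W+", "hello, world! 123")

# [^abc]: keep only the characters that are not a, b, or c.
others = re.findall(r"[^abc]", "abcdef")

# \B: "cat" must not start at a word boundary, so only the one in "concat" matches.
inner = re.findall(r"\Bcat", "cat concat")

print(digits_only)  # 13977889988
print(words)        # ['hello', 'world', '123']
print(others)       # ['d', 'e', 'f']
print(inner)        # ['cat']
```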

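2.3.2 Examples

Find all non-whitespace runs in a string

import re

s = "hello 123 {world} world [456] "
p1 = r"\S+"
# Braces and brackets are non-whitespace too, so they stay in the results
print(re.findall(p1, s))  # ['hello', '123', '{world}', 'world', '[456]']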

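Find all three-digit numbers that do not start with 0

import re

s = "012 3456 789 1011 999"
# [^0] alone would also accept a space, so \D is excluded too:
# [^0\D] means "a digit other than 0"; \b keeps matches out of longer numbers
p = r"\b[^0\D]\d{2}\b"
print(re.findall(p, s))  # ['789', '999']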

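2.4 Groups

2.4.1 Defining groups

Compared with the material above, groups are the more advanced part of regex syntax and a little more complex. In my view they are also the essence of regular expressions: use groups well and your patterns become very flexible.

To turn part of a match into a group, just wrap it in parentheses. For example, to match the tags in a piece of HTML:

import re

s = "<html><head>This is regex</head></html>"
# \w does not match "/", so the closing-tag groups spell it out
p = r"<(\w+)><(\w+)>([\w\s]*)<(/\w+)><(/\w+)>"
rs = re.match(p, s)
if rs is not None:
    print(rs.groups())  # ('html', 'head', 'This is regex', '/head', '/html')
    print(rs.group(1))  # html
    print(rs.group(2))  # head
    print(rs.group(3))  # This is regex
    print(rs.group(4))  # /head
    print(rs.group(5))  # /html
else:
    print("no matched")

Here the pattern <(\w+)><(\w+)>([\w\s]*)<(/\w+)><(/\w+)> is split into five groups, one for each pair of parentheses.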

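2.4.2 Referencing groups

Suppose the string in the HTML example above is changed to <head><html>This is regex</head></html> and matched with the same regular expression. It still matches, giving ('head', 'html', 'This is regex', '/head', '/html'). In a web page, however, this markup is wrong, because the tags are not properly paired.

How do we solve this? This is where backreferences come in. A backreference refers to the string that an earlier group matched during the current match. It is written as a backslash followed by the group number; for example, \1 refers to the first group, which in the example above is the string html.

With backreferences, the HTML pattern above can be rewritten as:

import re

s = "<html><head>This is regex</head></html>"
p = r"<(\w+)><(\w+)>([\w\s]*)</\2></\1>"
rs = re.match(p, s)
if rs is not None:
    # Backreferences are not groups themselves, so only three groups come back
    print(rs.groups())  # ('html', 'head', 'This is regex')
else:
    print("no matched")

Now, if the input string is changed to <head><html>This is regex</head></html>, there is no match and the program prints no matched.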
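
To see both cases side by side, here is a minimal sketch that wraps the backreference pattern in a function. NESTED and parse_nested are hypothetical names used only for this demo, and the pattern handles just this two-level toy structure; it is not a general HTML parser.

```python
import re

# \2 and \1 must repeat exactly what groups 2 and 1 captured.
NESTED = re.compile(r"<(\w+)><(\w+)>([\w\s]*)</\2></\1>")

def parse_nested(html):
    # Returns (outer_tag, inner_tag, text), or None if the tags are not paired.
    m = NESTED.match(html)
    return m.groups() if m else None

good = parse_nested("<html><head>This is regex</head></html>")
bad = parse_nested("<head><html>This is regex</head></html>")

print(good)  # ('html', 'head', 'This is regex')
print(bad)   # None
```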

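2.4.3 Naming groups

There are two ways to reference a group: by number, or by name. The rules are:

(?P<name>...): gives the group a name.
(?P=name): refers to the string matched by the group called name.

Matching the HTML tags with named groups: r"<(?P<html_tag>\w+)><(?P<head_tag>\w+)>([\w\s]*)</(?P=head_tag)></(?P=html_tag)>"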
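
As a runnable sketch of the named-group pattern (written with the \w and \s escapes it needs), the snippet below shows how named groups can be read back by name or all at once through groupdict(), while the unnamed text group is still available by number.

```python
import re

s = "<html><head>This is regex</head></html>"
p = (r"<(?P<html_tag>\w+)><(?P<head_tag>\w+)>([\w\s]*)"
     r"</(?P=head_tag)></(?P=html_tag)>")

m = re.match(p, s)

outer = m.group("html_tag")  # look a group up by name
text = m.group(3)            # the unnamed group is still group number 3
named = m.groupdict()        # only the named groups appear here

print(outer)  # html
print(text)   # This is regex
print(named == {"html_tag": "html", "head_tag": "head"})  # True
```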

2.5 Greedy and Non-Greedy Matching

Suppose we have the following string:

MaQian HuNan 168-8877-7788

I want to extract the phone number from it, using the regular expression (.+)(\d*-\d*-\d*):

import re

s = "MaQian HuNan 168-8877-7788"
p = r"(.+)(\d*-\d*-\d*)"
rs = re.match(p, s)
if rs is None:
    print("no matched")
else:
    print(rs.group(1))
    print(rs.group(2))

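The expected result is:

MaQian HuNan
168-8877-7788

But when we actually run the program, the result is:

MaQian HuNan 168
-8877-7788

That is not what we expected. Why?

Looking closely, the string 168 also falls within what .+ can match, so 168 ends up in the first group. The second group still matches, because each \d* is allowed to match zero digits.

This is the greedy behavior of regular expressions. Greedy means "as much as possible": as long as the overall match can still succeed, the current sub-pattern consumes as many characters as it can. Quantifiers are greedy by default. To turn greediness off, add a ? after the quantifier. The rules are: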

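*?: repeat zero or more times, as few times as possible.
+?: repeat one or more times, as few times as possible.
??: repeat zero times or once, as few times as possible.

So writing the regular expression above as (.+?)(\d+-\d+-\d+) produces the expected output.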
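
Here is a small sketch that runs both versions on the same string so the difference is visible. One detail to be aware of: with the non-greedy pattern, the first group keeps the space in front of the number, which is invisible when printed but shows up in the tuple. The second half applies the same idea to tags, using a made-up input string.

```python
import re

s = "MaQian HuNan 168-8877-7788"

# Greedy: .+ grabs as much as it can and only gives back what it must.
greedy = re.match(r"(.+)(\d*-\d*-\d*)", s).groups()

# Non-greedy: .+? stops as soon as the rest of the pattern can match.
lazy = re.match(r"(.+?)(\d+-\d+-\d+)", s).groups()

print(greedy)  # ('MaQian HuNan 168', '-8877-7788')
print(lazy)    # ('MaQian HuNan ', '168-8877-7788')

# The same effect on tags: greedy spans both tags, non-greedy takes one at a time.
greedy_tags = re.findall(r"<.+>", "<a><b>")
lazy_tags = re.findall(r"<.+?>", "<a><b>")

print(greedy_tags)  # ['<a><b>']
print(lazy_tags)    # ['<a>', '<b>']
```

If the trailing space matters, calling .strip() on the first group is one simple way to remove it.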

Last modified: October 16, 2017

Comments are closed.

ex
There is no configuration diagram, so I can't tell where it was added.
呆萌
What happened at the end? Why wasn't it finished?

