The Python re Module

Posted by Yuchuan on 2020/01/14 23:04:03

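[Abstract] Whether or not you end up doing Python development, every programmer should know the basics of regular expressions. If you plan to work on web crawlers, the topic deserves even more careful study.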

The re module

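Before getting to the main topic, let's look at an example: https://reg.jd.com/reg/person?ReturnUrl=https%3A//www.jd.com/

This is JD.com's registration page. When it opens, it prompts us to enter our personal information.
If we type an arbitrary value such as 11111111111 into the phone number field, the page tells us the format is wrong.
How is this check implemented?

Suppose you are writing a piece of Python code like this: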

phone_num = input("please input your phone num:")

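How would you decide whether phone_num is valid?

Using the facts that a phone number has exactly 11 digits and, for this example, starts only with 13, 14, 15, or 18, we can write the following Python code: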

while True:
    phone_num = input("please input your phone num:")
    # Valid means: 11 characters, all digits, starting with 13, 14, 15, or 18.
    if len(phone_num) == 11 \
            and phone_num.isdigit() \
            and (phone_num.startswith("13")
                 or phone_num.startswith("14")
                 or phone_num.startswith("15")
                 or phone_num.startswith("18")):
        print("phone number is valid")
    else:
        print("phone number is invalid")

Output:

D:\YuchuanProjectData\PythonProject\venv\Scripts\python.exe D:/YuchuanProjectData/PythonProject/YuchuanDemo004.py
please input your phone num:45654789254
phone number is invalid
please input your phone num:45624
phone number is invalid
please input your phone num:13145697445
phone number is valid
please input your phone num:18654286954
phone number is valid
please input your phone num:1465489875
phone number is invalid
please input your phone num:
phone number is invalid
please input your phone num:
Process finished with exit code -1

That is your version. Now let me show you mine:

import re

while True:
    phone_num = input("please input your phone num:")
    # ^ and $ anchor the pattern so the whole input must be 13/14/15/18 plus nine digits.
    if re.match(r'^(13|14|15|18)[0-9]{9}$', phone_num):
        print("phone number is valid")
    else:
        print("phone number is invalid")

Output:

D:\YuchuanProjectData\PythonProject\venv\Scripts\python.exe D:/YuchuanProjectData/PythonProject/YuchuanDemo004.py
please input your phone num:14365478954
phone number is valid
please input your phone num:13457896547
phone number is valid
please input your phone num:16547895478
phone number is invalid
please input your phone num:154754665
phone number is invalid
please input your phone num:

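Comparing the two versions, which one do you prefer right now? You will probably still say the first. Why? Because it requires learning nothing new!
But suppose I hand you a file and ask you to pull every phone number out of it. Try writing that with plain string methods.
With the skills covered today, it takes only a few minutes.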
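As a preview, the file-scanning task could be sketched roughly as follows. It reuses the article's simplified rule (13, 14, 15, or 18 followed by nine digits). The name `find_phone_numbers` and the sample text are made up for illustration, and the lookarounds are an added assumption so that an 11-digit slice of a longer digit run is not reported.

```python
import re

# Same simplified rule as the article: 13/14/15/18 followed by nine digits.
# (?<!\d) and (?!\d) stop a match from starting or ending inside a longer
# run of digits.
PHONE_RE = re.compile(r'(?<!\d)(?:13|14|15|18)\d{9}(?!\d)')


def find_phone_numbers(text):
    """Return every substring of text that looks like a phone number."""
    return PHONE_RE.findall(text)


sample = "Call 13145697445 or 18654286954, not 16547895478 or 131456974459999."
print(find_phone_numbers(sample))  # ['13145697445', '18654286954']
```

For a real file, the same function could be applied to the result of reading the file into a string.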

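Today we study Python's re module and regular expressions, which answer the question raised above. Regular expressions matter not only in Python but across the whole programming world.

Whether or not you end up doing Python development, every programmer should know the basics of regular expressions. If you plan to work on web crawlers, the topic deserves even more careful study.
Keep in mind, though, that regular expressions exist independently of the re module. The relationship between the re module and regular expressions is like the relationship between the time module and time itself.
Before you learned Python you had never heard of the time module, yet you already understood time: 12:30 means half past noon (a nice time, since class is usually about to end).
Time has its own format: year, month, day, hour, minute, second, 12 months, 365 days... These became fixed rules that you memorized long ago. The time module is just a tool Python provides to make working with time convenient.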

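Regular expressions and the re module

Regular expressions themselves are not tied to Python either; they are simply a set of rules for matching string content.

A commonly quoted definition: a regular expression is a logical formula for operating on strings. It uses predefined special characters, and combinations of them, to form a "rule string" that expresses a filtering logic applied to strings.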

Common functions in the re module

import re

ret = re.findall('a', 'hello word all are')  # return every match, collected in a list
print(ret)  # result: ['a', 'a']

ret = re.search('a', 'hello word all are').group()
print(ret)  # result: 'a'
# search scans the string for the first place the pattern matches and returns a
# match object; call group() on it to get the matched text. If nothing matches,
# search returns None, and calling .group() on None raises AttributeError.

ret = re.match('a', 'all').group()  # like search, but only matches at the start of the string
print(ret)  # result: 'a'

ret = re.split('[ab]', 'abcd')  # split at 'a' to get '' and 'bcd', then split at 'b' to get '' and 'cd'
print(ret)  # ['', '', 'cd']

ret = re.sub(r'\d', 'H', 'all3xuan4yuan4', count=2)  # replace digits with 'H'; count=2 replaces only the first two
print(ret)  # allHxuanHyuan4

ret = re.subn(r'\d', 'H', 'all3xuan4yuan4')  # replace digits with 'H'; returns (new string, number of replacements)
print(ret)  # ('allHxuanHyuanH', 3)

obj = re.compile(r'\d{3}')  # compile the pattern into a pattern object; it matches three digits
ret = obj.search('abc987efg')  # call search on the pattern object, passing the string to scan
print(ret.group())  # result: 987

ret = re.finditer(r'\d', 'ab3cd4784e')  # finditer returns an iterator of match objects
print(ret)  # <callable_iterator object at 0x...> (the address varies)
print(next(ret).group())  # first result
print(next(ret).group())  # second result
print([i.group() for i in ret])  # all remaining results

Output:

D:\YuchuanProjectData\PythonProject\venv\Scripts\python.exe D:/YuchuanProjectData/PythonProject/YuchuanDemo004.py
['a', 'a']
a
a
['', '', 'cd']
allHxuanHyuan4
('allHxuanHyuanH', 3)
987
<callable_iterator object at 0x00000204AA13E1D0>
3
4
['7', '8', '4']
Process finished with exit code 0
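Because `re.search` and `re.match` return None when nothing matches, chaining `.group()` directly only works when a match is guaranteed. A minimal sketch of a safer pattern follows; the helper name `first_match` is hypothetical and not part of the re module.

```python
import re


def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    m = re.search(pattern, text)
    # search returns a match object on success and None on failure,
    # so test the result before calling .group().
    return m.group() if m else None


print(first_match(r'\d+', 'abc987efg'))  # 987
print(first_match(r'\d+', 'no digits'))  # None

# match only looks at the start of the string, so it fails here
# even though search succeeds on the same input.
print(re.match(r'\d+', 'abc987efg'))     # None
```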

Notes:

1. Group priority in findall:

import re

ret = re.findall(r'www\.(baidu|google)\.com', 'www.google.com')
print(ret)  # ['google']  findall returns the capturing group's content first; to get the whole match, make the group non-capturing with ?:

ret = re.findall(r'www\.(?:baidu|google)\.com', 'www.google.com')
print(ret)  # ['www.google.com']

Output:

D:\YuchuanProjectData\PythonProject\venv\Scripts\python.exe D:/YuchuanProjectData/PythonProject/YuchuanDemo004.py
['google']
['www.google.com']
Process finished with exit code 0
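The same group rule extends to patterns with more than one capturing group, in which case findall returns tuples. A small sketch, with a sample string chosen only for illustration:

```python
import re

text = 'www.baidu.com www.google.com'

# No capturing group: each item is the whole match.
whole = re.findall(r'www\.\w+\.com', text)
# One capturing group: each item is that group's text.
one_group = re.findall(r'www\.(\w+)\.com', text)
# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(www)\.(\w+)\.com', text)

print(whole)       # ['www.baidu.com', 'www.google.com']
print(one_group)   # ['baidu', 'google']
print(two_groups)  # [('www', 'baidu'), ('www', 'google')]
```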

2. Group priority in split:

ret = re.split(r"\d+", "all3xuan4yuan4")
print(ret)  # result: ['all', 'xuan', 'yuan', '']

ret = re.split(r"(\d+)", "all3xuan4yuan4")
print(ret)  # result: ['all', '3', 'xuan', '4', 'yuan', '4', '']

# Wrapping the pattern in () changes what split returns:
# without () the matched separators are dropped, with () they are kept in the list.
# This matters whenever the separators themselves are needed later.
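One place where keeping the separators could be useful is breaking an arithmetic expression into numbers and operators. The sketch below assumes an expression with no spaces, no parentheses, and no negative numbers; it also shows the `maxsplit` argument, which limits how many splits are made.

```python
import re

expr = '60-30+8*2'

# The capturing group keeps each operator in the result list.
tokens = re.split(r'([-+*/])', expr)
print(tokens)  # ['60', '-', '30', '+', '8', '*', '2']

# maxsplit=1 stops after the first separator.
head_tail = re.split(r'\d+', 'all3xuan4yuan4', maxsplit=1)
print(head_tail)  # ['all', 'xuan4yuan4']
```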

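Exercises and extensions

1. Matching tags

import re

ret = re.search(r"<(?P<tag_name>\w+)>\w+</(?P=tag_name)>", "<h1>helloworld</h1>")
# A group can be given a name with the (?P<name>...) syntax;
# (?P=name) then requires the same text that the named group matched.
# The value is available from the match via group('name').
print(ret.group('tag_name'))  # result: h1
print(ret.group())  # result: <h1>helloworld</h1>

ret = re.search(r"<(\w+)>\w+</\1>", "<h1>helloworld</h1>")
# Without a name, \number refers back to a group by its index and requires
# the same text that the group matched.
# The value is available from the match via group(number).
print(ret.group(1))  # result: h1
print(ret.group())  # result: <h1>helloworld</h1>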
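Named groups can also be read all at once with `groupdict()` and referenced in a replacement string with `\g<name>`. The date pattern and sample strings below are illustrative assumptions, not taken from the article.

```python
import re

DATE_RE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

m = DATE_RE.search('born on 1990-07-12 in spring')
print(m.group('year'))  # 1990
print(m.groupdict())    # {'year': '1990', 'month': '07', 'day': '12'}

# \g<name> inserts the text captured by a named group into the replacement.
swapped = DATE_RE.sub(r'\g<day>/\g<month>/\g<year>', '1990-07-12')
print(swapped)          # 12/07/1990
```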

2. Matching integers

import re

ret = re.findall(r"\d+", "1-2*(60+(-40.35/5)-(-4*3))")
print(ret)  # ['1', '2', '60', '40', '35', '5', '4', '3']
ret = re.findall(r"-?\d+\.\d*|(-?\d+)", "1-2*(60+(-40.35/5)-(-4*3))")
print(ret)  # ['1', '-2', '60', '', '5', '-4', '3']
ret.remove("")
print(ret)  # ['1', '-2', '60', '5', '-4', '3']
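The alternation above works by letting the float branch consume "-40.35" without capturing it, which is why an empty string appears in the list. It also reports '-2' although the minus in "1-2" is a subtraction. One possible alternative, offered only as a sketch, uses lookarounds so that digits touching a decimal point or another digit are skipped entirely:

```python
import re

expr = "1-2*(60+(-40.35/5)-(-4*3))"

# (?<![\d.]) : the match may not start right after a digit or a dot
# (?![\d.])  : the match may not be followed by a digit or a dot
# A minus sign is kept only when it does not directly follow a digit.
integers = re.findall(r"(?<![\d.])-?\d+(?![\d.])", expr)
print(integers)  # ['1', '2', '60', '5', '-4', '3']
```

Here the 2 in "1-2" comes out unsigned because its minus follows a digit, while "-4" keeps its sign because the minus follows an opening parenthesis.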

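3. Matching numbers

1. Match the email address on each line of a text
      http://blog.csdn.net/make164492212/article/details/51656638

2. Match the date string on each line of a text, for example '1990-07-12'.

   The 12 months of a year: ^(0?[1-9]|1[0-2])$
   The 31 days of a month: ^((0?[1-9])|((1|2)[0-9])|30|31)$

3. Match a QQ number (Tencent QQ numbers start at 10000): [1-9][0-9]{4,}

4. Match a floating-point number: ^(-?\d+)(\.\d+)?$  or  -?\d+\.?\d*

5. Match Chinese characters: ^[\u4e00-\u9fa5]{0,}$

6. Match all integers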
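A few of these exercises can be tried out as below. The patterns follow the ones listed above with small assumed adjustments: the date pattern combines the month and day rules with a four-digit year and does not check days per month, and the Chinese-character pattern uses + instead of {0,} so that an empty string is rejected.

```python
import re

# Four-digit year, month 1-12, day 1-31, each with an optional leading zero.
DATE_RE = re.compile(r'\b\d{4}-(?:0?[1-9]|1[0-2])-(?:0?[1-9]|[12]\d|3[01])\b')
# QQ numbers start at 10000: a non-zero digit followed by at least four digits.
QQ_RE = re.compile(r'^[1-9][0-9]{4,}$')
# Optional sign, digits, optional fractional part (also accepts plain integers).
FLOAT_RE = re.compile(r'^-?\d+(\.\d+)?$')
# One or more characters from the common CJK range used in the article.
HAN_RE = re.compile(r'^[\u4e00-\u9fa5]+$')

print(DATE_RE.findall('a 1990-07-12 b 2020-13-01 c'))  # ['1990-07-12']
print(bool(QQ_RE.match('10000')), bool(QQ_RE.match('9999')))    # True False
print(bool(FLOAT_RE.match('-40.35')), bool(FLOAT_RE.match('4.')))  # True False
print(bool(HAN_RE.match('中文')), bool(HAN_RE.match('abc')))      # True False
```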

4. Crawler exercise

import requests

import re
import json

def getPage(url):
    # Download the page and return its HTML as text.
    response = requests.get(url)
    return response.text


def parsePage(s):
    # re.S lets '.' match newlines, so one pattern can span a whole movie entry.
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)

    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)  # a generator of dicts, one per movie

    # Append one JSON object per line; the with block closes the file.
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = json.dumps(obj, ensure_ascii=False)
            f.write(data + "\n")


if __name__ == '__main__':
    # The list is split into 10 pages of 25 entries each.
    for start in range(0, 250, 25):
        main(start)

Simplified version (standard library only):

import re
import json
from urllib.request import urlopen

def getPage(url):
    response = urlopen(url)
    return response.read().decode('utf-8')

def parsePage(s):
    # re.S lets '.' match newlines, so one pattern can span a whole movie entry.
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)

    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)  # a generator of dicts, one per movie

    # Append one line per movie; the with block closes the file.
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = str(obj)  # writes the dict's repr, not JSON
            f.write(data + "\n")


# The list is split into 10 pages of 25 entries each.
for start in range(0, 250, 25):
    main(start)
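The extraction pattern can be tried without any network access by running it against a small hand-written fragment. The HTML below is a made-up sample shaped to fit the pattern; it is an assumption for illustration and not a copy of the real page.

```python
import re

ITEM_RE = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?'
    r'<span class="title">(?P<title>.*?)</span>.*?'
    r'<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?'
    r'<span>(?P<comment_num>.*?)评价</span>',
    re.S,  # let '.' cross line breaks inside one entry
)

# Hypothetical markup with two entries.
SAMPLE_HTML = '''
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <span class="title">Sample Movie</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>12345人评价</span>
</div>
<div class="item">
  <div class="pic"><em class="">2</em></div>
  <span class="title">Another Movie</span>
  <span class="rating_num" property="v:average">9.6</span>
  <span>678人评价</span>
</div>
'''

# groupdict() returns all named groups of one match as a dict.
movies = [m.groupdict() for m in ITEM_RE.finditer(SAMPLE_HTML)]
for movie in movies:
    print(movie)
```

The lazy quantifier `.*?` is what keeps each match inside a single entry; a greedy `.*` combined with re.S would run to the last possible closing tag in the document.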

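The flags argument accepts several values:

re.I (IGNORECASE): ignore case; the name in parentheses is the full spelling
re.M (MULTILINE): multi-line mode; ^ and $ also match at the start and end of each line
re.S (DOTALL): the dot matches any character, including newline
re.L (LOCALE): locale-aware matching; \w, \W, \b, \B and case-insensitive matching depend on the current locale; not recommended
re.U (UNICODE): \w, \W, \s, \S, \d, \D follow Unicode character properties; this is the default for str patterns in Python 3
re.X (VERBOSE): verbose mode; the pattern may span multiple lines, whitespace is ignored, and comments are allowed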
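The effect of the most common flags can be seen in a short experiment. The sample text is arbitrary, and the verbose pattern restates the phone-number rule used earlier in the article.

```python
import re

text = 'Hello\nhello world'

# re.I: case-insensitive matching.
ignore_case = re.findall(r'hello', text, re.I)
# Without re.M, ^ only matches at the very start of the string.
start_only = re.findall(r'^hello', text, re.I)
# With re.M, ^ also matches right after each newline.
every_line = re.findall(r'^hello', text, re.I | re.M)
# Without re.S the dot does not match '\n'; with re.S it does.
no_dotall = re.findall(r'Hello.hello', text)
dotall = re.findall(r'Hello.hello', text, re.S)

# re.X: whitespace in the pattern is ignored and # starts a comment.
PHONE_RE = re.compile(r'''
    ^(13|14|15|18)   # allowed two-digit prefixes
    [0-9]{9}$        # followed by exactly nine more digits
''', re.X)

print(ignore_case)  # ['Hello', 'hello']
print(start_only)   # ['Hello']
print(every_line)   # ['Hello', 'hello']
print(no_dotall)    # []
print(dotall)       # ['Hello\nhello']
print(bool(PHONE_RE.match('13145697445')))  # True
```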

Homework

Implement a calculator program that can evaluate expressions such as
1 - 2 * ( (60-30 +(-40/5) * (9-2*5/3 + 7 /3*99/4*2998 +10 * 568/14 )) - (-4*3)/ (16-3*2) )

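[Notice] This content comes from a blogger in the Huawei Cloud developer community and does not represent the views or position of Huawei Cloud or the Huawei Cloud developer community. Reposts must state the source (Huawei Cloud community), the article link, the author, and other basic information; otherwise the author and the community reserve the right to pursue liability. If you find content in this community that appears to be plagiarized, please report it by email with supporting evidence. Once verified, the community will remove the infringing content immediately. Report email: cloudbbs@huaweicloud.com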
