[Huawei Cloud Online Course][Python Web Crawler][Crawler Tools Lab][3][Study Notes]

John2021, posted 2022/06/15 06:54:52


This lab introduces the Python web-crawling tools urllib and requests, and the basics of the automated-testing tool selenium.

1. Basic use of urllib

1.1. Sending a request with urlopen

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

Output: the HTML source of the Baidu homepage, decoded as UTF-8.

1.2. Checking the HTTP status code

print(response.getcode())  # 200
print(response.status)     # 200

1.3. Setting a timeout with the timeout parameter

The timeout parameter sets how long (in seconds) to wait for a response before the connection is treated as timed out.

response = urllib.request.urlopen('http://www.baidu.com', timeout=0.001)  # raises a timeout error

1.4. Building a request with Request

The urlopen() method can issue a basic request, but its few parameters are not enough to build a complete one. To add headers and other information, use the more powerful Request class to construct the request:
Open a browser, press F12 to open the developer tools, load the Baidu homepage, and click any of the loaded resources to view its header fields.

import urllib.request

# Build the headers
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
# Build the request
request = urllib.request.Request('http://www.baidu.com', headers=headers)
# Send the request
response = urllib.request.urlopen(request)
print(response.read())

Output:

The result is the page's source code, as bytes.

1.5. Exception handling

urllib's error module defines the exceptions raised by the request module. The two common ones are URLError and HTTPError.

URLError

from urllib import request, error

try:
    response = request.urlopen('http://www.baidu.com', timeout=0.001)
except error.URLError as e:
    print(e)

Output:

<urlopen error timed out>

HTTPError is a subclass of URLError, so catching error.URLError catches both. Here the unresolvable hostname raises a plain URLError:

from urllib import request, error

try:
    response = request.urlopen('http://www.huawei_cloud.c/hwc.htm')
except error.URLError as e:
    print(e.reason)
    print(e)

Output:

[Errno 11001] getaddrinfo failed
<urlopen error [Errno 11001] getaddrinfo failed>
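
The request above fails at DNS resolution, so only a URLError is raised. To actually see an HTTPError, which is raised when the server does respond but with an error status, request a page that returns 404. A minimal sketch, assuming http://httpbin.org/status/404 is reachable:

from urllib import request, error

try:
    response = request.urlopen('http://httpbin.org/status/404')
except error.HTTPError as e:
    # HTTPError carries the HTTP status code and reason
    print(e.code)    # 404
    print(e.reason)  # e.g. NOT FOUND
except error.URLError as e:
    # network-level failures (DNS, timeout, ...)
    print(e.reason)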

1.6. URL parsing and joining

Parse a URL with the parse module:

import urllib.parse

url = 'https://www.baidu.com/s?id=utf-8&wd=Python'
response = urllib.parse.urlparse(url=url)
print(response)

Output:

ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='id=utf-8&wd=Python', fragment='')
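
The ParseResult is a named tuple, so its components can be read by attribute, and the query string can be split into a dict with parse_qs. A small sketch:

from urllib.parse import urlparse, parse_qs

result = urlparse('https://www.baidu.com/s?id=utf-8&wd=Python')
print(result.scheme)           # https
print(result.netloc)           # www.baidu.com
print(parse_qs(result.query))  # {'id': ['utf-8'], 'wd': ['Python']}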

Joining URLs:

from urllib import parse

base_url = 'https://www.baidu.com/s?ie=UTF-8&wd=Python'
sub_url = '/info'
url = 'https://python.org'
print(parse.urljoin(base_url, sub_url))  # https://www.baidu.com/info
print(parse.urljoin(base_url, url))      # https://python.org

Encode the characters in a URL with the urlencode method:

from urllib import parse

params = {
    'wd': '中文',
    'ky': 'python996'
}
params_str = parse.urlencode(params)
print(params_str)  # wd=%E4%B8%AD%E6%96%87&ky=python996

Encode the Chinese characters in a URL with the quote method:

from urllib import parse

url = 'https://www.baidu.com/s?ie=UTF-8&wd='
str_ = parse.quote('汉语')
url = url + str_
print(url)  # https://www.baidu.com/s?ie=UTF-8&wd=%E6%B1%89%E8%AF%AD
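
quote has an inverse, unquote, which restores the percent-encoded characters. A quick check:

from urllib import parse

encoded = parse.quote('汉语')
print(encoded)                 # %E6%B1%89%E8%AF%AD
print(parse.unquote(encoded))  # 汉语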

1.7. urllib in practice: fetching image data

Use urllib to fetch image data. Target URL: https://tieba.baidu.com/p/7308118680
We want all the wallpaper images on the page. Right-click an image in the browser, choose Inspect, and look at the page source.

Compare the URLs of a few randomly chosen images:

http://tiebapic.baidu.com/forum/w%3D580/sign=e65b16860d4c510faec4e21250582528/4093d0628535e5dd860e6d9161c6a7efce1b622b.jpg

http://tiebapic.baidu.com/forum/w%3D580/sign=15c25c3aba345982c58ae59a3cf5310b/bc8f9982d158ccbfee9e7d740ed8bc3eb135412b.jpg


The image URLs are very similar: the first half is identical, http://tiebapic.baidu.com/forum/w%3D580/sign=; the second half is made of letters, digits, and slashes, and all of them are .jpg files, so we can write a regular expression for them.

.*? matches any characters, as few as possible (non-greedy).
re_str = http://tiebapic.baidu.com/forum/w%3D580/sign=.*?/.*?\.jpg
import urllib.request
import re

URL = 'https://tieba.baidu.com/p/7308118680'
# Build the headers
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
# Build the request
request = urllib.request.Request(URL, headers=headers)
# Send the request
response = urllib.request.urlopen(request)
html = response.read().decode()
re_str = r'http://tiebapic.baidu.com/forum/w%3D580/sign=.*?/.*?\.jpg'
img_url = re.findall(re_str, html)
print(img_url)

Output:

['http://tiebapic.baidu.com/forum/w%3D580/sign=e65b16860d4c510faec4e21250582528/4093d0628535e5dd860e6d9161c6a7efce1b622b.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=15c25c3aba345982c58ae59a3cf5310b/bc8f9982d158ccbfee9e7d740ed8bc3eb135412b.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=21b0d3a3b4cc7cd9fa2d34d109002104/a94fadd3fd1f41341a15cd30321f95cad1c85e2b.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=6e3f55adea1f4134e0370576151e95c1/a21ad41373f08202928e59715cfbfbedab641b35.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=13cdc01342fbb2fb342b581a7f4b2043/56d3b7b7d0a20cf47b8e369161094b36acaf9937.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=404a20ba46da81cb4ee683c56267d0a4/8485b61c8701a18b6af2da01892f07082838fe31.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=ee45c9d649df8db1bc2e7c6c3922dddb/a23fd8c451da81cb8bfe4b834566d01609243131.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=2cea81a79a1001e94e3c1407880f7b06/f24e4434970a304eaa3968b4c6c8a786c9175c32.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=796bac71cef9d72a17641015e42b282a/8a64e0dde71190ef15e15a91d91b9d16fdfa6033.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=bcd12a547759252da3171d0c049a032c/e84095ef76c6a7efab336068eafaaf51f3de663c.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=3898366d5e540923aa696376a259d1dc/8e9a3cdbb6fd5266653ff45abc18972bd407363d.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=b176ed464443fbf2c52ca62b807fca1e/533131a85edf8db1bc39b42a1e23dd54564e743e.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=940d328209178a82ce3c7fa8c602737f/2ff5432309f790528e098f701bf3d7ca7bcbd53f.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=33a329419094a4c20a23e7233ef51bac/bc4ab551f8198618d35127ca5ded2e738bd4e638.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=4fd2645bab315c6043956be7bdb0cbe6/dd4a0bf41bd5ad6e72e8aca496cb39dbb6fd3c38.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=35222487c02a60595210e1121835342d/ea8ce21190ef76c6c3c233658a16fdfaaf516739.jpg', 'http://tiebapic.baidu.com/forum/w%3D580/sign=5ef70fad402c11dfded1bf2b53266255/c90d15385343fbf28e3f6e6da77eca8065388f3a.jpg']

Save the images using the URLs we collected:

# Use the last segment of each URL as the file name
for i in img_url:
    filename = i.split("/")[-1]
    # Save locally; urlretrieve copies the network object at the URL to a local file
    urllib.request.urlretrieve(i, filename, None)
    print('Image %s saved' % filename)

Output:

Image 4093d0628535e5dd860e6d9161c6a7efce1b622b.jpg saved
Image bc8f9982d158ccbfee9e7d740ed8bc3eb135412b.jpg saved
Image a94fadd3fd1f41341a15cd30321f95cad1c85e2b.jpg saved
Image a21ad41373f08202928e59715cfbfbedab641b35.jpg saved
Image 56d3b7b7d0a20cf47b8e369161094b36acaf9937.jpg saved
Image 8485b61c8701a18b6af2da01892f07082838fe31.jpg saved
Image a23fd8c451da81cb8bfe4b834566d01609243131.jpg saved
Image f24e4434970a304eaa3968b4c6c8a786c9175c32.jpg saved
Image 8a64e0dde71190ef15e15a91d91b9d16fdfa6033.jpg saved
Image e84095ef76c6a7efab336068eafaaf51f3de663c.jpg saved
Image 8e9a3cdbb6fd5266653ff45abc18972bd407363d.jpg saved
Image 533131a85edf8db1bc39b42a1e23dd54564e743e.jpg saved
Image 2ff5432309f790528e098f701bf3d7ca7bcbd53f.jpg saved
Image bc4ab551f8198618d35127ca5ded2e738bd4e638.jpg saved
Image dd4a0bf41bd5ad6e72e8aca496cb39dbb6fd3c38.jpg saved
Image ea8ce21190ef76c6c3c233658a16fdfaaf516739.jpg saved
Image c90d15385343fbf28e3f6e6da77eca8065388f3a.jpg saved

2. Basic use of the requests module

2.1. Sending a GET request with requests

import requests

# Target url
url = 'https://www.baidu.com'
# Send a GET request to the target url
response = requests.get(url)
print(response.status_code)  # 200
# text decodes with the encoding requests guesses, which garbles the Chinese here
print(response.text)

Output:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å
³äºŽç™¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前å¿
读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

The content attribute returns the response body as raw bytes, and decode() decodes them; viewed this way, the text is not garbled.

print(response.content.decode())

Output:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
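
Alternatively, you can tell requests which encoding to use before reading the text: setting response.encoding makes response.text decode correctly as well. A small sketch:

import requests

response = requests.get('https://www.baidu.com')
# Override the encoding guessed from the response headers before using .text
response.encoding = 'utf-8'
print(response.text)  # the same correctly decoded HTML as content.decode()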

View the response headers:

print(response.headers)

Output:

{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Fri, 03 Jun 2022 13:33:03 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:24:18 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

The request headers carry the browser information sent with the request to the target host; you can compare them with the request headers of the corresponding resource in the browser's developer tools.

2.2. Sending a complete request

Build the headers dict; the header values can be found in the browser's developer tools.

import requests

# Build the headers dict
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.30"}
url = 'https://www.baidu.com'
response = requests.get(url, headers=headers)
print(response.content.decode())

Output:

<!DOCTYPE html><!--STATUS OK-->


    <html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="referrer"><meta name="theme-color" content="#ffffff"><meta name="description" content="全球领先的中文搜索引擎、致力于让网民更便捷地获取信息,找到所求。百度超过千亿的中文网页数据库,可以瞬间找到相关的搜索结果。"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" /><link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg"><link rel="dns-prefetch" href="//dss0.bdstatic.com"/><link rel="dns-prefetch" href="//dss1.bdstatic.com"/><link rel="dns-prefetch" href="//ss1.bdstatic.com"/><link rel="dns-prefetch" href="//sp0.baidu.com"/><link rel="dns-prefetch" href="//sp1.baidu.com"/><link rel="dns-prefetch" href="//sp2.baidu.com"/><title>百度一下,你就知道</title><style index="newi" type="text/css">#form .bdsug{top:39px}.bdsug{display:none;position:absolute;width:535px;background:#fff;border:1px solid #ccc!important;_overflow:hidden;box-shadow:1px 1px 3px #ededed;-webkit-box-shadow:1px 1px 3px #ededed;-moz-box-shadow:1px 1px 3px #ededed;-o-box-sh
    ......

With the headers added, the returned page is different: the request now looks like a normal browser visit to the site rather than a crawler.

2.3. Adding request parameters

Pass parameters when sending a GET request:

import requests

# Build the headers dict
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.30"}
url = 'https://www.baidu.com'
kw = {'wd': 'python'}
response = requests.get(url, params=kw, headers=headers)
print(response.url)

Output:

https://www.baidu.com/?wd=python

2.4. Parameters of the get request method

timeout, the timeout parameter:

import requests

response = requests.get("http://www.baidu.com", timeout=0.01)
print(response)

Output:

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.baidu.com', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x0000020623E5A550>, 'Connection to www.baidu.com timed out. (connect timeout=0.01)'))
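
In a real crawler the timeout is usually caught instead of being allowed to crash the program. A minimal sketch using requests' own exception class:

import requests

try:
    response = requests.get('http://www.baidu.com', timeout=0.01)
    print(response.status_code)
except requests.exceptions.Timeout:
    # covers both connect and read timeouts
    print('request timed out')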

Adding a proxy:

import requests

proxies = {
    'http': 'http://12.34.56.78:9527'
}
response = requests.get('https://www.baidu.com', proxies=proxies)

2.5. POST requests

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('http://httpbin.org/post', data=payload)
print(response.text)

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.27.1", 
    "X-Amzn-Trace-Id": "Root=1-62a5f4bf-281e17cb6d4d4e353df61d86"
  }, 
  "json": null, 
  "origin": "12.34.56.78", 
  "url": "http://httpbin.org/post"
}
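
httpbin echoes the form data back as JSON, so the response can also be parsed directly with requests' built-in json() method instead of reading the raw text:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('http://httpbin.org/post', data=payload)
# json() parses the response body into a dict
print(response.json()['form'])  # {'key1': 'value1', 'key2': 'value2'}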

2.6. requests in practice: scraping movie data

Open https://movie.douban.com/explore and scrape the movie data on the page.

import requests

url = 'https://movie.douban.com/explore'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.124 Safari/537.36 Edg/102.0.1245.41"}
html = requests.get(url, headers=headers).content.decode()
print(html)

Output:

<!DOCTYPE html>
<html lang="zh-CN" class="ua-windows ua-webkit">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="renderer" content="webkit">
    <meta name="referrer" content="always">
    <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
    <title>
    选电影
</title>
    ......
    <script>_SPLITTEST=''</script>
</body>
</html>

The output is the decoded text of the page, and it does not contain the data we want. Right-click the page, choose Inspect, and compare the rendered page with the source.

The page source (what the crawler fetches) differs from the fully loaded page because the page's data is loaded dynamically. To handle this we analyze how the page loads its data: open the Network tab of the developer tools and select XHR to view the dynamically loaded items.

Among these items we find the data we are looking for.

After finding the data, click Headers to view the details of the request.

The URL for this data item is: https://movie.douban.com/j/search_subjects?type=movie&tag=热门&sort=recommend&page_limit=20&page_start=0
Next we request this URL with the requests module.

import requests

url = "https://movie.douban.com/j/search_subjects?type=movie&tag=热门&sort=recommend&page_limit=20&page_start=0"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.124 Safari/537.36 Edg/102.0.1245.41"}
html = requests.get(url, headers=headers).content.decode()
print(html)

Output:

{"subjects":[{"episodes_info":"","rate":"7.0","cover_x":750,"title":"天才不能承受之重","url":"https:\/\/movie.douban.com\/subject\/34890458\/","playable":false,"cover":"https://img2.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2869531882.jpg","id":"34890458","cover_y":1111,"is_new":false},{"episodes_info":"","rate":"7.0","cover_x":992,"title":"目中无人","url":"https:\/\/movie.douban.com\/subject\/35295405\/","playable":true,"cover":"https://img1.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2873818227.jpg","id":"35295405","cover_y":1389,"is_new":false},{"episodes_info":"","rate":"6.2","cover_x":1080,"title":"爱在托斯卡纳","url":"https:\/\/movie.douban.com\/subject\/35342564\/","playable":false,"cover":"https://img9.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2873227855.jpg","id":"35342564","cover_y":1350,"is_new":true}]}

This time the response contains the data we need, but in this form it is awkward to read and process. This format is JSON: it looks like a dict, but it cannot be used as one directly; it has to be converted with Python's json module.

import requests
import json

url = "https://movie.douban.com/j/search_subjects?type=movie&tag=热门&sort=recommend&page_limit=20&page_start=0"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.124 Safari/537.36 Edg/102.0.1245.41"}
html = requests.get(url, headers=headers).content.decode()
# print(html)
# Convert the JSON text into a dict with loads
result = json.loads(html)
print(type(result))
print(result)

Output:

<class 'dict'>
{
    "subjects": [
        {
            "episodes_info": "",
            "rate": "7.0",
            "cover_x": 750,
            "title": "天才不能承受之重",
            "url": "https:\/\/movie.douban.com\/subject\/34890458\/",
            "playable": false,
            "cover": "https://img2.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2869531882.jpg",
            "id": "34890458",
            "cover_y": 1111,
            "is_new": false
        },
        {
            "episodes_info": "",
            "rate": "7.0",
            "cover_x": 992,
            "title": "目中无人",
            "url": "https:\/\/movie.douban.com\/subject\/35295405\/",
            "playable": true,
            "cover": "https://img1.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2873818227.jpg",
            "id": "35295405",
            "cover_y": 1389,
            "is_new": false
        },
        {
            "episodes_info": "",
            "rate": "6.2",
            "cover_x": 1080,
            "title": "爱在托斯卡纳",
            "url": "https:\/\/movie.douban.com\/subject\/35342564\/",
            "playable": false,
            "cover": "https://img9.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2873227855.jpg",
            "id": "35342564",
            "cover_y": 1350,
            "is_new": true
        }
    ]
}
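
With the data converted to a dict, the fields we care about can be pulled out directly. For example, listing each movie's title, rating, and URL (field names as shown in the output above); requests' json() method is used here as a shortcut for json.loads:

import requests

url = "https://movie.douban.com/j/search_subjects?type=movie&tag=热门&sort=recommend&page_limit=20&page_start=0"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.124 Safari/537.36 Edg/102.0.1245.41"}
result = requests.get(url, headers=headers).json()
# Print the fields we care about for each movie
for movie in result['subjects']:
    print(movie['title'], movie['rate'], movie['url'])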

2.7. requests in practice: fetching translation results with a POST request

Implement a Baidu Translate crawler with a POST request. Target URL: https://fanyi.baidu.com/sug
Translation works by sending a request to this URL and reading the corresponding response. To reduce interference, we emulate mobile access in the browser.

While typing on the page you can see sug requests popping up continuously; the request method is POST, the target URL for translation is https://fanyi.baidu.com/sug, and the parameter is kw.

import requests

url = "https://fanyi.baidu.com/sug"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.124 Safari/537.36 Edg/102.0.1245.41"}
data = {}
data["kw"] = input("Enter the Chinese text to translate: ")
response = requests.post(url, data=data, headers=headers)
print(response.content.decode())

Output:

Enter the Chinese text to translate: 你好
{"errno":0,"data":[{"k":"\u4f60\u597d","v":"hello; hi; How do you do!"},{"k":"\u4f60\u597d\u5417","v":"How do you do?"},{"k":"\u4f60\u597d\uff0c\u964c\u751f\u4eba","v":"[\u7535\u5f71]Hello Stranger"}]}

The result is JSON data, which can be parsed with the json module.

import requests
import json

url = "https://fanyi.baidu.com/sug"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.124 Safari/537.36 Edg/102.0.1245.41"}
data = {}
data["kw"] = input("Enter the Chinese text to translate: ")
response = requests.post(url, data=data, headers=headers)
print(response.content.decode())
print("------------------------------")
json_data = json.loads(response.content.decode())
print(json_data)

Output:

Enter the Chinese text to translate: 你好
{"errno":0,"data":[{"k":"\u4f60\u597d","v":"hello; hi; How do you do!"},{"k":"\u4f60\u597d\u5417","v":"How do you do?"},{"k":"\u4f60\u597d\uff0c\u964c\u751f\u4eba","v":"[\u7535\u5f71]Hello Stranger"}]}
------------------------------
{'errno': 0, 'data': [{'k': '你好', 'v': 'hello; hi; How do you do!'}, {'k': '你好吗', 'v': 'How do you do?'}, {'k': '你好,陌生人', 'v': '[电影]Hello Stranger'}]}
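
The parsed dict also makes it easy to print just the suggested translations. A short sketch based on the structure of the data field shown above:

import requests

url = "https://fanyi.baidu.com/sug"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.124 Safari/537.36 Edg/102.0.1245.41"}
json_data = requests.post(url, data={"kw": "你好"}, headers=headers).json()
# Each item pairs a source phrase 'k' with its translation 'v'
for item in json_data['data']:
    print(item['k'], '->', item['v'])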

3. Basic use of selenium

This lab requires an environment with a Chromium-based browser (Chrome or Edge); the steps below use the Edge browser.

3.1. Getting the WebDriver (msedgedriver)

Go to https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/ and download the msedgedriver that matches your browser version.

After downloading, extract it into the Python installation directory (or add its folder to the PATH environment variable).

3.2. Basic use of selenium

from selenium import webdriver

driver = webdriver.Edge()
driver.get('http://www.baidu.com')
driver.save_screenshot('test_baidu.png')

The captured screenshot shows the Baidu homepage.

View the request information:

driver.page_source     # HTML source of the current page
driver.get_cookies()   # cookies in the current session
driver.current_url     # URL of the current page

Exit:

driver.close()  # close the current window
driver.quit()   # quit the browser entirely
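
On a machine without a display, the browser can also be run headless. A sketch assuming Selenium 4's EdgeOptions API (option names vary across Selenium versions):

from selenium import webdriver

options = webdriver.EdgeOptions()
options.add_argument('--headless')  # run without opening a browser window
driver = webdriver.Edge(options=options)
driver.get('http://www.baidu.com')
print(driver.title)
driver.quit()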

3.3. Locating elements with selenium

Example 1:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Edge()
driver.get('http://www.douban.com')
ret1 = driver.find_element(by=By.ID, value='anony-nav')
print(ret1)

Output:

<selenium.webdriver.remote.webelement.WebElement (session="b2c4483b978951da3db7d2836c77d513", element="f0b02cf1-9eaa-4d96-9a05-a4d8d6349185")>

Example 2:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Edge()
driver.get('http://www.douban.com')
ret2 = driver.find_elements(by=By.ID, value='anony-nav')
print(ret2)

Output:

[<selenium.webdriver.remote.webelement.WebElement (session="68d87fdbc41cbc2bdb4a6d16212d8b72", element="0ce91363-1dc1-45b4-b9b9-614cac3f35a6")>]

Example 3:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Edge()
driver.get('http://www.douban.com')
ret3 = driver.find_elements(by=By.XPATH, value="//*[@id='anony-nav']/h1/a")
driver.quit()
print(len(ret3))

Output:

0

Example 4:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Edge()
driver.get('http://www.douban.com')
ret4 = driver.find_elements(by=By.TAG_NAME, value="h1")
driver.quit()
print(len(ret4))

Output:

1

Example 5:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Edge()
driver.get('http://www.douban.com')
ret5 = driver.find_elements(by=By.LINK_TEXT, value="下载豆瓣 App")
driver.quit()
print(len(ret5))

Output:

1

Example 6:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Edge()
driver.get('http://www.douban.com')
ret6 = driver.find_elements(by=By.PARTIAL_LINK_TEXT, value="豆瓣")
driver.quit()
print(len(ret6))

Output:

28

Example 7:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Edge()
driver.get('http://www.douban.com')
ret7 = driver.find_elements(by=By.TAG_NAME, value="h1")
driver.quit()
print(ret7[0].text)

Output:

豆瓣

Example 8:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Edge()
driver.get('http://www.douban.com')
ret8 = driver.find_elements(by=By.LINK_TEXT, value="下载豆瓣 App")
print(ret8[0].get_attribute('href'))
driver.quit()

Output:

https://www.douban.com/doubanapp/app?channel=nimingye
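
Besides the locators used above, By also supports CSS selectors. A brief sketch locating the same element as Example 1 by its id, this time through a CSS selector:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Edge()
driver.get('http://www.douban.com')
# '#anony-nav' is a CSS id selector for the element used in Example 1
ret9 = driver.find_elements(by=By.CSS_SELECTOR, value='#anony-nav')
print(len(ret9))
driver.quit()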