Python crawlers: fixing Unicode-escaped output from Scrapy on Python 2

Posted by 彭世瑜 on 2021/08/14 00:51:08


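Python 2 is notoriously unfriendly to Chinese text, but legacy projects still depend on it.

There is no way around it, so the problem has to be solved.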

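My usual workflow for writing a Scrapy crawler:

1. Create the spider file and class.
2. Write the parse function, run a test crawl, and print the useful fields to the console.
3. Create the table in the database.
4. Write the item.
5. Write the model (used together with the pipeline to write items to the database).
6. Write the pipeline.
7. Run the crawler project and check that the saved data is correct.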

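During the test crawl in step 2 I have not created the database yet (creating tables is tedious and involves many design decisions), so the data cannot be saved there, and printing straight to the console does not give a good overall view of the data.

One solution is to use Scrapy's built-in feed exports to write the scraped data to a JSON or CSV file: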

$ scrapy crawl spider_name -o person.json

Well, this is Python 2... the scraped data looks roughly like this:

[
{"name": "\u5f20\u4e39"},
{"name": "\u77bf\u6653\u94e7"},
{"name": "\u95eb\u5927\u9e4f"},
{"name": "\u9c8d\u6d77\u660e"},
{"name": "\u9648\u53cb\u658c"},
{"name": "\u9648\u5efa\u5cf0"}
]

So English would be readable, but the Chinese is not. That is simply unacceptable.
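It may help to note that the `\uXXXX` escapes are still valid JSON, so no data is lost; the file is just hard for a human to read. The following is a minimal sketch (written for Python 3, with two sample records copied from the output above) showing that a JSON parser recovers the original characters:

```python
import json

# Raw string so the backslashes reach the JSON parser unchanged.
raw = r'[{"name": "\u5f20\u4e39"}, {"name": "\u9648\u5efa\u5cf0"}]'

# json.loads decodes each \uXXXX escape back into the character it stands for.
items = json.loads(raw)
names = [item["name"] for item in items]
```

The goal of the rest of the article is therefore cosmetic: make the exported file itself contain the readable characters.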

So let's do something about it.

1. Find Scrapy's default settings file

# scrapy.settings.default_settings

FEED_EXPORTERS_BASE = {
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}

2. Note the export class registered for json and follow its path to the class

# scrapy.exporters.JsonItemExporter

class JsonItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write(b"[\n")

    def finish_exporting(self):
        self.file.write(b"\n]")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(to_bytes(self.encoder.encode(itemdict)))

Look at the last statement: it encodes the item and writes the result to the file. This is where we make our change.

3. Override JsonItemExporter
Method 1:

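import json

from scrapy.exporters import JsonItemExporter
from scrapy.utils.python import to_bytes


class MyJsonItemExporter(JsonItemExporter):

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        itemdict = dict(self._get_serialized_fields(item))
        # ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes;
        # to_bytes encodes a unicode result as UTF-8 before it is written
        self.file.write(to_bytes(json.dumps(itemdict, ensure_ascii=False)))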

Subclass the original JsonItemExporter and change only the final write. This approach is direct and simple.
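The effect of the `ensure_ascii` flag can be seen with the standard `json` module alone. This is a small Python 3 sketch with a made-up sample item, not Scrapy code:

```python
import json

item = {"name": "张丹"}

# Default (ensure_ascii=True): every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(item)

# ensure_ascii=False: characters are emitted as-is.
readable = json.dumps(item, ensure_ascii=False)

# The exporter writes to a binary file, so the text is encoded before writing.
payload = readable.encode("utf-8")
```

Both strings parse back to the same dictionary; only the on-disk representation differs.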

Method 2:
Notice that the JsonItemExporter initializer sets this attribute:

self.encoder = ScrapyJSONEncoder(**kwargs)

It is also used later when items are written. Following the trail leads to the two classes below (some code omitted):

class ScrapyJSONEncoder(json.JSONEncoder):
    pass

class JSONEncoder(object):
    def __init__(self, skipkeys=False, ensure_ascii=True,
                 check_circular=True, allow_nan=True, sort_keys=False,
                 indent=None, separators=None, encoding='utf-8', default=None):
        # body omitted

So we can also rewrite the exporter like this:

from scrapy.exporters import JsonItemExporter


class MyJsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        # ensure_ascii is passed through **kwargs to ScrapyJSONEncoder
        super(MyJsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs
        )

The only change is the added ensure_ascii=False, which looks a good deal more elegant.
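To see the keyword-forwarding idea in isolation, here is a self-contained Python 3 sketch. `MiniJsonExporter` and `ReadableJsonExporter` are hypothetical stand-ins written for this illustration; they imitate the structure of the exporter shown earlier but are not Scrapy's classes, and `export_all` is a helper invented for the example.

```python
import io
import json


class MiniJsonExporter(object):
    """Hypothetical stand-in that mirrors the shape of a JSON item exporter."""

    def __init__(self, file, **kwargs):
        self.file = file
        # Extra keyword arguments go straight to the JSON encoder.
        self.encoder = json.JSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write(b"[\n")

    def finish_exporting(self):
        self.file.write(b"\n]")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b",\n")
        self.file.write(self.encoder.encode(dict(item)).encode("utf-8"))


class ReadableJsonExporter(MiniJsonExporter):
    """Same exporter, but the encoder is told not to escape non-ASCII text."""

    def __init__(self, file, **kwargs):
        # setdefault avoids a duplicate-keyword error if the caller passes it too.
        kwargs.setdefault("ensure_ascii", False)
        super(ReadableJsonExporter, self).__init__(file, **kwargs)


def export_all(exporter_cls, items):
    """Run a full export into memory and return the result as text."""
    buf = io.BytesIO()
    exporter = exporter_cls(buf)
    exporter.start_exporting()
    for item in items:
        exporter.export_item(item)
    exporter.finish_exporting()
    return buf.getvalue().decode("utf-8")
```

Using `setdefault` rather than a hard-coded keyword is a small design choice made here so that a caller can still override the flag.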

4. Use MyJsonItemExporter
It can be set on a single spider or in the global settings:

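custom_settings = {
    "FEED_EXPORTERS_BASE": {
        # the value must be the full dotted import path of the class;
        # "myproject.exporters" is a placeholder for the module that defines it
        "json": "myproject.exporters.MyJsonItemExporter"
    }
}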
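As a possible alternative that needs no custom exporter: to my knowledge, Scrapy 1.2 and later provide a `FEED_EXPORT_ENCODING` setting with the same effect on JSON output. The fragment below assumes such a version and is not part of the approach described in this article.

```python
# settings.py (or a spider's custom_settings), assuming Scrapy 1.2 or later.
# With an explicit encoding, the JSON exporter writes real characters
# instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = "utf-8"
```

On older versions where this setting is not available, the custom exporter above remains the way to go.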

Run the spider again; this time the Chinese is readable:

[
{"name": "张丹"},
{"name": "闫大鹏"},
{"name": "瞿晓铧"},
{"name": "鲍海明"},
{"name": "陈友斌"},
{"name": "陈建峰"}
]
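One consequence worth keeping in mind is that the file now contains raw UTF-8 bytes, so whatever reads it should open it with an explicit encoding. A minimal Python 3 sketch, using a temporary file and a hypothetical helper named `write_and_reload`:

```python
import io
import json
import os
import tempfile


def write_and_reload(items):
    """Write items as readable UTF-8 JSON, then read them back."""
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        # Write the readable form, encoded as UTF-8.
        with io.open(path, "w", encoding="utf-8") as f:
            f.write(json.dumps(items, ensure_ascii=False))
        # Read it back with the same explicit encoding.
        with io.open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    finally:
        os.remove(path)
```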


Source: pengshiyu.blog.csdn.net, author: 彭世瑜. Copyright belongs to the original author; please contact the author before reposting.

Original link: pengshiyu.blog.csdn.net/article/details/82020254
