Using spaCy

Posted by 毛利 on 2021/07/15 07:50:57

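Official documentation:
https://spacy.io/usage

spaCy is a Python natural language processing toolkit, created in mid-2014, that bills itself as "Industrial-Strength Natural Language Processing in Python". It uses Cython extensively to speed up its core modules, which sets it apart from the more academically oriented NLTK and gives it practical value for production use.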

Loading a model

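# Import the toolkit and load the English model.
# Download the model first (on Windows, open CMD as administrator):
#   python -m spacy download en_core_web_sm
# The original post used the shortcut name 'en', which only spaCy 2.x and earlier accept.

import spacy
nlp = spacy.load('en_core_web_sm')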

  
 
Text processing

doc = nlp('Weather is good, very windy and sunny. We have no classes in the afternoon.')
# Tokenization
for token in doc:
    print(token)
OUT:
Weather
is
good
,
very
windy
and
sunny
.
We
have
no
classes
in
the
afternoon
.
---------------------------------
# Sentence segmentation
for sent in doc.sents:
    print(sent)
OUT:
Weather is good, very windy and sunny.
We have no classes in the afternoon.
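
Tokenization and sentence splitting can also be tried without downloading a trained model. The sketch below is a variant under stated assumptions, not the setup used in this post: it assumes spaCy 3.x, builds a tokenizer-only pipeline with `spacy.blank("en")`, and adds the rule-based `sentencizer` component, which splits on punctuation instead of relying on the trained parser. The name `nlp_blank` is chosen here for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components (assumes spaCy 3.x).
nlp_blank = spacy.blank("en")
# Rule-based sentence splitter; the trained model loaded earlier uses the parser instead.
nlp_blank.add_pipe("sentencizer")

doc = nlp_blank("Weather is good, very windy and sunny. We have no classes in the afternoon.")

# Collect plain strings so the results are easy to inspect or compare.
tokens = [token.text for token in doc]
sentences = [sent.text for sent in doc.sents]

print(len(tokens))
for sentence in sentences:
    print(sentence)
```

A blank pipeline has no tagger or entity recognizer, so `token.pos_` and `doc.ents` stay empty; it is only a quick way to check how the text is split.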
  
 
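Part-of-speech tagging (tag reference: https://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/). The output shown here is from the original run; the exact tags can differ between model versions.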

for token in doc:
    print('{}-{}'.format(token, token.pos_))
OUT:
Weather-PROPN
is-VERB
good-ADJ
,-PUNCT
very-ADV
windy-ADJ
and-CCONJ
sunny-ADJ
.-PUNCT
We-PRON
have-VERB
no-DET
classes-NOUN
in-ADP
the-DET
afternoon-NOUN
.-PUNCT
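
Once tokens are tagged, a common next step is to tally how often each tag occurs. The following is a minimal sketch using only the standard library; `count_pos` is a helper name invented here, and the (token, tag) pairs are copied from the output above rather than produced by a live model.

```python
from collections import Counter

def count_pos(pairs):
    """Tally how often each part-of-speech tag occurs in (text, tag) pairs."""
    return Counter(tag for _text, tag in pairs)

# (token, tag) pairs copied from the output above. With a loaded model you
# could pass ((t.text, t.pos_) for t in doc) instead.
pairs = [
    ("Weather", "PROPN"), ("is", "VERB"), ("good", "ADJ"), (",", "PUNCT"),
    ("very", "ADV"), ("windy", "ADJ"), ("and", "CCONJ"), ("sunny", "ADJ"),
    (".", "PUNCT"), ("We", "PRON"), ("have", "VERB"), ("no", "DET"),
    ("classes", "NOUN"), ("in", "ADP"), ("the", "DET"), ("afternoon", "NOUN"),
    (".", "PUNCT"),
]

pos_counts = count_pos(pairs)
for tag, n in pos_counts.most_common():
    print(tag, n)
```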

  
 
Named entity recognition

doc_2 = nlp("I went to Paris where I met my old friend Jack from uni.")
for ent in doc_2.ents:
    print('{}-{}'.format(ent, ent.label_))
OUT:
Paris-GPE
Jack-PERSON
----
# Visualize the entities inline in a Jupyter notebook
from spacy import displacy
doc = nlp('I went to Paris where I met my old friend Jack from uni.')
displacy.render(doc, style='ent', jupyter=True)
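
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes spaCy 3.x and uses displaCy's manual mode, where the entity spans are supplied by hand (here matching the output above), so no trained model is needed. The helper `span` and the output file name `entities.html` are hypothetical.

```python
from spacy import displacy

text = "I went to Paris where I met my old friend Jack from uni."

def span(label, phrase):
    """Build a displaCy entity dict from the first occurrence of phrase in text."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

# Entities entered by hand, matching the model output shown above.
example = {
    "text": text,
    "ents": [span("GPE", "Paris"), span("PERSON", "Jack")],
    "title": None,
}

# jupyter=False makes render() return the HTML instead of displaying it;
# page=True wraps the markup in a full HTML page.
html = displacy.render(example, style="ent", manual=True, page=True, jupyter=False)

# Hypothetical output path; open the file in a browser to view the highlights.
with open("entities.html", "w", encoding="utf-8") as f:
    f.write(html)
```

With a trained model loaded, the same call works on a `Doc` directly by dropping `manual=True` and passing the document in place of the dictionary.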

  
 
Exercise: find all the character names in a book

def read_file(file_name):
    with open(file_name, 'r') as file:
        return file.read()

# Load the text data
text = read_file('./data/pride_and_prejudice.txt')
processed_text = nlp(text)
sentences = [s for s in processed_text.sents]
print(len(sentences))
OUT:
6469

  
 
There are 6,469 sentences in total.

from collections import Counter

def find_person(doc):
    # Count PERSON entities by lemma and return the ten most frequent
    c = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            c[ent.lemma_] += 1
    return c.most_common(10)

print(find_person(processed_text))
OUT:
[('elizabeth', 604), ('darcy', 276), ('jane', 274), ('bennet', 233), ('bingley', 189), ('collins', 179), ('wickham', 170), ('gardiner', 95), ('lizzy', 94), ('lady catherine', 77)]
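
The list above counts 'elizabeth' and 'lizzy' separately even though both refer to the same character. One possible follow-up, sketched below with the standard library only, is to fold nicknames into a canonical name before ranking. The helper `merge_aliases` and the alias map are hand-written assumptions for illustration, and the counts are copied from the output above.

```python
from collections import Counter

def merge_aliases(counts, aliases):
    """Fold alias names into a canonical name and re-rank by total count."""
    merged = Counter()
    for name, n in counts:
        merged[aliases.get(name, name)] += n
    return merged.most_common()

# Counts copied from the output above.
top_persons = [
    ('elizabeth', 604), ('darcy', 276), ('jane', 274), ('bennet', 233),
    ('bingley', 189), ('collins', 179), ('wickham', 170), ('gardiner', 95),
    ('lizzy', 94), ('lady catherine', 77),
]
# Hypothetical hand-written alias map; extend it for other nicknames.
aliases = {'lizzy': 'elizabeth'}

merged = merge_aliases(top_persons, aliases)
print(merged[:3])
```

Ambiguous names such as 'bennet', which can refer to several family members, are left untouched here because a flat alias map cannot resolve them.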

  
 
Done.

Source: maoli.blog.csdn.net. Author: 刘润森！. Copyright belongs to the original author; please contact the author before reposting.

Original link: maoli.blog.csdn.net/article/details/88931036
