Word Segmentation with ICTCLAS2015

远航 | FIBOS, published 2020/12/02 01:05:29

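I worked on semantic analysis for this year's Imagine Cup, and that work needs word segmentation as its foundation. I used ICTCLAS2015 from the Chinese Academy of Sciences, and in this post I explain how to use it for word segmentation.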

ICTCLAS2015

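Introduction

Chinese lexical analysis is the foundation of, and the key to, Chinese information processing. Building on years of research, the Institute of Computing Technology of the Chinese Academy of Sciences developed the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System). Its main features are Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word detection, and it also supports user dictionaries. It was refined over five years, during which the kernel was upgraded six times, up to ICTCLAS3.0. ICTCLAS3.0 segments text at 996 KB/s on a single machine with a segmentation accuracy of 98.45%; the API is no larger than 200 KB, and the dictionary data takes less than 3 MB when compressed. It is billed as the best Chinese lexical analyzer in the world.

Download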

http://ictclas.nlpir.org/downloads

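Developing with ICTCLAS2015

Development platform used in this article

Operating system: Windows 8.1 x64
Language: Java
IDE: Eclipse

Development example

Preparation

Copy the Data folder and NLPIR.dll into the development directory.

Download the JNA library, jna-platform-4.1.0.jar. The classes used in the code, com.sun.jna.Library and com.sun.jna.Native, ship in the core jna jar, so that jar must be on the classpath as well.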
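Before loading the library, it can help to verify that the files from the preparation step are in place. The following is a minimal sketch and not part of the original post: the class NlpirSetupCheck and its method findMissing are hypothetical names, and the layout it checks (a Data folder in the working directory, NLPIR.dll under libs) is an assumption inferred from the "./libs/NLPIR" path and the empty data path used in this article's code.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: checks that the files copied during preparation exist.
final class NlpirSetupCheck {

    /**
     * Returns the paths of expected items that are missing.
     *
     * @param dataParent directory that should contain the Data folder (assumed layout)
     * @param libDir     directory that should contain NLPIR.dll (assumed layout)
     */
    static List<String> findMissing(Path dataParent, Path libDir) {
        List<String> missing = new ArrayList<>();
        Path dataDir = dataParent.resolve("Data");
        if (!Files.isDirectory(dataDir)) {
            missing.add(dataDir.toString());
        }
        Path dll = libDir.resolve("NLPIR.dll");
        if (!Files.isRegularFile(dll)) {
            missing.add(dll.toString());
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed layout: ./Data and ./libs/NLPIR.dll relative to the working directory
        List<String> missing = findMissing(Paths.get("."), Paths.get("libs"));
        if (missing.isEmpty()) {
            System.out.println("NLPIR files look complete.");
        } else {
            for (String path : missing) {
                System.out.println("Missing: " + path);
            }
        }
    }
}
```

Running the check once at startup gives a clearer message than the link error JNA raises when the DLL cannot be found.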

Calling the C++ interface through JNA

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    // JNA interface definition
    public interface CLibrary extends Library {
        // Create the instance (loads the NLPIR library from ./libs)
        CLibrary Instance = (CLibrary) Native.loadLibrary("./libs/NLPIR", CLibrary.class);

        // Initialize the system
        int NLPIR_Init(byte[] sDataPath, int encoding, byte[] sLicenceCode);

        // Process a paragraph
        String NLPIR_ParagraphProcess(String sSrc, int bPOSTagged);

        // Extract keywords
        String NLPIR_GetKeyWords(String sLine, int nMaxKeyLimit, boolean bWeightOut);

        // Shut down
        void NLPIR_Exit();

        // Process a file
        double NLPIR_FileProcess(String sSourceFilename, String sResultFilename, int bPOStagged);

        // Import a user-defined dictionary
        int NLPIR_ImportUserDict(String sFilename, boolean bOverwrite);

        // Add a user word and tag its part of speech
        int NLPIR_AddUserWord(String sWords);
    }

Segmenting a text and returning the part-of-speech-tagged result

    // Imports needed by this method
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    /**
     * Segments the text in a file and returns the part-of-speech-tagged result.
     *
     * @param fileName path of the UTF-8 text file to segment
     * @return a two-element array: [0] the tagged segmentation, [1] the keywords;
     *         null if NLPIR fails to initialize
     * @throws Exception if the charset name is not supported
     */
    public static String[] Segment(String fileName) throws Exception {
        String encoding = "UTF-8";
        StringBuilder source = new StringBuilder();

        // Read the text from the file
        File file = new File(fileName);
        if (file.isFile()) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), encoding))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    source.append(line); // line breaks are dropped
                }
            } catch (Exception e) {
                System.out.println("Error reading the file contents");
                e.printStackTrace();
            }
        } else {
            System.out.println("The specified file was not found");
        }
        String sourceString = source.toString();

        // Initialize NLPIR with an empty data path and encoding type 1
        String argu = "";
        int charsetType = 1;
        int initFlag = CLibrary.Instance.NLPIR_Init(
                argu.getBytes(encoding), charsetType, "1".getBytes(encoding));
        if (0 == initFlag) {
            System.out.println("init fail!");
            return null;
        }
        // Load the user words only after initialization has succeeded
        AddUserWords("dic/dic.txt");

        String segmented = null; // segmentation result
        String keywords = null;  // keywords
        try {
            segmented = CLibrary.Instance.NLPIR_ParagraphProcess(sourceString, 1);
            keywords = CLibrary.Instance.NLPIR_GetKeyWords(sourceString, 5, true);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return new String[] {segmented, keywords};
    }
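The tagged segmentation comes back as a single string, so callers usually need to split it into words and tags. The sketch below assumes that the string consists of whitespace-separated word/tag items (for example 分词/n); that format is an assumption based on commonly seen NLPIR output, so check it against your own results. TaggedTokenParser and Token are hypothetical helper names, and the sentence in main is made-up sample input, not real NLPIR output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits "word/tag word/tag ..." into tokens.
final class TaggedTokenParser {

    /** One segmented word with its part-of-speech tag (empty if no tag was found). */
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Parses a tagged string; returns an empty list for null or blank input. */
    static List<Token> parse(String tagged) {
        List<Token> tokens = new ArrayList<>();
        if (tagged == null) {
            return tokens;
        }
        for (String item : tagged.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue;
            }
            // Split on the last slash so that a word containing '/' stays intact
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                tokens.add(new Token(item, ""));
            } else {
                tokens.add(new Token(item.substring(0, slash), item.substring(slash + 1)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up sample in the assumed format
        for (Token token : parse("我/rr 爱/v 自然语言处理/n")) {
            System.out.println(token.word + " -> " + token.tag);
        }
    }
}
```

In the Segment method, the first element of the returned array would be the input to parse, assuming the output format matches.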

Adding user dictionary entries with part-of-speech tags

    /**
     * Adds the words from a user dictionary file, one entry per line.
     *
     * @param filePath path of the UTF-8 user dictionary file
     */
    public static void AddUserWords(String filePath) {
        String encoding = "UTF-8";
        File file = new File(filePath);
        if (!file.isFile()) {
            System.out.println("File not found!");
            return;
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), encoding))) {
            String lineText;
            while ((lineText = reader.readLine()) != null) {
                // Each line is passed to NLPIR unchanged
                CLibrary.Instance.NLPIR_AddUserWord(lineText);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
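A hand-edited dictionary file often contains blank lines, stray whitespace, or a UTF-8 byte order mark, all of which would otherwise be handed to NLPIR_AddUserWord as part of an entry. The sketch below shows one way to clean the entries first. UserDictReader and readEntries are hypothetical names, and the one-entry-per-line layout (a word followed by a tag, such as "分词 n") is an assumption about the dictionary file, which the original post does not show.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads cleaned entries from a user dictionary file.
final class UserDictReader {

    /** Returns the non-blank, trimmed lines of a UTF-8 dictionary file. */
    static List<String> readEntries(Path dictFile) throws IOException {
        List<String> entries = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(dictFile, StandardCharsets.UTF_8)) {
            String line;
            boolean firstLine = true;
            while ((line = reader.readLine()) != null) {
                // Drop a leading byte order mark, which trim() does not remove
                if (firstLine && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                firstLine = false;
                String entry = line.trim();
                if (!entry.isEmpty()) {
                    entries.add(entry);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the dictionary path used in this article
        Path dict = Paths.get(args.length > 0 ? args[0] : "dic/dic.txt");
        if (!Files.isRegularFile(dict)) {
            System.out.println("Dictionary not found: " + dict);
            return;
        }
        for (String entry : readEntries(dict)) {
            // Each entry could then be passed to CLibrary.Instance.NLPIR_AddUserWord(entry)
            System.out.println(entry);
        }
    }
}
```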

Source: blog.csdn.net. Author: 冰水比水冰. Copyright belongs to the original author; contact the author before reposting.

Original link: blog.csdn.net/luoyhang003/article/details/44586731
