在SAP HANA Express Edition里进行文本分析

举报
Jerry Wang 发表于 2022/03/28 18:56:55 2022/03/28
【摘要】 这个练习会使用SAP HANA Express Edition的文本语义分析引擎对JSON格式的documents进行语义分析。首先创建一个column table,对其index开启fuzzy text search(模糊搜索)功能。上述描述的操作可以用下面的SQL语句来完成:create column table food_analysis( name nvarchar(64), des...

这个练习会使用SAP HANA Express Edition的文本语义分析引擎对JSON格式的documents进行语义分析。

首先创建一个column table,对其index开启fuzzy text search(模糊搜索)功能。

上述描述的操作可以用下面的SQL语句来完成:

create column table food_analysis
(
	name nvarchar(64),
	description text FAST PREPROCESS ON FUZZY SEARCH INDEX ON
);

其中description字段开启了模糊搜索功能。

将存储于名为doc_store的document store collection里的json key-value键值对拷贝到刚刚创建的数据库表里:

insert into food_analysis
with doc_store as (select "name", "description" from food_collection)
select doc_store."name" as name, doc_store."description" as description
from doc_store;

执行上述的sql语句,确保数据全部拷贝到数据库表food_analysis中:

使用下列的sql语句对description字段进行模糊搜索:

select  name, score() as similarity, TO_VARCHAR(description)
from food_analysis
where contains(description, 'nuts', fuzzy(0.5,'textsearch=compare'))
order by similarity desc

执行结果:

HANA Express Edition里的linguistic 文本分析步骤也比较简单。

首先还是创建一个数据库表:

create column table food_sentiment
(
	name nvarchar(64) primary key,
	description nvarchar(2048)
);

将document store里的json数据拷贝到数据库表里:

insert into food_sentiment
with doc_store as (select "name", "description" from food_collection)
select doc_store."name" as name, doc_store."description" as description
from doc_store;

针对description字段创建一个新的index:

CREATE FULLTEXT INDEX FOOD_SENTIMENT_INDEX ON "FOOD_SENTIMENT" ("DESCRIPTION")
CONFIGURATION 'GRAMMATICAL_ROLE_ANALYSIS'
LANGUAGE DETECTION ('EN')
SEARCH ONLY OFF
FAST PREPROCESS OFF
TEXT MINING OFF
TOKEN SEPARATORS ''
TEXT ANALYSIS ON;

上述SQL语句会自动创建一个名为$TA_FOOD_SENTIMENT_INDEX的文本分析表:
该表里的内容:

由此可以发现,之前我们导入到数据库表里的英文句子,被HANA text engine拆解成单词,并且每个单词的词性也自动被HANA解析出来了。

通过csv文件提供的数据库表内容:

links.csv的格式:

movies.csv格式,一个movie可以有多种风格(genres),通过|分隔:

ratings.csv:

用户给movie打得分:

tags.csv:movie的标签

练习一:

列出四张表的总记录数:

select 'links'   as "table name", count(1) as "row count" from "MOVIELENS"."public.aa.movielens.hdb::data.LINKS"
union all
select 'movies'  as "table name", count(1) as "row count" from "MOVIELENS"."public.aa.movielens.hdb::data.MOVIES"
union all
select 'ratings' as "table name", count(1) as "row count" from "MOVIELENS"."public.aa.movielens.hdb::data.RATINGS"
union all
select 'tags'    as "table name", count(1) as "row count" from "MOVIELENS"."public.aa.movielens.hdb::data.TAGS";

执行结果:

练习2:计算总共9125部电影,一共包含多少艺术类别?

DO
BEGIN
  DECLARE genreArray NVARCHAR(255) ARRAY;
  DECLARE tmp NVARCHAR(255);
  DECLARE idx INTEGER;
  DECLARE sep NVARCHAR(1) := '|';
  DECLARE CURSOR cur FOR SELECT DISTINCT "GENRES" FROM "MOVIELENS"."public.aa.movielens.hdb::data.MOVIES";
  DECLARE genres NVARCHAR (255) := '';
  idx := 1;
  FOR cur_row AS cur() DO
    SELECT cur_row."GENRES" INTO genres FROM DUMMY;
    tmp := :genres;
    WHILE LOCATE(:tmp,:sep) > 0 DO
      genreArray[:idx] := SUBSTR_BEFORE(:tmp,:sep);
      tmp := SUBSTR_AFTER(:tmp,:sep);
      idx := :idx + 1;
    END WHILE;
    genreArray[:idx] := :tmp;
  END FOR;

  genreList = UNNEST(:genreArray) AS ("GENRE");
  SELECT "GENRE" FROM :genreList GROUP BY "GENRE";
END;

执行结果,总共包含18种:

练习3:计算每种艺术类别总共包含多少部电影:

DO
BEGIN
  DECLARE genreArray NVARCHAR(255) ARRAY;
  DECLARE tmp NVARCHAR(255);
  DECLARE idx INTEGER;
  DECLARE sep NVARCHAR(1) := '|';
  DECLARE CURSOR cur FOR SELECT DISTINCT "GENRES" FROM "MOVIELENS"."public.aa.movielens.hdb::data.MOVIES";
  DECLARE genres NVARCHAR (255) := '';
  idx := 1;
  FOR cur_row AS cur() DO
    SELECT cur_row."GENRES" INTO genres FROM DUMMY;
    tmp := :genres;
    WHILE LOCATE(:tmp,:sep) > 0 DO
      genreArray[:idx] := SUBSTR_BEFORE(:tmp,:sep);
      tmp := SUBSTR_AFTER(:tmp,:sep);
      idx := :idx + 1;
    END WHILE;
    genreArray[:idx] := :tmp;
  END FOR;

  genreList = UNNEST(:genreArray) AS ("GENRE");
  SELECT "GENRE", count(1) FROM :genreList GROUP BY "GENRE";
END;

练习4:列出每部电影包含的风格数目:

SELECT
    "MOVIEID"
  , "TITLE"
  , OCCURRENCES_REGEXPR('[|]' IN GENRES) + 1 "GENRE_COUNT"
  , "GENRES"
FROM "MOVIELENS"."public.aa.movielens.hdb::data.MOVIES"
ORDER BY "GENRE_COUNT" ASC;

练习5:罗列出每部电影的风格分布情况

SELECT
    "GENRE_COUNT"
  , COUNT(1)
FROM (
  SELECT
    OCCURRENCES_REGEXPR('[|]' IN "GENRES") + 1 "GENRE_COUNT"
  FROM "MOVIELENS"."public.aa.movielens.hdb::data.MOVIES"
)
GROUP BY "GENRE_COUNT" ORDER BY "GENRE_COUNT";

比如至少拥有1个风格的电影,有2793部,2个风格的电影有3039部,等等。

练习6:计算movie的rating分布情况

SELECT DISTINCT
  MIN("RATING_COUNT") OVER( ) AS "MIN",
  MAX("RATING_COUNT") OVER( ) AS "MAX",
  AVG("RATING_COUNT") OVER( ) AS "AVG",
  SUM("RATING_COUNT") OVER( ) AS "SUM",
  MEDIAN("RATING_COUNT") OVER( ) AS "MEDIAN",
  STDDEV("RATING_COUNT") OVER( ) AS "STDDEV",
  COUNT(*) OVER( ) AS "CATEGORY_COUNT"
FROM (
  SELECT "MOVIEID", COUNT(1) as "RATING_COUNT"
  FROM "MOVIELENS"."public.aa.movielens.hdb::data.RATINGS"
  GROUP BY "MOVIEID"
)
GROUP BY "RATING_COUNT";

明细情况:


SELECT "RATING_COUNT", COUNT(1) as "MOVIE_COUNT"
FROM (
  SELECT "MOVIEID", COUNT(1) as "RATING_COUNT"
  FROM "MOVIELENS"."public.aa.movielens.hdb::data.RATINGS"
  GROUP BY "MOVIEID"
)
GROUP BY "RATING_COUNT" ORDER BY "RATING_COUNT" asc;

比如有397部电影的用户投票数为5票

练习7:统计用户投票情况

SELECT "RATING_COUNT", COUNT(1) as "USER_COUNT"
FROM (
  SELECT "USERID", COUNT(1) as "RATING_COUNT"
  FROM "MOVIELENS"."public.aa.movielens.hdb::data.RATINGS"
  GROUP BY "USERID"
)
GROUP BY "RATING_COUNT" ORDER BY 1 DESC;

有一位用户投了2391票,一位用户投了1868票:

练习8:统计用户投票得分情况

SELECT "RATING", COUNT(1) as "RATING_COUNT"
FROM "MOVIELENS"."public.aa.movielens.hdb::data.RATINGS"
GROUP BY "RATING" ORDER BY 1 DESC;

有15095份用户投票,打的分数是5分

【版权声明】本文为华为云社区用户原创内容,转载时必须标注文章的来源(华为云社区)、文章链接、文章作者等基本信息, 否则作者和本社区有权追究责任。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。