Big Data: Getting Started with HBase

Posted by 没头脑 on 2022/06/17 21:23:32

Vocabulary

billed by traffic 按流量计费

contain synchronization information 包含同步信息

filter query 过滤查询

tenant management 租户管理

NIC (network interface card) 网卡

simulator 模拟器

heterogeneous storage 异构存储

Tasks
  1. In MRS, create a user/role and grant it the Hive admin privilege
  2. Create a Hive external table, load data, and count the rows
  3. Replace the old e-store a with e-store b
  4. OceanStor DeviceManager uses Huawei's latest
Basic HDFS, Hive, and HBase commands

Run a MapReduce job

# <jar package under the directory> <the jar's main class> <input> <output>
hadoop jar /opt/data/hadoop-mapreduce-examples-2.8.3.jar wordcount /user/ict01/item.log /user/ict01/output
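Once the job finishes, the counts land in the output directory; a quick check might look like this (a hedged sketch — part-r-00000 is the default reducer output file name and may differ):

# list the job output and show the first word counts
hdfs dfs -ls /user/ict01/output
hdfs dfs -cat /user/ict01/output/part-r-00000 | head -10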

National competition code

First initialize the client environment: source /opt/client/bigdata_env

Upload a local file to /ict01 in HDFS:
hdfs dfs -put /data/employee.txt /ict01

View the first 5 lines of the file in HDFS:
hdfs dfs -cat /ict01/employee.txt | head -5

Create a table in Hive:
create table employee(
user_id int,
username string,
dept_id int)
row format delimited fields terminated by ','
stored as textfile;


Query 5 rows in Hive:
select * from department limit 5;



Load the uploaded data from HDFS into the Hive table (load data inpath reads from HDFS; use load data local inpath for a local file):
load data inpath '/ict01/department.txt' into table department;

Create the external table employee:
create external table employee(id int,name string ,subject string,score  float)  row format delimited fields terminated by ',' stored as textfile ;


create table employee(
user_id int,
username string,
dept_id int)
row format delimited fields terminated by ','  
stored as textfile ;

create table department(
dept_id int,
dept_name string) 
row format delimited fields terminated by ','  
stored as textfile ;

select * from department  limit 5;
select * from  employee  limit 5;

When several tables are inner-joined, only the rows that match the on condition across all tables are returned. For example, the SQL below finds each employee's department: employee is joined to department on dept_id, so only rows whose dept_id values match are displayed.
Statement:
select e.username,e.dept_id,d.dept_name,d.dept_id
from employee e join department d 
on e.dept_id = d.dept_id
where e.dept_id = 1 ;

Equivalently, the filter can name both sides: where e.dept_id = 1 and d.dept_id = 1;

select * from cx_stu02 where gender ='male' limit 2;

select e.username,e.dept_id,d.dept_name,d.dept_id
from employee e join department d 
on e.dept_id = d.dept_id
limit 5;



A join can also span more than two tables. The SQL below queries each employee's name, department name, and salary:
Statement:
select e.username,d.dept_name,s.salarys 
from cx_table_employee e 
join cx_table_department d 
on e.dept_id = d.dept_id 
join cx_table_salary s 
on e.user_id = s.userid;


Global competition commands

Data preprocessing: drop rows where any of the fields userid, dept_id, or salarys is NULL

INSERT OVERWRITE TABLE result01
	select * from salary
	where userid is not null and dept_id is not null and salarys is not null;

Drop rows whose identity value falls outside the 0-100 range (keeping the in-range rows removes the out-of-range ones):

INSERT OVERWRITE TABLE result
	SELECT * FROM hittable
	WHERE identity BETWEEN 0 AND 100;

Count the total number of rows:

select count(*) from item;

Count the rows where the first product (item01) is Beer or Wine:

select count(*) as num from item where item01 ='Beer' or item01 = 'Wine';

Rank the best-selling products among first purchases (item01):

select item01,count(item01) as num from item 
group by item01 
order by num desc;

Query the probability that Milk appears among the purchased items on each line, take a screenshot, and save it as 4.4.1:

select b.num/a.num as rate from(
select count(*) num from item) a,
(select count(*) num from item
where item01='Milk'
or item02='Milk'
or item03='Milk'
or item04='Milk') b;

Collect statistics on how many of each product were sold and classify the products. The classification rule, as encoded below: a total under 800 is 'cold', 800 to under 1000 is 'warm', and 1000 or more is 'hot'. The output columns are item, total, and sort.

select b.item,b.total,case
when total<800 then 'cold'
when total>=800 and total <1000 then 'warm'
else 'hot'
end as sort from(
select item,sum(num) total from(
select item01 as item,count(item01)as num from item group by item01
union all
select item02 as item,count(item02) as num from item group by item02
union all
select item03 as item,count(item03) as num from item group by item03
union all
select item04 as item,count(item04) as num from item group by item04) a group by item) b;

During the Hive stage, save the previous query result as a result table, then query data from the result table:

create table result as select b.item,b.total,case
when total<800 then 'cold'
when total>=800 and total <1000 then 'warm'
else 'hot'
end as sort from(
select item,sum(num) total from (
select item01 as item,count(item01) as num from item group by item01
union all
select item02 as item,count(item02) as num from item group by item02
union all
select item03 as item,count(item03) as num from item group by item03
union all
select item04 as item,count(item04) as num from item group by item04) a group by item) b;

Task 1: similarity detection
In the result table, count the rows whose base-pair alignment length (aliLen) is between 29800 and 29900; screenshot and save as 3-1-alilen. Also sort the result table by match rate (identity) in descending order and display the first ten rows of the sav, identity, and bitScore fields; screenshot and save as 3-2-bitscore (see the sketch after the queries below).
Task 2: grouped statistics
Group the result table by gapOpens, count the occurrences, sort in descending order, display gapOpens and its count (gapNum), and show the first 5 rows; screenshot and save as 3-3-top5.

select count(*) from result where aliLen between 29800 and 29900;
select gapOpens, count(*) as gapNum from result group by gapOpens order by gapNum desc limit 5;
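The 3-2-bitscore half of Task 1 has no query in the original notes; a minimal sketch, using only the column names given in the task:

select sav, identity, bitScore from result order by identity desc limit 10;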

In the result table, count the rows where qstart = sstart and qend = send, and the rows where qstart != sstart or qend != send; screenshot and save the result as 3-4-start.
The output should take the form:

equal num
yes ***
no ***
select 'yes' as equal, count(1) as num from result where qstart=sstart and qend=send
union all
select 'no' as equal, count(1) as num from result where qstart!=sstart or qend!=send;

In the result table, bucket the match rate into grades: 0%-60% is E, 60%-70% is D, 70%-80% is C, 80%-90% is B, and 90%-100% is A. Count each grade and sort ascending by the per-grade count; screenshot and save as 3-5-group.
The output should take the form (actual values vary):

identity total
A **
B **
C **
D **
E **
select identity, count(1) as total from (
select case when identity >= 90 then 'A'
when identity >= 80 then 'B'
when identity >= 70 then 'C'
when identity >= 60 then 'D'
else 'E'
end as identity from result) t
group by identity
order by total;

HBase

Create an internal-table mapping from HBase to Hive: the HBase table is named HBASE_ExamID and the Hive table Hive_ExamID. Export the data in the result table to the Hive table Hive_ExamID and the HBase table HBASE_ExamID, then query the data in HBASE_ExamID.

Tip: Sqoop in Hue is the suggested route; a HiveQL alternative is sketched below.
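A minimal HiveQL sketch of the mapping (the val column and the cf column family are assumptions for illustration; the national-competition section below shows a fuller worked example of the same pattern):

-- create the Hive table backed by a new HBase table, then copy data in
CREATE TABLE Hive_ExamID(key string, val string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES("hbase.table.name" = "HBASE_ExamID");
-- hypothetical result columns; adjust to the real schema
INSERT INTO TABLE Hive_ExamID SELECT id, val FROM result;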

Global competition

Insert a data record into the table. After the insert succeeds, query the record, take a screenshot, and save it as 5.2.1-INSERT.

put 'hbase_ict001','chips','cf1:total','635'
put 'hbase_ict001','chips','cf1:sort','warm'
get 'hbase_ict001','chips'

Correct the category of the inserted data according to the earlier classification rule (3.2.3.4.4, Exercise 2); a sketch of the fix follows the get below.

get 'hbase_ict001','chips'
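The notes only show the get; under the classification rule above, a total of 635 falls below 800 and should be 'cold', so the correcting write would plausibly be:

put 'hbase_ict001','chips','cf1:sort','cold'
# re-run the get above to confirm the new value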

Query the rows in the table whose value is warm:

scan 'hbase_ict001', {FILTER => "ValueFilter(=,'binary:warm')"}

National competition

Create a Hive table hive_examID that maps the HBase table hbase_examID, with the six fields sav, identity, aliLen, mismatches, evalue, and bitScore, and import the data from the result table into it. After the import succeeds, query the first 5 rows; screenshot and save as 4-1-hive_examID.

CREATE EXTERNAL TABLE hive_examID(
key string comment "hbase rowkey",
sav string,
identity decimal,
aliLen int,
mismatches int,
evalue int,
bitScore int
)STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,record:sav,record:identity,record:aliLen,record:mismatches,record:evalue,record:bitScore")
TBLPROPERTIES("hbase.table.name" = "hbase_examID");
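The notes stop at the mapping; importing from result and sampling the rows might look like this (a sketch that assumes sav doubles as the rowkey, which the task does not state):

-- copy the result data through the mapping, then check the first rows
INSERT INTO TABLE hive_examID
SELECT sav, sav, identity, aliLen, mismatches, evalue, bitScore FROM result;
SELECT * FROM hive_examID LIMIT 5;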

# first fetch the Hive metadata (the table list), filtering out helper tables
ret=$(hive -e 'show tables;' | grep -v _es | grep -v _hb | grep -v importinfo)

Export the data to HDFS:

# export each listed table (e.g. hive_examID) to HDFS, then bulk-load into HBase
for tem in $ret; do
  hive -e "export table $tem to '/tmp/hive-export/$tem';"
done
hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /tmp/hive-export/hive_examID hbase_examID

Query the first three rows of the hbase_examID table in HBase; screenshot and save the displayed result as 4-4-head3.

Count the total number of rows in hbase_examID; screenshot and save as 4-5-sum.

scan 'hbase_examID', {LIMIT => 3}
count 'hbase_examID'

Because of a data-entry mistake, the row with rowkey AY394998.3 in the HBase table hbase_examID must be corrected: set the bitScore column under column family info to 13936, then query the columns of that rowkey.

# put '<table>','<rowkey>','<family:column>','<value>'
put 'hbase_examID','AY394998.3','info:bitScore','13936'
# delete at a specific timestamp
delete 'test:user', '1', 'b:name', 1611678810121 # note: without a timestamp, it is Jack (the latest version) that gets deleted
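The task also asks to query the row after the fix; a plain get on the rowkey covers it:

get 'hbase_examID','AY394998.3'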

Query the primary identifier (sav) of the rows in hbase_examID whose similarity is 100% (identity = 100.0); screenshot and save the query result as 4-7-filter. A get fetches by rowkey only, so this needs a scan with a value filter:

scan 'hbase_examID', {FILTER => "SingleColumnValueFilter('info','identity',=,'binary:100.0')"}

Storage initialization

# install the iSCSI initiator via yum
yum install iscsi-initiator-utils

# UltraPath must be installed after iSCSI

# allocate storage resources; run target discovery on Linux to scan the remote end
iscsiadm -m discovery -t st -p 10.0.2.101:3260
# this yields the target IQN, e.g.
# 10.0.2.101:3260,513_iqn.2014-08.com.example::2100020000040506::20200:10.0.2.101

# log in to the target: -T takes the IQN obtained in the previous step, -p the portal, -l logs in
iscsiadm -m node -T iqn.2014-08.com.example::2100020000040506::20200:10.0.2.101 -p 10.0.2.101:3260 -l

# the iSCSI initiator is now in place; on the array, create a host whose address is the initiator IQN, and add the initiator to that host

# then map the host to the LUN group created earlier, and create a port group (either P0 or P1 works)

# show iSCSI session information
iscsiadm -m session

# rescan the sessions to refresh the LUNs and see the attached devices
iscsiadm -m session -R

# lsblk shows the storage; lsscsi shows the mapped devices

AI code

import numpy as np

# slice the per-date columns (column 5 onward) out of each dataframe
confirmed = confirmed_df.loc[:, cols[4]:cols[-1]]
deaths = deaths_df.loc[:, cols[4]:cols[-1]]
recoveries = recoveries_df.loc[:, cols[4]:cols[-1]]

# build the list of date columns
dates = confirmed.keys()
world_cases = []
total_deaths = []
mortality_rate = []
total_recovered = []
for i in dates:
    confirmed_sum = confirmed[i].sum()
    death_sum = deaths[i].sum()
    recovered_sum = recoveries[i].sum()
    world_cases.append(confirmed_sum)
    total_deaths.append(death_sum)
    mortality_rate.append(death_sum / confirmed_sum)
    total_recovered.append(recovered_sum)

days_since_1_22 = np.array([i for i in range(len(dates))]).reshape(-1, 1)
world_cases = np.array(world_cases).reshape(-1, 1)
total_deaths = np.array(total_deaths).reshape(-1, 1)
total_recovered = np.array(total_recovered).reshape(-1, 1)
print("22confirmed:", str(int(world_cases[0][0])), "people")
print("22deaths:", str(int(total_deaths[0][0])), "people")
print("22recovered:", str(int(total_recovered[0][0])), "people")
print("29confirmed:", str(int(world_cases[-1][0])), "people")
print("29deaths:", str(int(total_deaths[-1][0])), "people")
print("29recovered:", str(int(total_recovered[-1][0])), "people")

import datetime
from sklearn.model_selection import RandomizedSearchCV, train_test_split

days_in_future = 3
future_forcast = np.array([i for i in range(len(dates) + days_in_future)]).reshape(-1, 1)
start = '1/22/2020'
start_date = datetime.datetime.strptime(start, '%m/%d/%Y')
future_forcast_dates = []
for i in range(len(future_forcast)):
    future_forcast_dates.append((start_date + datetime.timedelta(days=i)).strftime('%m/%d/%Y'))
# print(future_forcast_dates)
adjusted_dates = future_forcast_dates[:-3]
X_train_confirmed, X_test_confirmed, y_train_confirmed, y_test_confirmed = train_test_split(
    days_since_1_22, world_cases, test_size=0.15, shuffle=False)

import datetime
import numpy as np
import sklearn.model_selection
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

days_since_1_22 = np.array([i for i in range(len(dates))]).reshape(-1, 1)
world_cases = np.array(world_cases).reshape(-1, 1)
total_deaths = np.array(total_deaths).reshape(-1, 1)
total_recovered = np.array(total_recovered).reshape(-1, 1)
days_in_future = 15
future_forcast = np.array([i for i in range(len(dates) + days_in_future)]).reshape(-1, 1)
adjusted_dates = future_forcast[:-15]
start = '1/22/2020'
start_date = datetime.datetime.strptime(start, '%m/%d/%Y')
future_forcast_dates = []
for i in range(len(future_forcast)):
    future_forcast_dates.append((start_date + datetime.timedelta(days=i)).strftime('%m/%d/%Y'))
X_train_confirmed, X_test_confirmed, y_train_confirmed, y_test_confirmed = \
    sklearn.model_selection.train_test_split(days_since_1_22, world_cases,
                                             test_size=0.15, shuffle=False)

def preprocess(array):
    # scale values into [0, 1] and return the fitted scaler for inverse transforms
    from sklearn.preprocessing import MinMaxScaler
    scl = MinMaxScaler()
    array = scl.fit_transform(array)
    return array, scl

days_since_1_22 = np.array([i for i in range(len(dates))]).reshape(-1, 1)
print(days_since_1_22)
normalize_data, scl = preprocess(days_since_1_22)
print(normalize_data[1])

def sample_to_data(data, look_back=40, forward_days=10, jump=1):
    # slide a window over the series: look_back steps as input, forward_days as target
    X, Y = [], []
    for i in range(0, len(data) - look_back - forward_days + 1, jump):
        X.append(data[i:(i + look_back)])
        Y.append(data[(i + look_back):(i + look_back + forward_days)])
    Y = np.squeeze(Y, axis=(2,))
    return np.array(X), np.array(Y)

look_back = 40
forward_days = 10
x_train, y_train = sample_to_data(normalize_data, look_back, forward_days)
print(x_train[1])
x_test, y_test = sample_to_data(normalize_data, look_back, forward_days)

def build_model(look_back, N1, N2, forward_days):
    # two stacked GRU layers followed by a dense layer emitting forward_days predictions
    input_layer = tf.keras.layers.Input(shape=(look_back, 1))
    gru1 = tf.keras.layers.GRU(units=N1, activation='relu', return_sequences=True)(input_layer)
    gru2 = tf.keras.layers.GRU(units=N2)(gru1)
    dense = tf.keras.layers.Dense(units=forward_days)(gru2)
    model = tf.keras.models.Model(inputs=input_layer, outputs=dense)
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.summary()
    return model

model = build_model(look_back, 80, 100, forward_days)
model.fit(x_train, y_train, epochs=100, shuffle=True, batch_size=30, verbose=2)
model.save('gru_model.h5')

x_test.shape, y_test.shape
from tensorflow.keras.models import load_model
model = load_model('gru_model.h5')
y_test_pred = model.predict(x=x_test)
y_test_pred.shape
plt.plot(y_test[0], label='true')
plt.plot(y_test_pred[0], label='pred')
plt.legend()
print(y_test[0].shape)
print('MAE:', mean_absolute_error(y_test_pred[0], y_test[0]))
print('MSE:', mean_squared_error(y_test_pred[0], y_test[0]))



