How to Count the Rows in an HBase Table

WHYBIGDATA, posted 2023/01/22 14:39:18

[Abstract] How to count the rows in an HBase table

0. Preface

Linux version: Ubuntu Kylin 16.04
Hadoop version: Hadoop-2.7.2
ZooKeeper version: bundled with HBase
HBase version: HBase-1.1.5
Hive version: Hive-2.1.0

1. The HBase Shell count command

hbase(main):017:0> help 'count'
Count the number of rows in a table.  Return value is the number of rows.
This operation may take a LONG time (Run '$HADOOP_HOME/bin/hadoop jar
hbase.jar rowcount' to run a counting mapreduce job). Current count is shown
every 1000 rows by default. Count interval may be optionally specified. Scan
caching is enabled on count scans by default. Default cache size is 10 rows.
If your rows are small in size, you may want to increase this
parameter. Examples:

 hbase> count 'ns1:t1'
 hbase> count 't1'
 hbase> count 't1', INTERVAL => 100000
 hbase> count 't1', CACHE => 1000
 hbase> count 't1', INTERVAL => 10, CACHE => 1000

The same commands also can be run on a table reference. Suppose you had a reference
t to table 't1', the corresponding commands would be:

 hbase> t.count
 hbase> t.count INTERVAL => 100000
 hbase> t.count CACHE => 1000
 hbase> t.count INTERVAL => 10, CACHE => 1000

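As the help text shows, counting the rows of a table with count can take a long time. For large tables it suggests running '$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount' to launch a counting MapReduce job instead.

By default the running count is printed every 1000 rows, and the interval can be changed with INTERVAL. Scan caching is enabled for count scans by default, with a default cache size of 10 rows, which can be raised with CACHE.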
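To see why the help text suggests raising CACHE when rows are small, here is a back-of-the-envelope sketch in plain Java. It does not call HBase. It assumes each scanner request returns at most CACHE rows and ignores other limits such as the maximum result size, so the figures are rough estimates only. The class name, method name, and table size are made up for illustration.

```java
public class ScanCachingEstimate {
    // Rough number of scanner requests needed to stream `rows` rows when
    // each request returns at most `cache` rows (ceiling division).
    public static long roundTrips(long rows, int cache) {
        if (rows < 0 || cache <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and cache must be > 0");
        }
        return (rows + cache - 1) / cache;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical table size
        System.out.println("CACHE => 10:   " + roundTrips(rows, 10) + " requests");
        System.out.println("CACHE => 1000: " + roundTrips(rows, 1000) + " requests");
    }
}
```

Under these assumptions, moving from the default cache of 10 to 1000 cuts the number of requests by a factor of 100, at the cost of more memory per request.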

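2. Counting rows with a Scan

Using the Java API, run a full-table scan and count the rows in a loop. This is still slow, but it is faster than the shell count command in section 1.

The basic code is as follows: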

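import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
// Assumption: getTotalTimeMillis() suggests Spring's StopWatch
import org.springframework.util.StopWatch;

public void rowCountByScanFilter(String tablename) {
    long rowCount = 0;
    // Timing
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    TableName name = TableName.valueOf(tablename);
    Scan scan = new Scan();
    // FirstKeyOnlyFilter returns only the first KeyValue of each row, which speeds up counting
    scan.setFilter(new FirstKeyOnlyFilter());

    // connection is a static field of the class;
    // try-with-resources closes the table and the scanner
    try (Table table = connection.getTable(name);
         ResultScanner rs = table.getScanner(scan)) {
        for (Result result : rs) {
            // One Result per row, so count Results rather than cells
            rowCount++;
        }

        stopWatch.stop();
        System.out.println("RowCount: " + rowCount);
        System.out.println("Elapsed (ms): " + stopWatch.getTotalTimeMillis());
    } catch (Exception e) {
        e.printStackTrace();
    }
}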

3. Running a MapReduce job

zhangsan@node01:/usr/local/hbase-1.1.5/bin$ ./hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'yourtablename'

This approach is more efficient than the first one. It invokes the row-counting class that ships in the HBase jar.
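RowCounter reports its result as a job counter in the MapReduce output rather than as a single printed number. The sketch below pulls that number out of captured job output in plain Java. It assumes the counter summary contains a line of the form ROWS=<n>. The class name is hypothetical and the sample text in main is made up for illustration, not real job output.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowCounterOutput {
    // Matches a counter line such as "\t\tROWS=12345" (assumed format).
    private static final Pattern ROWS_LINE =
            Pattern.compile("^[ \\t]*ROWS=(\\d+)[ \\t]*$", Pattern.MULTILINE);

    // Returns the ROWS counter if the output contains one, otherwise empty.
    public static OptionalLong parseRows(String jobOutput) {
        Matcher m = ROWS_LINE.matcher(jobOutput);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Illustrative sample only
        String sample =
                "\torg.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters\n"
              + "\t\tROWS=3\n";
        System.out.println("Parsed row count: " + parseRows(sample));
    }
}
```

Returning an OptionalLong keeps "no counter found" distinct from a genuine count of zero.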

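4. Integrating Hive with HBase

By creating a Hive table that is mapped to an HBase table, we can run an SQL statement directly in Hive to count the rows of the HBase table.

Start HDFS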

zhangsan@node01:/usr/local/hadoop-2.7.2/sbin$ ./start-dfs.sh

Start HBase

zhangsan@node01:/usr/local/hbase-1.1.5/bin$ ./start-hbase.sh
zhangsan@node01:/usr/local/hbase/bin$ jps
3648 Jps
2737 DataNode
3555 HRegionServer
2948 SecondaryNameNode
3337 HQuorumPeer
2604 NameNode
3436 HMaster

Start the hiveserver2 service

zhangsan@node01:/usr/local/hive-2.1.0/bin$ hiveserver2

Start the HBase Shell and create a table

zhangsan@node01:/usr/local/hbase-1.1.5/bin$ hbase shell
# Create the HBase table
create 'hbase_hive_test', 'cf1'

Create the mapping table in Hive

zhangsan@node01:/usr/local/hive-2.1.0/bin$ hive

hive>create table hive_hbase_test(key int,value string)
    >stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
    >with serdeproperties("hbase.columns.mapping"=":key,cf1:val") 
    >tblproperties("hbase.table.name"="hive_hbase_test");
OK
Time taken: 8.018 seconds

Check in HBase that the mapped table exists

hbase(main):001:0>  list
TABLE                                                                     
hive_hbase_test                                                         
1 row(s) in 0.6800 seconds
=> ["hive_hbase_test"]
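With the mapping in place, the count itself is an ordinary SELECT COUNT(*) against the Hive table. The following sketch issues it over JDBC to the hiveserver2 service started earlier. It assumes the Hive JDBC driver is on the classpath, that hiveserver2 listens on localhost:10000 with the default database, and that no authentication is required. The class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveRowCount {
    // Builds the COUNT(*) statement; only simple identifiers are accepted
    // so that the table name cannot smuggle in extra SQL.
    public static String countQuery(String table) {
        if (!table.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("unexpected table name: " + table);
        }
        return "SELECT COUNT(*) FROM " + table;
    }

    // Runs the count through JDBC and returns the single result value.
    public static long countRows(String jdbcUrl, String table) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(countQuery(table))) {
            if (!rs.next()) {
                throw new SQLException("COUNT(*) returned no row");
            }
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) {
        String url = "jdbc:hive2://localhost:10000/default"; // assumed endpoint
        try {
            System.out.println("RowCount: " + countRows(url, "hive_hbase_test"));
        } catch (SQLException e) {
            System.err.println("Could not query Hive: " + e.getMessage());
        }
    }
}
```

The same statement can be typed at the hive> prompt. Either way Hive typically executes it as a job that scans the HBase table, so this is a convenience rather than a shortcut around the full scan.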

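5. Using a Coprocessor

This is currently the fastest way to count the rows in a table.

Why is counting so fast with a coprocessor?

Once the coprocessor is registered on the table, running AggregationClient distributes the row count across every region of the table. The count inside each region is computed through an RPC call, with the RegionServer that hosts the region running an InternalScanner.

The performance gain therefore has two sources:

1. Distributed counting. Instead of a single client scanning by rowkey range and counting on its own, all the RegionServers hosting the table's regions compute their counts at the same time.

2. The count runs inside the RegionServer using InternalScanner. This is the scanner interface closest to the actual storage, so access is faster.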

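import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
// Assumption: getTotalTimeMillis() suggests Spring's StopWatch
import org.springframework.util.StopWatch;

public void rowCountByCoprocessor(String tablename) {
    TableName name = TableName.valueOf(tablename);
    String coprocessorClass = "org.apache.hadoop.hbase.coprocessor.AggregateImplementation";
    // connection and conf must be created beforehand
    try (Admin admin = connection.getAdmin()) {
        HTableDescriptor descriptor = admin.getTableDescriptor(name);
        if (!descriptor.hasCoprocessor(coprocessorClass)) {
            // Disable the table, add the coprocessor, then enable the table again.
            // Skipped when the coprocessor is already registered.
            admin.disableTable(name);
            descriptor.addCoprocessor(coprocessorClass);
            admin.modifyTable(name, descriptor);
            admin.enableTable(name);
        }

        // Timing
        StopWatch stopWatch = new StopWatch();
        stopWatch.start();

        Scan scan = new Scan();
        AggregationClient aggregationClient = new AggregationClient(conf);

        System.out.println("RowCount: " + aggregationClient.rowCount(name, new LongColumnInterpreter(), scan));
        stopWatch.stop();
        System.out.println("Elapsed (ms): " + stopWatch.getTotalTimeMillis());
    } catch (Throwable e) {
        // rowCount is declared to throw Throwable
        e.printStackTrace();
    }
}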
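The division of work described in this section can be pictured with a small plain-Java simulation. It is not HBase code and uses no HBase API: each inner list stands in for the rows of one region, each task plays the role of a RegionServer counting its own region, and the caller only adds up the partial counts. All names and the sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RegionCountSimulation {
    // Counts every "region" concurrently and sums the partial results.
    public static long parallelCount(List<List<String>> regions) {
        if (regions.isEmpty()) {
            return 0L; // a fixed thread pool cannot have zero threads
        }
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (List<String> region : regions) {
                tasks.add(() -> {
                    long n = 0;
                    for (String rowKey : region) {
                        n++; // local scan: only the count leaves the "server"
                    }
                    return n;
                });
            }
            long total = 0;
            for (Future<Long> partial : pool.invokeAll(tasks)) {
                total += partial.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("counting failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> regions = Arrays.asList(
                Arrays.asList("row1", "row2"),
                Arrays.asList("row3", "row4", "row5"));
        System.out.println("Total rows: " + parallelCount(regions));
    }
}
```

Only one number per region travels back to the caller, which mirrors why the coprocessor approach moves far less data than a client-side scan.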
