一个HBase偶现Get慢查询responseTooSlow的分析定位

举报
Lettle whale 发表于 2020/09/22 09:57:20 2020/09/22
【摘要】 分享一个HBase查询缓慢的案例


现象: 某集群出现的Get的慢查询,偶然出现,一旦出现时间会在3分钟以上,一般hbase的get在20ms以内,集群明显存出现了问题。 

2020-04-01 13:56:59,531 | WARN | RpcServer.FifoWFPBQ.default.handler=5,queue=5,port=16020 | (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1585720435091,"responsesize":71,"method":"Get","param":"region= effect_device_to_gid_android,4000d,1582821419838.415e7569c57a4d5fd63f174917f5c72b., row=401d54b8de049e2c5adcbc62f66a1c1d:3","processingtimems":184439,"client":"XXX.XXX.XXX.XXX:48696","queuetimems":0,"class":"HRegionServer"} | RpcServer.java:2546


定位


此时RS的日志里面没有更多的日志,需要打开trace日志才能定位问题

1.在日志里继续搜索,查询row=401d54b8de049e2c5adcbc62f66a1c1d:3,搜索到其中一条日志包含callId: 8315573

2020-04-01 13:56:59,530 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | callId: 8315573 service: ClientService methodName: Get size: 155 connection: XXX.XXX.XXX.XXX:48696 param: region= effect_device_to_gid_android,4000d,1582821419838.415e7569c57a4d5fd63f174917f5c72b., row=401d54b8de049e2c5adcbc62f66a1c1d:3 connection: XXX.XXX.XXX.XXX:48696, response result { cell { row: "401d54b8de049e2c5adcbc62f66a1c1d:3" family: "p" qualifier: "gid" timestamp: 1579519444329 cell_type: PUT value: "\000\000\000\000|\345qK" } stale: false partial: false } queueTime: 0 processingTime: 242767 totalTime: 242767 | org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2452)

2.根据callId: 8315573,搜索日志,可以看到的整个rpc持续了3分钟,确定了此次rpc出现的问题

2020-04-01 13:52:56,763 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | callId: 8315573 service: ClientService methodName: Get size: 155 connection: XXX.XXX.XXX.XXX:48696 executing as root | org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:105)
2020-04-01 13:56:59,530 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | callId: 8315573 service: ClientService methodName: Get size: 155 connection: XXX.XXX.XXX.XXX:48696 param: region= effect_device_to_gid_android,4000d,1582821419838.415e7569c57a4d5fd63f174917f5c72b., row=401d54b8de049e2c5adcbc62f66a1c1d:3 connection: XXX.XXX.XXX.XXX:48696, response result { cell { row: "401d54b8de049e2c5adcbc62f66a1c1d:3" family: "p" qualifier: "gid" timestamp: 1579519444329 cell_type: PUT value: "\000\000\000\000|\345qK" } stale: false partial: false } queueTime: 0 processingTime: 242767 totalTime: 242767 | org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2452)


3.根据RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020去搜索,确定对应时间范围handler处理的过程,通过下面日志可以确定,读取block花费了3分钟

 2020-04-01 13:52:56,763 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | callId: 8315573 service: ClientService methodName: Get size: 155 connection: XXX.XXX.XXX.XXX:48696 executing as root | org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:105)
2020-04-01 13:52:56,764 | DEBUG | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | Connecting to datanode XXX.XXX.XXX.XXX:9866 | org.apache.hadoop.hdfs.DFSInputStream.getBestNodeDNAddrPair(DFSInputStream.java:1125)
2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | Name lookup for XXX.XXX.XXX.XXX took 0 ms. | org.apache.hadoop.security.SecurityUtil.getByName(SecurityUtil.java:538)
2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | BlockReaderFactory(fileName=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306): trying to construct a BlockReaderLocal for short-circuit reads. | org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:472)
2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | Address /XXX.XXX.XXX.XXX:9866 is local | org.apache.hadoop.hdfs.DFSUtilClient.isLocalAddress(DFSUtilClient.java:490)
2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | ShortCircuitCache(0x2d57116a): found waitable for 1076149130_BP-1990332537-XXX.XXX.XXX.XXX-1575543614969 | org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetch(ShortCircuitCache.java:721)
2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | ShortCircuitReplica{key=1076149130_BP-1990332537-XXX.XXX.XXX.XXX-1575543614969, metaHeader.version=1, metaHeader.checksum=DataChecksum(type=CRC32C, chunkSize=512), ident=0x7a471444, creationTimeMs=10177189191}: checked shared memory segment. isStale=false | org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.isStale(ShortCircuitReplica.java:156)
2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | ShortCircuitCache(0x2d57116a): evictable no longer contains ShortCircuitReplica{key=1076149130_BP-1990332537-XXX.XXX.XXX.XXX-1575543614969, metaHeader.version=1, metaHeader.checksum=DataChecksum(type=CRC32C, chunkSize=512), ident=0x7a471444, creationTimeMs=10177189191}. refCount 1 -> 2java.lang.Thread.getStackTrace(Thread.java:1559)
2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | BlockReaderFactory(fileName=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306): returning new block reader local. | org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:369)
2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | read(arr.length=5184, off=0, len=5184, filename=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306, canSkipChecksum=true): starting | org.apache.hadoop.hdfs.client.impl.BlockReaderLocal.read(BlockReaderLocal.java:552)
2020-04-01 13:56:59,529 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | loaded 130973 bytes into bounce buffer from offset 92807267 of BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306 | org.apache.hadoop.hdfs.client.impl.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:477)
2020-04-01 13:56:59,529 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | read(arr.length=5184, off=0, len=5184, filename=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306, canSkipChecksum=true): returning 5184 | org.apache.hadoop.hdfs.client.impl.BlockReaderLocal.read(BlockReaderLocal.java:565)
2020-04-01 13:56:59,529 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | close(filename=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306) | org.apache.hadoop.hdfs.client.impl.BlockReaderLocal.close(BlockReaderLocal.java:626)

4,后续定位就是在HDFS处读取block慢,这块排查HDFS,后发现这RS节点对应的其中一个本地磁盘坏了,需要退服DataNode更换磁盘





【版权声明】本文为华为云社区用户转载文章,如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。