一个HBase偶现Get慢查询responseTooSlow的分析定位
【摘要】 分享一个HBase查询缓慢的案例
现象: 某集群出现的Get的慢查询,偶然出现,一旦出现时间会在3分钟以上,一般hbase的get在20ms以内,集群明显存出现了问题。
2020-04-01 13:56:59,531 | WARN | RpcServer.FifoWFPBQ.default.handler=5,queue=5,port=16020 | (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1585720435091,"responsesize":71,"method":"Get","param":"region= effect_device_to_gid_android,4000d,1582821419838.415e7569c57a4d5fd63f174917f5c72b., row=401d54b8de049e2c5adcbc62f66a1c1d:3","processingtimems":184439,"client":"XXX.XXX.XXX.XXX:48696","queuetimems":0,"class":"HRegionServer"} | RpcServer.java:2546 |
定位:
此时RS的日志里面没有更多的日志,需要打开trace日志才能定位问题
1.在日志里继续搜索,查询row=401d54b8de049e2c5adcbc62f66a1c1d:3,搜索到其中一条日志包含callId: 8315573
2020-04-01 13:56:59,530 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | callId: 8315573 service: ClientService methodName: Get size: 155 connection: XXX.XXX.XXX.XXX:48696 param: region= effect_device_to_gid_android,4000d,1582821419838.415e7569c57a4d5fd63f174917f5c72b., row=401d54b8de049e2c5adcbc62f66a1c1d:3 connection: XXX.XXX.XXX.XXX:48696, response result { cell { row: "401d54b8de049e2c5adcbc62f66a1c1d:3" family: "p" qualifier: "gid" timestamp: 1579519444329 cell_type: PUT value: "\000\000\000\000|\345qK" } stale: false partial: false } queueTime: 0 processingTime: 242767 totalTime: 242767 | org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2452) |
2.根据callId: 8315573,搜索日志,可以看到的整个rpc持续了3分钟,确定了此次rpc出现的问题
2020-04-01 13:52:56,763 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | callId: 8315573 service: ClientService methodName: Get size: 155 connection: XXX.XXX.XXX.XXX:48696 executing as root | org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:105) 2020-04-01 13:56:59,530 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | callId: 8315573 service: ClientService methodName: Get size: 155 connection: XXX.XXX.XXX.XXX:48696 param: region= effect_device_to_gid_android,4000d,1582821419838.415e7569c57a4d5fd63f174917f5c72b., row=401d54b8de049e2c5adcbc62f66a1c1d:3 connection: XXX.XXX.XXX.XXX:48696, response result { cell { row: "401d54b8de049e2c5adcbc62f66a1c1d:3" family: "p" qualifier: "gid" timestamp: 1579519444329 cell_type: PUT value: "\000\000\000\000|\345qK" } stale: false partial: false } queueTime: 0 processingTime: 242767 totalTime: 242767 | org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2452) |
3.根据RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020去搜索,确定对应时间范围handler处理的过程,通过下面日志可以确定,读取block花费了3分钟
2020-04-01 13:52:56,763 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | callId: 8315573 service: ClientService methodName: Get size: 155 connection: XXX.XXX.XXX.XXX:48696 executing as root | org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:105) 2020-04-01 13:52:56,764 | DEBUG | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | Connecting to datanode XXX.XXX.XXX.XXX:9866 | org.apache.hadoop.hdfs.DFSInputStream.getBestNodeDNAddrPair(DFSInputStream.java:1125) 2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | Name lookup for XXX.XXX.XXX.XXX took 0 ms. | org.apache.hadoop.security.SecurityUtil.getByName(SecurityUtil.java:538) 2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | BlockReaderFactory(fileName=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306): trying to construct a BlockReaderLocal for short-circuit reads. | org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:472) 2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | Address /XXX.XXX.XXX.XXX:9866 is local | org.apache.hadoop.hdfs.DFSUtilClient.isLocalAddress(DFSUtilClient.java:490) 2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | ShortCircuitCache(0x2d57116a): found waitable for 1076149130_BP-1990332537-XXX.XXX.XXX.XXX-1575543614969 | org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetch(ShortCircuitCache.java:721) 2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | ShortCircuitReplica{key=1076149130_BP-1990332537-XXX.XXX.XXX.XXX-1575543614969, metaHeader.version=1, metaHeader.checksum=DataChecksum(type=CRC32C, chunkSize=512), ident=0x7a471444, creationTimeMs=10177189191}: checked shared memory segment. isStale=false | org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.isStale(ShortCircuitReplica.java:156) 2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | ShortCircuitCache(0x2d57116a): evictable no longer contains ShortCircuitReplica{key=1076149130_BP-1990332537-XXX.XXX.XXX.XXX-1575543614969, metaHeader.version=1, metaHeader.checksum=DataChecksum(type=CRC32C, chunkSize=512), ident=0x7a471444, creationTimeMs=10177189191}. refCount 1 -> 2java.lang.Thread.getStackTrace(Thread.java:1559) 2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | BlockReaderFactory(fileName=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306): returning new block reader local. | org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:369) 2020-04-01 13:52:56,764 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | read(arr.length=5184, off=0, len=5184, filename=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306, canSkipChecksum=true): starting | org.apache.hadoop.hdfs.client.impl.BlockReaderLocal.read(BlockReaderLocal.java:552) 2020-04-01 13:56:59,529 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | loaded 130973 bytes into bounce buffer from offset 92807267 of BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306 | org.apache.hadoop.hdfs.client.impl.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:477) 2020-04-01 13:56:59,529 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | read(arr.length=5184, off=0, len=5184, filename=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306, canSkipChecksum=true): returning 5184 | org.apache.hadoop.hdfs.client.impl.BlockReaderLocal.read(BlockReaderLocal.java:565) 2020-04-01 13:56:59,529 | TRACE | RpcServer.FifoWFPBQ.default.handler=19,queue=19,port=16020 | close(filename=/hbase/data/default/effect_device_to_gid_android/415e7569c57a4d5fd63f174917f5c72b/p/c576a827774941398165bd51255683dc, block=BP-1990332537-XXX.XXX.XXX.XXX-1575543614969:blk_1076149130_2408306) | org.apache.hadoop.hdfs.client.impl.BlockReaderLocal.close(BlockReaderLocal.java:626) 4,后续定位就是在HDFS处读取block慢,这块排查HDFS,后发现这RS节点对应的其中一个本地磁盘坏了,需要退服DataNode更换磁盘 |
【版权声明】本文为华为云社区用户转载文章,如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱:
cloudbbs@huaweicloud.com
- 点赞
- 收藏
- 关注作者
评论(0)