GaussDB Cluster Troubleshooting Index
1 Principles
1.1 Communication principles => https://bbs.huaweicloud.com/blogs/239971
1.2 Communication views => https://bbs.huaweicloud.com/blogs/247543
1.3 Resource workload management => https://bbs.huaweicloud.com/blogs/239960
1.4 Cluster management (CM) => https://bbs.huaweicloud.com/blogs/244355
1.5 Cluster management (CM) => https://bbs.huaweicloud.com/blogs/224005
1.6 CMS communication mechanism => https://bbs.huaweicloud.com/blogs/241853
1.7 LVS fundamentals => https://bbs.huaweicloud.com/blogs/238621
1.8 CPU resource management => https://bbs.huaweicloud.com/blogs/237550
2 Connections
2.1 JDBC connection errors => https://bbs.huaweicloud.com/blogs/244348
Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections |
Invalid username/password, login denied |
No suitable driver found for XXXX |
No pg_hba.conf entry for host |
conflict |
Terminating connection due to administrator command, Session unused timeout |
SSL error: Connection reset |
Connection refused: connect |
Connections could not be acquired from the underlying database |
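When sifting through client logs, it can help to count how often each of the error signatures above appears. The following is a minimal sketch, not part of the linked article: the helper names are hypothetical, the signatures are copied from the list above, and the bare "conflict" entry is left out because it is too generic to match safely.

```python
# Error signatures copied from the 2.1 list above.
# "conflict" is omitted on purpose: it would match far too many unrelated lines.
JDBC_ERROR_SIGNATURES = [
    "Check that the hostname and port are correct",
    "Invalid username/password",
    "No suitable driver found",
    "No pg_hba.conf entry for host",
    "Terminating connection due to administrator command",
    "Session unused timeout",
    "SSL error: Connection reset",
    "Connection refused",
    "Connections could not be acquired from the underlying database",
]


def match_signatures(log_line):
    """Return every known signature that occurs in log_line (case-insensitive)."""
    lowered = log_line.lower()
    return [sig for sig in JDBC_ERROR_SIGNATURES if sig.lower() in lowered]


def count_by_signature(log_lines):
    """Count how many lines contain each signature; lines without a match are ignored."""
    counts = {}
    for line in log_lines:
        for sig in match_signatures(line):
            counts[sig] = counts.get(sig, 0) + 1
    return counts
```

Feeding the lines of an application log into `count_by_signature` gives a quick picture of which subsection of the linked article to read first.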
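2.2 LVS anomalies => https://bbs.huaweicloud.com/blogs/247267 || https://bbs.huaweicloud.com/blogs/244340
Installation fails |
Installation reports insufficient permission to write a file |
ipvsadm -Ln shows no CN information |
gsql client connection reports an error |
Client connections through LVS are not distributed round-robin |
Client cannot connect to the CN through the virtual IP |
Uninstalling LVS causes the machine to reboot |
Checking for virtual_router_id conflicts |
Floating IP lost after a machine reboot, so the CN fails to start |
2.3 Disconnections => https://bbs.huaweicloud.com/blogs/239471 || http://3ms.huawei.com/km/blogs/details/8697907 || https://bbs.huaweicloud.com/blogs/205970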
Too many clients already, active/non_active: xxxx/xxxx |
An I/O error occurred while sending to the backend |
Client connections to the CN take a long time |
Kerberos authentication fails |
Connection errors inside the cluster |
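The "Too many clients already" message in the list above carries two counters. The sketch below pulls them out of a log line; it assumes the message follows exactly the `active/non_active: xxxx/xxxx` layout shown above with decimal numbers in place of `xxxx`, and the function name is hypothetical.

```python
import re

# Assumed layout, taken from the placeholder form in the 2.3 list:
#   Too many clients already, active/non_active: <int>/<int>
_TOO_MANY_CLIENTS = re.compile(
    r"too many clients already, active/non_active:\s*(\d+)/(\d+)",
    re.IGNORECASE,
)


def parse_too_many_clients(message):
    """Return the two counters and the non-active share, or None if no match."""
    match = _TOO_MANY_CLIENTS.search(message)
    if match is None:
        return None
    active = int(match.group(1))
    non_active = int(match.group(2))
    total = active + non_active
    # Share of connections reported as non_active; 0.0 when both counters are zero.
    share = non_active / total if total else 0.0
    return {"active": active, "non_active": non_active, "non_active_share": share}
```

A high `non_active_share` would suggest looking at how the application holds its connections before raising any limit, though that reading is an assumption rather than something the list above states.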
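3 Network
3.1 Retransmission or packet loss => https://bbs.huaweicloud.com/blogs/235237
3.2 Communication anomalies => http://3ms.huawei.com/km/blogs/details/2431967?l=zh-cn
Cluster abnormal -> environment abnormal -> environment issues: firewall, MTU, NIC hardening, etc. |
Cluster abnormal -> environment normal -> configuration issues: listening port, bind address, permissions, etc. |
Cluster normal -> intermittent failure -> Core, OS, insufficient memory, NIC failure, LVS, etc. |
Cluster normal -> persistent failure -> deadlock, node abnormality, connection limit reached, etc. |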
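The four branches above form a small decision tree: first ask whether the cluster is healthy, then ask one follow-up question. A minimal sketch of that tree follows; the function and argument names are hypothetical, and the returned strings simply restate the branches from the list.

```python
def triage(cluster_normal, environment_normal=None, persistent=None):
    """Walk the two-level decision tree from section 3.2.

    cluster_normal:     is the cluster status healthy?
    environment_normal: only consulted when the cluster is abnormal.
    persistent:         only consulted when the cluster is normal.
    """
    if not cluster_normal:
        if environment_normal is None:
            raise ValueError("environment_normal is required when the cluster is abnormal")
        if environment_normal:
            return "configuration issue: listening port, bind address, permissions"
        return "environment issue: firewall, MTU, NIC hardening"
    if persistent is None:
        raise ValueError("persistent is required when the cluster is normal")
    if persistent:
        return "persistent failure: deadlock, node abnormality, connection limit reached"
    return "intermittent failure: Core, OS, insufficient memory, NIC failure, LVS"
```

For example, `triage(False, environment_normal=True)` points at the configuration branch, so listening ports and bind addresses would be checked before anything else.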
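3.3 Communication performance => https://bbs.huaweicloud.com/blogs/248843
NIC multi-queue |
Network traffic |
Communication library memory |
System calls |
4 Resources
4.1 Resource management configuration => https://bbs.huaweicloud.com/blogs/244671
Invalid service name, unknown internal exception, insufficient CPU quota, etc. |
Problems between oms and the primary CMS node |
Tenant creation fails and the background log reports insufficient permissions |
Tenant creation fails and the log reports a failure to modify the resource pool |
4.2 Memory anomalies => https://bbs.huaweicloud.com/forum/thread-110215-1-1.html || https://bbs.huaweicloud.com/forum/thread-82838-1-1.html || https://bbs.huaweicloud.com/forum/thread-85225-1-1.html || https://bbs.huaweicloud.com/forum/thread-94896-1-1.html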
memory temporarily unavailable |
4.3 CPU anomalies => https://bbs.huaweicloud.com/forum/thread-76364-1-1.html || https://bbs.huaweicloud.com/forum/thread-79937-1-1.html || https://bbs.huaweicloud.com/forum/thread-70297-1-1.html || https://bbs.huaweicloud.com/forum/thread-73291-1-1.html
CPU usage exceeds the threshold |
Multi-tenant CPU resource management |
5 Parameters
5.1 Communication parameters => https://bbs.huaweicloud.com/blogs/239863
tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count |
comm_max_datanode, comm_max_stream, comm_max_receiver |
enable_stateless_pooler_reuse, comm_cn_dn_logic_conn |
comm_quota_size, comm_usable_memory |
net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, net.ipv4.tcp_max_tw_buckets |
net.ipv4.tcp_syn_retries, net.ipv4.tcp_synack_retries |
net.ipv4.tcp_retries1, net.ipv4.tcp_retries2 |
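Before tuning any of the `net.ipv4.*` parameters above, it is useful to record their current values on each node. The sketch below reads them from `/proc/sys` on Linux; the helper names are hypothetical. Two assumptions are worth stating: the Linux kernel exposes the retry thresholds as `tcp_retries1` and `tcp_retries2`, and `tcp_tw_recycle` was removed in Linux 4.12, so a missing file is reported as `None` rather than treated as an error.

```python
from pathlib import Path

# Kernel parameters from the 5.1 list. Linux names the retry thresholds
# tcp_retries1 and tcp_retries2.
KERNEL_PARAMS = [
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",  # absent on Linux 4.12 and later
    "net.ipv4.tcp_max_tw_buckets",
    "net.ipv4.tcp_syn_retries",
    "net.ipv4.tcp_synack_retries",
    "net.ipv4.tcp_retries1",
    "net.ipv4.tcp_retries2",
]


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under root (dots become directories)."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


def report(names=KERNEL_PARAMS, root="/proc/sys"):
    """Collect the current value of every listed parameter."""
    return {name: read_sysctl(name, root) for name in names}


if __name__ == "__main__":
    for name, value in report().items():
        shown = value if value is not None else "<not available>"
        print(f"{name} = {shown}")
```

The `root` argument exists only so the helpers can be pointed at a copy of `/proc/sys` collected from another node. The database-level parameters in the same list (`tcp_keepalives_*`, `comm_*`) are not kernel settings and are not covered by this sketch.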
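6 Tools
6.1 Network traffic, retransmission, and packet loss => gsar.sh
6.2 Client connection status monitoring => clients.py
6.3 Network throughput testing => speed_test_x86.sh/speed_test_arm
6.4 Querying and setting NIC multi-queue => get_irq_affinity.sh/set_irq_affinity.sh
6.5 Network monitoring => network_monitor.py
6.6 General statement monitoring => general.sh
7 Summary
To be continued => additions are welcome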