【数据库使用】 记一次简单sql执行慢问题定位
一、问题背景:
delete语句执行慢,delete几千条数据,执行要用一分钟,不满足要求。
DELETE FROM schema_css.t_call_record_attendee WHERE confuuid in ('xxx','xxx','xxx');
表结构:
Column | Type | Modifiers | Storage | Stats target | Description
--------------+-------------------------+------------------------------------+----------+--------------+-------------
uuid | character varying(64) | not null | extended | |
confuuid | character varying(64) | not null | extended | |
orguuid | character varying(128) | | extended | |
logicid | character varying(128) | | extended | |
roletype | character varying(16) | default 'GUEST'::character varying | extended | |
displayname | nvarchar2(512) | | extended | |
terminaltype | character varying(32) | | extended | |
callnumber | character varying(128) | | extended | |
deptname | nvarchar2(512) | | extended | |
deptuuid | character varying(128) | | extended | |
useruuid | character varying(128) | | extended | |
attribute | character varying(4096) | | extended | |
updatetime | bigint | default 0 | plain | |
createtime | bigint | | plain | |
deleted | integer | default 0 | plain | |
Indexes:
"t_call_record_attendee_pkey" PRIMARY KEY, cbtree (uuid, confuuid) TABLESPACE pg_default
"i_t_call_record_attendee_confuuid" cbtree (confuuid) TABLESPACE pg_default
"i_t_call_record_attendee_logicid" cbtree (logicid) TABLESPACE pg_default
"i_t_call_record_attendee_useruuid" cbtree (useruuid) TABLESPACE pg_default
Has OIDs: no
Distribute By: HASH(confuuid)
Location Nodes: ALL DATANODES
Options: orientation=column, enable_hstore_opt=true, enable_binlog=on, binlog_ttl=86400, bucketnums=16384, compression=middle, colversion=2.0, enable_delta=false, enable_hstore=true, enable_turbo_store=true
二、问题定位
1、无锁冲突,无性能瓶颈,执行的是delete的语句,无法打印执行计划;topsql中记录的是fqs,也没有计划。
将delete转化为select语句执行:
select * FROM schema_css.t_call_record_attendee WHERE confuuid in ('xxx','xxx','xxx');
走FQS执行计划,执行很快;关闭fqs后,走了索引,依然很快,未出现慢的情况。
关闭fqs后,打印delete的verbose计划,和select快的计划是一致的,未发现问题。由于FQS计划相当于直连DN执行,因此直连DN打印执行计划,发现也是走了索引的快的计划,性能很快,未能复现。

2、根据等待视图,抓取语句堆栈如下:
postgres=# select * from pgxc_thread_wait_status where query_id in (select query_id from pgxc_stat_activity where query like '%T_CALL_RECORD_ATTENDEE%' and usename !='Ruby' order by query_start desc limit 1);
node_name | db_name | thread_name | query_id | tid | lwtid | ptid | tlevel | smpid | wait_status | wait_event
--------------+---------+------------------------+--------------------+-----------------+---------+------+--------+-------+----------------------------------+------------
dn_6003_6004 | mms_db | PostgreSQL JDBC Driver | 145241088536308910 | 139795727430888 | 3127403 | | 0 | 0 | none |
dn_6009_6010 | mms_db | PostgreSQL JDBC Driver | 145241088536308910 | 140270667850240 | 4116415 | | 0 | 0 | none |
dn_6011_6012 | mms_db | PostgreSQL JDBC Driver | 145241088536308910 | 139655937670760 | 4116416 | | 0 | 0 | none |
dn_6001_6002 | mms_db | PostgreSQL JDBC Driver | 145241088536308910 | 139846959760552 | 3127402 | | 0 | 0 | none |
dn_6007_6008 | mms_db | PostgreSQL JDBC Driver | 145241088536308910 | 140166692434008 | 370635 | | 0 | 0 | none |
dn_6005_6006 | mms_db | PostgreSQL JDBC Driver | 145241088536308910 | 139766594408728 | 370634 | | 0 | 0 | none |
cn_5002 | mms_db | PostgreSQL JDBC Driver | 145241088536308910 | 139761117827248 | 2735702 | | 0 | 0 | wait node(total 6): dn_6011_6012 |
3、dn抓堆栈
postgres=# SELECT * FROM gs_stack('dn_6001_6002',139846959760552);
tid | lwtid | stack
-----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------
139846959760552 | 3127402 | texteq(FunctionCallInfoData*) + 0x2
| | ExecEvalScalarArrayOp(ScalarArrayOpExprState*, ExprContext*, bool*, ExprDoneCond*) + 0x41c
| | ExecQual(List*, ExprContext*, bool) + 0x98
| | ExecResult(ResultState*) + 0x7a
| | TupleTableSlot* ExecProcNodeT<false, false, false, false, (NodeTag)201>(PlanState*) + 0x8c
| | ExecRowToVecTupleMode(RowToVecState*) + 0x51
| | VectorEngine(PlanState*) + 0x14a
| | FetchRows(VecModifyTableState*, EState*, CmdType&, PlanState*, PlanState*, JunkFilter*, RelationData*, void*, tagVecModifyTableFuncSet const*, bool) + 0x245
| | ExecVecModifyTable(VecModifyTableState*) + 0x1ed
| | VectorEngine(PlanState*) + 0x14a
| | ExecVecToRow(VecToRowState*) + 0xda
| | TupleTableSlot* ExecProcNodeT<false, false, false, false, (NodeTag)2000>(PlanState*) + 0x8c
| | standard_ExecutorRun(QueryDesc*, ScanDirection, long) + 0x649
| | ExecutorRun(QueryDesc*, ScanDirection, long) + 0x375
| | ProcessQuery(PlannedStmt*, char const*, ParamListInfoData*, _DestReceiver*, char*) + 0xc9
| | PortalRunMulti(PortalData*, bool, _DestReceiver*, _DestReceiver*, char*) + 0x287
| | PortalRun(PortalData*, long, bool, _DestReceiver*, _DestReceiver*, char*, bool) + 0x583
| | exec_execute_message(char const*, long) + 0x342
| | PostgresMain(int, char**, char const*, char const*) + 0x15e4
| | SubPostmasterMain(tag_gs_thread_args*) + 0x1209
| | MainStarterThreadFunc(void*) + 0x3f
| | ThreadStarterFunc(void*) + 0x40
| | 0x7f323d8dc67a
| | 0x7f323d95f160
| |
(1 row)
发现堆栈和快的计划对不上,快的计划走了索引,而堆栈中走了全表扫描。
4、继续对比复现语句和客户实际业务语句发下,慢的情况是PBE(Prepare-Bind-Execute 预编译)后,参数是通过变量的方式传到条件里的,而复现问题时,条件都是直接写好的。
客户的业务语句是通过JDBC绑定变量的方式执行的,形如:
DELETE FROM schema_css.t_call_record_attendee WHERE confuuid in ($1, $2, $3);
而在复现时,将其中的$1等变量内容替换成了实际常量:
DELETE FROM schema_css.t_call_record_attendee WHERE confuuid in (1234, 1235, 1236);
怀疑是PBE场景导致的执行计划差异。
使用PBE方式直连DN模拟PBE的FQS计划,发现执行计划和抓到的堆栈匹配,问题复现,确认和PBE有关。
prepare p1(text,text) as select * FROM schema_css.t_call_record_attendee a WHERE confuuid in( $1, $2);
explain verbose execute p1('SFyuC4xX3X7yQHzXqAQxv8md3MHgfoWq','MUZASKbhpDBQwK1A2faPpRzw58qRvV2Z');

关闭FQS后,在CN上执行,生成了索引计划,执行性能较快,因此可将关闭FQS作为规避方案
set enable_fast_query_shipping=off;
prepare p1(text,text) as select * FROM schema_css.t_call_record_attendee a WHERE confuuid in( $1, $2);
explain verbose execute p1('SFyuC4xX3X7yQHzXqAQxv8md3MHgfoWq','MUZASKbhpDBQwK1A2faPpRzw58qRvV2Z');

5、用户级修改参数
alter user xxx set enable_fast_query_shipping=off;
中间有一个小插曲是,用户级修改参数后,新建连接立即生效,不需要重启,但是老连接不生效,因此需要用户新建连接测试规避方案是否有效。
可以通过审计日志查看用户登入登出时间,确保在参数调整后新建连接:
select * from pgxc_query_audit('2025-08-15 00:00:00','2025-08-15 06:30:00') where session_id like '%对应的pid%' and detail_info not in ('START TRANSACTION','ROLLBACK','COMMIT') and operation_type ='login_logout' order by begintime;
修改参数生效后,验证问题得到解决。
三、问题根因
通过PBE的方式执行,array用法在PBE场景下不支持向量化,in条件会被转化为array,直接执行时,走的FQS计划,导致走了全表扫描,再走行存过滤表达式,性能差。关闭FQS后,变量在CN上被替换,替换后支持向量化,因此可以生成索引扫描计划,性能快。
规避方法:
(1)通过hint关闭fqs : /*+ set global(enable_fast_query_shipping off) */,仅影响当前语句
(2)用户级关闭参数: alter user xxx set enable_fast_query_shipping=off; 仅影响当前用户,一般不会产生其他副作用
- 点赞
- 收藏
- 关注作者
评论(0)