postgresql事务死锁和连接池数不够导致node项目假死

  |  

出现原因

客户量起来,创建订单越来越多,因为之前是用 typeOrm 的默认配置,看了源码,发现默认配置是 10,所以导致客户请求不了。后来解决了连接池问题,又出现数据库死锁,导致数据库 cpu 直接 100%,又导致客户订单创建不了,产生生产损失。

解决

线索池问题

  1. 排查
    当时出现 node 程序假死,但是数据库,redis 都是正常,后面排查连接数是否正常,发现连接数很少,当时很懵,不知道为啥,后面想到 typeorm 连接池配置,发现了问题。

修改 typeorm 配置

1
2
3
4
5
6
keepAlive: true,
extra: {
poolSize: 800,
mixins: 0,
connectTimeoutMS: 60000,
},
  1. 查询数据库允许的最大连接数
1
show max_connections;
  1. 查看当前连接数
1
2
3
4
5
6
-- 分组求和
select count(*), usename from pg_stat_activity group by usename;


-- 总数
select count(*) from pg_stat_activity

数据库死锁问题

  1. 排查
    也是因为客户量增加,交易频繁,导致大面积数据库死锁,数据库 cpu 100%,影响客户使用,后面百度搜索出现死锁原因,发现原来表加锁顺序问题。

  2. 什么是数据库死锁
    在操作系统领域当中,死锁指的是两个或者两个以上的进程在运行的过程中,因为争夺共同的访问资源而相互等待阻塞,最终造成阻碍进程继续执行的一种阻塞现象。那么在数据库领域当中死锁又是怎样的表现形式呢?

如下图所示,假设事务 A 持有行 1 的共享锁,事务 B 持有行 2 的共享锁,那么此时事务 A 请求持有行 2 的排他锁,那么在事务 B 释放资源之前都处于阻塞等待的状态,同样的事务 B 请求持有行 1 的排他锁,在事务 A 释放资源之前同样也是处于阻塞等待的状态。也就是说事务 B 完成之后事务 A 才能完成,而事务 A 的完成又依赖于事务 B 的完成,这就形成了循环依赖的问题,最终导致死锁情况的发生。

测试 sql 如下(按照步骤执行到第三步就会出现死锁):
sql1:

1
2
3
4
5
-- 执行第一步
BEGIN;
SELECT * from "user" where "id" = '03b44ddb-0813-40c2-bbba-5ca51fc2c57f' FOR UPDATE;
-- 执行第三步
SELECT * from "user" where "id" = 'bfdf2d04-d330-4134-8057-7b77ceaaf8d0' FOR UPDATE;

sql2:

1
2
3
4
5
-- 执行第二步
BEGIN;
SELECT * from "user" where "id" = 'bfdf2d04-d330-4134-8057-7b77ceaaf8d0' FOR UPDATE;
-- 执行第四步
SELECT * from "user" where "id" = '03b44ddb-0813-40c2-bbba-5ca51fc2c57f' FOR UPDATE;
  1. 死锁排查

创建视图用于查询死锁

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
create view v_locks_monitor as
with
t_wait as
(
select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.granted,
a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a.transactionid,a.fastpath,
b.state,b.query,b.xact_start,b.query_start,b.usename,b.datname,b.client_addr,b.client_port,b.application_name
from pg_locks a,pg_stat_activity b where a.pid=b.pid and not a.granted
),
t_run as
(
select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.granted,
a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a.transactionid,a.fastpath,
b.state,b.query,b.xact_start,b.query_start,b.usename,b.datname,b.client_addr,b.client_port,b.application_name
from pg_locks a,pg_stat_activity b where a.pid=b.pid and a.granted
),
t_overlap as
(
select r.* from t_wait w join t_run r on
(
r.locktype is not distinct from w.locktype and
r.database is not distinct from w.database and
r.relation is not distinct from w.relation and
r.page is not distinct from w.page and
r.tuple is not distinct from w.tuple and
r.virtualxid is not distinct from w.virtualxid and
r.transactionid is not distinct from w.transactionid and
r.classid is not distinct from w.classid and
r.objid is not distinct from w.objid and
r.objsubid is not distinct from w.objsubid and
r.pid <> w.pid
)
),
t_unionall as
(
select r.* from t_overlap r
union all
select w.* from t_wait w
)
select locktype,datname,relation::regclass,page,tuple,virtualxid,transactionid::text,classid::regclass,objid,objsubid,
string_agg(
'Pid: '||case when pid is null then 'NULL' else pid::text end||chr(10)||
'Lock_Granted: '||case when granted is null then 'NULL' else granted::text end||' , Mode: '||case when mode is null then 'NULL' else mode::text end||' , FastPath: '||case when fastpath is null then 'NULL' else fastpath::text end||' , VirtualTransaction: '||case when virtualtransaction is null then 'NULL' else virtualtransaction::text end||' , Session_State: '||case when state is null then 'NULL' else state::text end||chr(10)||
'Username: '||case when usename is null then 'NULL' else usename::text end||' , Database: '||case when datname is null then 'NULL' else datname::text end||' , Client_Addr: '||case when client_addr is null then 'NULL' else client_addr::text end||' , Client_Port: '||case when client_port is null then 'NULL' else client_port::text end||' , Application_Name: '||case when application_name is null then 'NULL' else application_name::text end||chr(10)||
'Xact_Start: '||case when xact_start is null then 'NULL' else xact_start::text end||' , Query_Start: '||case when query_start is null then 'NULL' else query_start::text end||' , Xact_Elapse: '||case when (now()-xact_start) is null then 'NULL' else (now()-xact_start)::text end||' , Query_Elapse: '||case when (now()-query_start) is null then 'NULL' else (now()-query_start)::text end||chr(10)||
'SQL (Current SQL in Transaction): '||chr(10)||
case when query is null then 'NULL' else query::text end,
chr(10)||'--------'||chr(10)
order by
( case mode
when 'INVALID' then 0
when 'AccessShareLock' then 1
when 'RowShareLock' then 2
when 'RowExclusiveLock' then 3
when 'ShareUpdateExclusiveLock' then 4
when 'ShareLock' then 5
when 'ShareRowExclusiveLock' then 6
when 'ExclusiveLock' then 7
when 'AccessExclusiveLock' then 8
else 0
end ) desc,
(case when granted then 0 else 1 end)
) as lock_conflict
from t_unionall
group by
locktype,datname,relation,page,tuple,virtualxid,transactionid::text,classid,objid,objsubid ;

执行视图,会出现死锁信息,如下图

1
SELECT * FROM v_locks_monitor

在这里插入图片描述

找到”被阻塞者”,找到 pid,如下图

1
select distinct pid from pg_locks where not granted;

在这里插入图片描述

找到”阻塞者”,可以找到执行的 sql,如下图

1
select * from pg_stat_activity where pid=5432;

在这里插入图片描述

杀死当前死锁

1
select pg_terminate_backend(5432);

修复

修改正常加锁的顺序之后,就没有出现当前问题了。

文档

PostgreSQL 死锁了怎么办?

文章目录
  1. 1. 出现原因
  2. 2. 解决
    1. 2.1. 线索池问题
    2. 2.2. 数据库死锁问题
  3. 3. 修复
  4. 4. 文档