The error looks like this:
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
The error means that a long-running transaction or a single long SQL query on the standby conflicts with the standby's WAL apply. If the application runs long transactions or long queries on the standby while the primary keeps modifying the same rows, the standby can easily end up unable to apply its WAL.
When WAL cannot be applied because of such a conflict, the database has two strategies: delay the apply, or cancel the conflicting standby query. The related parameters shift this trade-off, each with its own cost.
vacuum_defer_cleanup_age > 0 on the primary:
Cost 1: the primary bloats, because dead row versions can only be reclaimed after a deferral of that many transactions.
Cost 2: dead versions are scanned over and over, wasting CPU in the vacuum workers (n_dead_tup stays above the vacuum threshold, so autovacuum keeps waking workers to attempt cleanup).
This is especially noticeable when the primary runs with a very small autovacuum_naptime and a very small autovacuum_vacuum_scale_factor.
Cost 3: if a large amount of garbage accumulates in the meantime, the dead versions may be reclaimed explosively once the deferral expires, generating a burst of WAL and a spike in WAL write IO.
hot_standby_feedback = on: if the standby runs a long query, or a Repeatable Read (or higher) long transaction, and the primary updates data the standby still needs or is currently reading (producing garbage), the primary retains those dead versions (similar in effect to vacuum_defer_cleanup_age).
Cost: the same as with vacuum_defer_cleanup_age > 0.
max_standby_archive_delay and max_standby_streaming_delay:
Cost: if a standby query conflicts with WAL apply (the recovery process), apply on the standby is delayed, so reads from the standby may return data that is N seconds stale.
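To check whether standby queries are actually being cancelled by recovery conflicts, the per-database conflict counters can be inspected on the standby (a minimal sketch; pg_stat_database_conflicts is a standard view that is only populated on standbys):

-- run on the standby: cumulative counts of queries cancelled per conflict type
select datname, confl_snapshot, confl_lock, confl_tablespace, confl_bufferpin, confl_deadlock
from pg_stat_database_conflicts;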
Reproducible on PG 9.4; already fixed in PG 10.
https://github.com/digoal/blog/blob/master/201704/20170410_03.md
With hot_standby_feedback enabled on the standby, a Repeatable Read (or higher) long transaction on the standby combined with updates to the table on the primary drives up CPU and IO on the primary.
| Parameter | Value | Meaning |
|---|---|---|
| autovacuum_naptime | 1 | Interval between autovacuum runs (default 60s) |
| autovacuum_vacuum_scale_factor | 0.005 | VACUUM is triggered when the number of updated or deleted tuples in a table exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * table_size; can also be set per table (a worked example follows the table) |
| autovacuum_vacuum_threshold | 50 | Used together with autovacuum_vacuum_scale_factor |
| vacuum_defer_cleanup_age | 0 | Cost 1: the primary bloats, because dead row versions are only reclaimed after a deferral of that many transactions. Cost 2: dead versions are scanned repeatedly, wasting vacuum worker CPU (n_dead_tup stays above the vacuum threshold, so autovacuum keeps waking workers). |
| hot_standby_feedback | off | Feeds information about standby queries back to the primary so that the data they are reading is not removed by changes on the primary. If the standby runs a long query or a Repeatable Read long transaction and the primary updates the rows the standby still needs (producing garbage), the primary retains those dead versions (similar in effect to vacuum_defer_cleanup_age). |
| max_standby_archive_delay | | |
| max_standby_streaming_delay | | |
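With the values above, the autovacuum trigger point for the 10-million-row test table used below works out as follows (a worked example, assuming n_live_tup ≈ 10,000,000):

autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * table_size = 50 + 0.005 * 10,000,000 = 50,050 dead tuples

so a full-table UPDATE (10,000,000 dead tuples) exceeds the threshold by far, and with autovacuum_naptime = 1 a worker is woken roughly every second.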
Standby: hot_standby_feedback=on
drop table test1;
create table test1(id int , info text, crt_time timestamp);
insert into test1 select 1,md5(random()::text),now() from generate_series(1,10000000);
-- standby
begin transaction isolation level repeatable read;
select count(*) from test1;
-- primary
update test1 set info=info;
select * from pg_stat_all_tables where relname ='test1';
The primary now shows 10 million dead tuples.
Single-core CPU usage on the primary rises by about 40%; IO rises and then slowly falls back.
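To confirm on the primary that the standby's feedback is what pins the cleanup horizon, the xmin reported back by the walsender can be checked (a minimal sketch; pg_stat_replication.backend_xmin reflects hot_standby_feedback since PG 9.4):

-- run on the primary: the feedback xmin below which vacuum cannot remove dead rows
select application_name, backend_xmin, state from pg_stat_replication;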
Lower the vacuum frequency on the primary:
autovacuum_naptime=60
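For example, this can be changed without a restart (a sketch assuming PG 9.4+ for ALTER SYSTEM; otherwise edit postgresql.conf and reload):

-- on the primary
alter system set autovacuum_naptime = '60s';
select pg_reload_conf();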
The dead tuples have still not been reclaimed, but CPU usage has come back down:
t1=# select * from pg_stat_all_tables where relname ='test';
-[ RECORD 1 ]-------+------------------------------
relid               | 32794
schemaname          | public
relname             | test
seq_scan            | 2
seq_tup_read        | 20000000
idx_scan            |
idx_tup_fetch       |
n_tup_ins           | 10000000
n_tup_upd           | 20000000
n_tup_del           | 0
n_tup_hot_upd       | 18
n_live_tup          | 10000000
n_dead_tup          | 10000000
n_mod_since_analyze | 0
last_vacuum         |
last_autovacuum     | 2018-11-14 19:53:57.14633+08
last_analyze        |
last_autoanalyze    | 2018-11-14 18:48:13.783304+08
vacuum_count        | 0
autovacuum_count    | 1805
analyze_count       | 0
autoanalyze_count   | 3
Turn hot_standby_feedback off on the standby (hot_standby_feedback=off), with the long transaction still open.
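A sketch of how to switch it off without restarting the standby (hot_standby_feedback is reloadable; ALTER SYSTEM assumes PG 9.4+):

-- on the standby
alter system set hot_standby_feedback = off;
select pg_reload_conf();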
The dead tuples are reclaimed immediately:
t1=# select * from pg_stat_all_tables where relname ='test';
-[ RECORD 1 ]-------+------------------------------
relid               | 32794
schemaname          | public
relname             | test
seq_scan            | 2
seq_tup_read        | 20000000
idx_scan            |
idx_tup_fetch       |
n_tup_ins           | 10000000
n_tup_upd           | 20000000
n_tup_del           | 0
n_tup_hot_upd       | 18
n_live_tup          | 10000000
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         |
last_autovacuum     | 2018-11-14 20:02:02.450611+08
last_analyze        |
last_autoanalyze    | 2018-11-14 18:48:13.783304+08
vacuum_count        | 0
autovacuum_count    | 1813
analyze_count       | 0
autoanalyze_count   | 3
A second reproduction, this time a relation-lock conflict:

-- primary
create table test1(id int , info text, crt_time timestamp);
insert into test1 select 1,md5(random()::text),now() from generate_series(1,1000);

-- standby
begin;
select count(*) from test1;

-- primary
delete from test1 ;

-- on the standby, ps shows the startup process waiting:
-- postgres: startup process   recovering 000000010000000800000075 waiting

-- standby
select count(*) from test1;
FATAL:  terminating connection due to conflict with recovery
DETAIL:  User was holding a relation lock for too long.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
After the DELETE on the primary, autovacuum kicked in and automatically issued a truncate:
The WAL shows:
DELETE off 28 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
DELETE off 29 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
DELETE off 30 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
DELETE off 31 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
DELETE off 32 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
DELETE off 33 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
DELETE off 34 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
DELETE off 35 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
DELETE off 36 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
DELETE off 37 KEYS_UPDATED , blkref #0: rel 1663/13212/18071 blk 9
......
TRUNCATE base/13212/18071 to 0 blocks flags 7
......
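A dump like the one above can be produced with pg_waldump on PG 10 (pg_xlogdump on 9.x); a minimal sketch, taking the segment name from the ps output below:

pg_waldump $PGDATA/pg_wal/000000010000000800000075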
The primary emitted a TRUNCATE record, which conflicts with the open transaction on the standby, so the startup process starts waiting:
...
postgres: startup process recovering 000000010000000800000075 waiting
...
postgres: wal receiver process streaming 8/75724CF8
psql -p 3004 postgres
postgres: pg10192893 postgres [local] idle in transaction
After the timeout expires, the standby transaction is cancelled with an error:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
In fact, this kind of error is easy to trigger by running DDL on the primary (any statement whose lock conflicts with a lock held inside the standby transaction); here it was a truncate triggered by autovacuum.
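If such cancellations are unacceptable, one option already discussed above is to let recovery wait longer on the standby; a sketch with assumed values (the cost is the apply delay described earlier):

-- on the standby
alter system set max_standby_streaming_delay = '300s';
alter system set max_standby_archive_delay = '300s';
select pg_reload_conf();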