
Deploying a PostgreSQL Streaming Replication Architecture with repmgr - 01

1 repmgr Overview

1.1 Introduction

Official website: repmgr - Replication Manager for PostgreSQL clusters

repmgr is an open-source tool suite developed by 2ndQuadrant for managing replication and failover in PostgreSQL server clusters. Originally built to simplify the management of streaming replicas, it has since grown into a complete failover-management suite: it enhances PostgreSQL's built-in hot-standby capability with tools for setting up standby servers, monitoring replication, and performing administrative tasks such as failover or manual switchover.

In role it is comparable to the well-known Oracle Active Data Guard. It is powerful, simple to install and configure, and highly controllable.

A witness node is an effective way to prevent split-brain.

The witness node is an ordinary database instance that is not part of the streaming replication itself: its role, when a failover situation arises, is to provide evidence that the primary node itself is unavailable, rather than the symptom being a split-brain caused by a partition between different physical networks.

The witness service is created on the same network as the primary. If the primary becomes unreachable, a standby can then decide whether it can promote itself without risking split-brain: if the standby has lost contact with both the primary and the witness, a network partition on the standby's side is likely, and it should not promote itself; if the standby can still reach the witness but not the primary, that indicates the primary itself is down rather than the network, so the standby can safely become the new primary.
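
This decision rule can be sketched as a small shell function. This is purely illustrative — repmgrd implements the logic internally; the function name and its "yes"/"no" inputs are hypothetical:

```shell
# Decide whether a standby may promote itself, given reachability flags.
# Mirrors the witness logic described above; inputs are "yes"/"no" strings.
should_promote() {
  local primary_reachable="$1" witness_reachable="$2"
  if [ "$primary_reachable" = "yes" ]; then
    echo "no-promote: primary is alive"
  elif [ "$witness_reachable" = "yes" ]; then
    # The witness (on the primary's network) is reachable but the primary is
    # not: the primary itself is down, so promotion is safe.
    echo "promote"
  else
    # Neither the primary nor the witness is reachable: this standby is
    # probably the partitioned node, so promoting would risk split-brain.
    echo "no-promote: possible network partition"
  fi
}

should_promote no yes   # primary down, witness visible -> promote
should_promote no no    # both unreachable -> stay standby
```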

1.2 Features

repmgr is very lightweight, yet functionally complete:

  • Failure detection and automatic failover: repmgr can detect a primary failure and automatically fail over to a standby server.
  • Automatic recovery: repmgr can detect a failed standby server and rejoin it to the replication topology.
  • Multiple standbys: repmgr supports multiple standby servers and, when the primary fails, automatically promotes the most suitable one.
  • Flexible replication topologies: repmgr supports a variety of topologies, including cascading standby servers.
  • Management and monitoring: repmgr provides a range of tools and commands for managing and monitoring PostgreSQL replication.

In short, repmgr is an extension that simplifies the management and maintenance of PostgreSQL replication and improves the reliability and availability of the system. It is a very useful tool, especially for production environments that require high availability. repmgr is developed and maintained by 2ndQuadrant (now part of EDB).

It provides two main tools:

  1. repmgr — the command-line tool for administrative tasks: setting up standby servers, promoting a standby to primary, switching over the primary and a standby, and showing the status of the servers in the replication cluster.
  2. repmgrd — a daemon that actively monitors the servers in the replication cluster: it monitors and records replication performance, performs failover by detecting failure of the primary and promoting the most suitable standby, and provides notifications about cluster events to user-defined scripts, which can perform tasks such as sending alerts by email.

1.3 Version Compatibility

Official link: https://www.repmgr.org/docs/current/install-requirements.html#INSTALL-COMPATIBILITY-MATRIX

| repmgr version | Supported? | Latest release     | Supported PostgreSQL versions         | Notes                                     |
|----------------|------------|--------------------|---------------------------------------|-------------------------------------------|
| repmgr 5.4     | YES        | 5.4.1 (2023-07-04) | 10, 11, 12, 13, 14, 15, 16            |                                           |
| repmgr 5.3     | YES        | 5.3.3              | 9.4, 9.5, 9.6, 10, 11, 12, 13, 14, 15 | PostgreSQL 15 supported from repmgr 5.3.3 |
| repmgr 5.2     | NO         | 5.2.1 (2020-12-07) | 9.4, 9.5, 9.6, 10, 11, 12, 13         |                                           |
| repmgr 5.1     | NO         | 5.1.0 (2020-04-13) | 9.3, 9.4, 9.5, 9.6, 10, 11, 12        |                                           |
| repmgr 5.0     | NO         | 5.0 (2019-10-15)   | 9.3, 9.4, 9.5, 9.6, 10, 11, 12        |                                           |
| repmgr 4.x     | NO         | 4.4 (2019-06-27)   | 9.3, 9.4, 9.5, 9.6, 10, 11            |                                           |
| repmgr 3.x     | NO         | 3.3.2 (2017-05-30) | 9.3, 9.4, 9.5, 9.6                    |                                           |
| repmgr 2.x     | NO         | 2.0.3 (2015-04-16) | 9.0, 9.1, 9.2, 9.3, 9.4               |                                           |

2 PostgreSQL + repmgr High-Availability Deployment

2.1 Environment

| Node    | Configuration     | Address        | Deployed software  |
|---------|-------------------|----------------|--------------------|
| master  | CentOS 7.9, 4c/8G | 192.168.16.220 | PostgreSQL, repmgr |
| slave1  | CentOS 7.9, 4c/8G | 192.168.16.221 | PostgreSQL, repmgr |
| slave2  | CentOS 7.9, 4c/8G | 192.168.16.222 | PostgreSQL, repmgr |
| witness | CentOS 7.9, 4c/8G | 192.168.16.223 | PostgreSQL, repmgr |

2.2 PostgreSQL Deployment

2.2.1 Environment Preparation

Run on all four nodes:

  # Install dependencies
  yum -y install wget flex libselinux-devel readline-devel zlib zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python python-devel tcl-devel systemd-devel pcre-devel gcc gcc-c++ make tree psmisc
  yum -y groupinstall "Development Tools"
  # Create the postgres user
  groupadd -g 5432 postgres
  useradd -u 5432 -g postgres postgres
  echo "2lp7VaR9um6g" | passwd --stdin postgres
  # Create the installation directories
  mkdir -p /opt/pg/soft          # software packages
  mkdir -p /opt/pg/pg12/12.15    # PGHOME
  mkdir /opt/pg/pgdata           # PGDATA
  mkdir /opt/pg/pgwal            # WAL files
  mkdir /opt/pg/pgarch           # archived WAL files
  mkdir /opt/pg/pglog            # PostgreSQL log files
  chown -R postgres:postgres /opt/pg/
  chmod 0700 /opt/pg/pgdata /opt/pg/pgwal /opt/pg/pgarch
  # Symlink the version directory to make future software upgrades easier
  ln -s /opt/pg/pg12 /opt/pg/pgsql
  # Add host mappings
  vi /etc/hosts
  192.168.16.220 master
  192.168.16.221 slave1
  192.168.16.222 slave2
  192.168.16.223 witness
  # Set up passwordless SSH between nodes
  ssh-keygen -t rsa
  for i in 192.168.16.220 192.168.16.221 192.168.16.222 192.168.16.223;do ssh-copy-id -i $i;done
  # Repeat as the postgres user: node switchover later connects over SSH as postgres, and failover will fail without it
  su - postgres
  ssh-keygen -t rsa
  for i in 192.168.16.220 192.168.16.221 192.168.16.222 192.168.16.223;do ssh-copy-id -i $i;done
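
The host mappings above can also be generated from a single node list, so all four nodes stay consistent. A sketch — the temp file stands in for /etc/hosts:

```shell
# Name/address pairs for the four nodes in this deployment.
nodes="master:192.168.16.220 slave1:192.168.16.221 slave2:192.168.16.222 witness:192.168.16.223"

hosts_file=$(mktemp)   # stand-in for /etc/hosts
for n in $nodes; do
  name=${n%%:*}   # part before the colon
  addr=${n##*:}   # part after the colon
  echo "$addr $name" >> "$hosts_file"
done
cat "$hosts_file"
```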

2.2.2 OS Initialization of the Database Nodes

Run on all four nodes.

Adjust the following kernel parameters to your environment:

  cat >> /etc/sysctl.conf << EOF
  #for postgres db 12.15
  kernel.shmall = 966327 # expr `free |grep Mem|awk '{print $2 *1024}'` / `getconf PAGE_SIZE`
  kernel.shmmax = 3958075392 # free |grep Mem|awk '{print $2 *1024}'
  kernel.shmmni = 4096
  kernel.sem = 50100 64128000 50100 1280
  fs.file-max = 76724200
  net.ipv4.ip_local_port_range = 9000 65000
  net.core.rmem_default = 1048576
  net.core.rmem_max = 4194304
  net.core.wmem_default = 262144
  net.core.wmem_max = 1048576
  net.ipv4.tcp_tw_recycle = 1
  net.ipv4.tcp_max_syn_backlog = 4096
  net.core.netdev_max_backlog = 10000
  vm.overcommit_memory = 0
  fs.aio-max-nr = 40960000
  net.ipv4.tcp_timestamps = 0
  vm.dirty_ratio=20
  vm.dirty_background_ratio=3
  vm.dirty_writeback_centisecs=100
  vm.dirty_expire_centisecs=500
  vm.swappiness=10
  vm.min_free_kbytes=524288
  EOF
  sysctl -p # apply the new settings
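
The commented-out expressions next to kernel.shmall and kernel.shmmax can be computed on each host rather than hard-coded. A sketch of the arithmetic those comments describe (assumes a Linux host with `free` available):

```shell
# Total memory in bytes and the system page size, matching the comments above.
mem_bytes=$(free -b | awk '/^Mem/ {print $2}')
page_size=$(getconf PAGE_SIZE)

shmmax=$mem_bytes                     # kernel.shmmax: largest shared memory segment, in bytes
shmall=$(( mem_bytes / page_size ))   # kernel.shmall: total shared memory, in pages

echo "kernel.shmmax = $shmmax"
echo "kernel.shmall = $shmall"
```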

Resource limits:

  cat >> /etc/security/limits.conf << EOF
  #for postgres db 12.15
  * soft nofile 1048576
  * hard nofile 1048576
  * soft nproc unlimited
  * hard nproc unlimited
  * soft core unlimited
  * hard core unlimited
  * soft memlock unlimited
  * hard memlock unlimited
  EOF
  cat >> /etc/pam.d/login << EOF
  #for postgres db 12.15
  session required pam_limits.so
  EOF

Environment variables:

  cat >> /etc/profile << EOF
  #for postgres db 12.15
  export LANG=en_US.utf8
  export PGHOME=/opt/pg/pgsql/12.15/
  export PGUSER=postgres
  export PGPORT=5432
  export PGDATA=/opt/pg/pgdata
  export PATH=\$PGHOME/bin:\$PATH:\$HOME/bin
  export LD_LIBRARY_PATH=\$PGHOME/lib:\$LD_LIBRARY_PATH
  EOF
  source /etc/profile # load the new variables

2.2.3 Installing PostgreSQL

Official download link: PostgreSQL: File Browser

Package: postgresql-12.15.tar.gz

1) Compile and install the database (all four nodes)

  # Unpack the source
  cd /opt/pg/soft/
  tar -xzvf postgresql-12.15.tar.gz
  # Configure and build (--enable-debug/--enable-cassert/CFLAGS=-O0 are debug settings; drop them for production builds)
  cd postgresql-12.15/
  ./configure --prefix=/opt/pg/pgsql/12.15/ --enable-debug --enable-cassert --enable-depend CFLAGS=-O0 # --with-pgport=6000
  make -j 4 && make install
  # Build the contrib tool set
  cd /opt/pg/soft/postgresql-12.15/contrib
  make -j 4 && make install
  # Check the version to confirm the installation succeeded
  postgres --version # postgres (PostgreSQL) 12.15

2) Initialize the database

Run on the master and witness nodes:

  su - postgres
  # Since PostgreSQL 11, initdb can set the WAL segment size: --wal-segsize=32 (in MB)
  $ initdb --pgdata=/opt/pg/pgdata --waldir=/opt/pg/pgwal --encoding=UTF8 --allow-group-access --data-checksums --username=postgres --pwprompt --wal-segsize=32

3) Configure database parameters

Run on the master and witness nodes:

  vi $PGDATA/postgresql.conf
  listen_addresses = '*'
  max_wal_senders = 10
  max_replication_slots = 10
  wal_level = replica
  hot_standby = on
  wal_log_hints = on
  full_page_writes = on
  shared_preload_libraries = 'repmgr'
  wal_keep_segments = 100
  archive_mode = on
  archive_command = 'test ! -f /opt/pg/pgarch/%f && cp %p /opt/pg/pgarch/%f'
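
The archive_command above refuses to overwrite an existing archive file (the `test ! -f` guard makes a retried archive safe). Its behaviour can be checked standalone, with throwaway temp directories standing in for the WAL directory and /opt/pg/pgarch:

```shell
# Throwaway directories standing in for the WAL directory and /opt/pg/pgarch.
wal_dir=$(mktemp -d)
arch_dir=$(mktemp -d)

echo "wal-data" > "$wal_dir/000000010000000000000001"

# Same shape as the archive_command: $1 = source path (%p), $2 = file name (%f).
archive_one() {
  test ! -f "$arch_dir/$2" && cp "$1" "$arch_dir/$2"
}

archive_one "$wal_dir/000000010000000000000001" "000000010000000000000001" \
  && echo "first archive: ok"
archive_one "$wal_dir/000000010000000000000001" "000000010000000000000001" \
  || echo "second archive: refused, file already archived"
```

The non-zero exit on the second call is what tells PostgreSQL the segment was not archived again, so an already-archived file is never clobbered.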

4) Configure the access policy

Run on the master and witness nodes:

  vi $PGDATA/pg_hba.conf
  # Add the following entry:
  host all all 192.168.16.0/24 md5

2.2.4 Installing repmgr

2.2.4.1 Installation

Run on all four nodes:

  # Download and unpack
  cd /opt/pg/soft/
  wget -c https://repmgr.org/download/repmgr-5.3.3.tar.gz
  tar -zxvf repmgr-5.3.3.tar.gz -C /opt/pg/soft/postgresql-12.15/contrib
  # Compile and install
  cd /opt/pg/soft/postgresql-12.15/contrib
  mv repmgr-5.3.3 repmgr
  cd repmgr
  yum install flex
  ./configure && make install
2.2.4.2 Configuration

master node:

vi /etc/repmgr.conf

  # Basic repmgr settings
  node_id=1
  node_name='master'
  conninfo='host=192.168.16.220 user=repmgr dbname=repmgr connect_timeout=2'
  data_directory='/opt/pg/pgdata'
  replication_user=repmgr
  replication_type=physical
  location=myrepmgr
  # repmgr logging
  log_level=INFO
  log_facility=STDERR
  log_file='/opt/pg/pglog/repmgr.log'
  log_status_interval=10
  # Witness settings; without these, a witness registered in the cluster will not influence failover
  witness_sync_interval=15
  primary_visibility_consensus=true
  # Binaries
  pg_bindir='/opt/pg/pgsql/12.15/bin'
  # Cluster failover settings
  failover='automatic'
  promote_command='/opt/pg/pgsql/12.15/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
  follow_command='/opt/pg/pgsql/12.15/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
slave1 node:

vi /etc/repmgr.conf

  # Basic repmgr settings
  node_id=2
  node_name='slave1'
  conninfo='host=192.168.16.221 user=repmgr dbname=repmgr connect_timeout=2'
  data_directory='/opt/pg/pgdata'
  replication_user=repmgr
  replication_type=physical
  location=myrepmgr
  # repmgr logging
  log_level=INFO
  log_facility=STDERR
  log_file='/opt/pg/pglog/repmgr.log'
  log_status_interval=10
  # Witness settings (used when a witness is enabled)
  witness_sync_interval=15
  primary_visibility_consensus=true
  # Binaries
  pg_bindir='/opt/pg/pgsql/12.15/bin'
  # Cluster failover settings
  failover='automatic'
  promote_command='/opt/pg/pgsql/12.15/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
  follow_command='/opt/pg/pgsql/12.15/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'

slave2 node:

vi /etc/repmgr.conf

  # Basic repmgr settings
  node_id=3
  node_name='slave2'
  conninfo='host=192.168.16.222 user=repmgr dbname=repmgr connect_timeout=2'
  data_directory='/opt/pg/pgdata'
  replication_user=repmgr
  replication_type=physical
  location=myrepmgr
  # repmgr logging
  log_level=INFO
  log_facility=STDERR
  log_file='/opt/pg/pglog/repmgr.log'
  log_status_interval=10
  # Witness settings (used when a witness is enabled)
  witness_sync_interval=15
  primary_visibility_consensus=true
  # Binaries
  pg_bindir='/opt/pg/pgsql/12.15/bin'
  # Cluster failover settings
  failover='automatic'
  promote_command='/opt/pg/pgsql/12.15/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
  follow_command='/opt/pg/pgsql/12.15/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'

witness node:

vi /etc/repmgr.conf

  # Basic repmgr settings
  node_id=4
  node_name='witness'
  conninfo='host=192.168.16.223 user=repmgr dbname=repmgr connect_timeout=2'
  data_directory='/opt/pg/pgdata'
  replication_user=repmgr
  replication_type=physical
  location=myrepmgr
  # repmgr logging
  log_level=INFO
  log_facility=STDERR
  log_file='/opt/pg/pglog/repmgr.log'
  log_status_interval=10
  # Witness settings (used when a witness is enabled)
  witness_sync_interval=15
  primary_visibility_consensus=true
  # Binaries
  pg_bindir='/opt/pg/pgsql/12.15/bin'
  # Cluster failover settings
  failover='automatic'
  promote_command='/opt/pg/pgsql/12.15/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
  follow_command='/opt/pg/pgsql/12.15/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'

Configuration references:

Building PostgreSQL high availability with repmgr - UCloud community

PostgreSQL repmgr high-availability cluster + keepalived - ITPUB blog

2.2.5 Starting the Database

Run on the master and witness nodes:

  su - postgres
  # Start the database
  pg_ctl start -w -D $PGDATA -l /opt/pg/pglog/startup.log
  # Other management commands:
  # Stop the database
  pg_ctl stop -m fast -w -D $PGDATA
  # Restart the database
  pg_ctl restart -m fast -w -D $PGDATA
  # Reload the database configuration
  pg_ctl reload -D $PGDATA
2.2.6 Creating the repmgr User

Run on the master and witness nodes (in psql):

  -- Create the database
  create database repmgr;
  -- Create the user
  create user repmgr with password 'repmgr' superuser login;
  -- Make repmgr the owner of its database
  alter database repmgr owner to repmgr;

2.2.7 Modifying pg_hba.conf

Run on the master and witness nodes:

  vi $PGDATA/pg_hba.conf
  # Under "IPv4 local connections", add:
  local repmgr repmgr trust
  host repmgr repmgr 127.0.0.1/32 trust
  host repmgr repmgr 192.168.16.0/24 trust
  # Under "replication privilege", add:
  local replication repmgr trust
  host replication repmgr 127.0.0.1/32 trust
  host replication repmgr 192.168.16.0/24 trust

Note: after editing pg_hba.conf, restart the database (a reload would be enough for pg_hba.conf alone, but the shared_preload_libraries change made earlier requires a restart):

  # Restart the database
  pg_ctl restart -m fast -w -D $PGDATA

Configure the password file

Run on all four nodes:

  su - postgres
  # Create the file used for password-less connections
  vi .pgpass
  192.168.16.220:5432:repmgr:repmgr:repmgr
  192.168.16.220:5432:replication:repmgr:repmgr
  192.168.16.221:5432:repmgr:repmgr:repmgr
  192.168.16.221:5432:replication:repmgr:repmgr
  192.168.16.222:5432:repmgr:repmgr:repmgr
  192.168.16.222:5432:replication:repmgr:repmgr
  192.168.16.223:5432:repmgr:repmgr:repmgr
  192.168.16.223:5432:replication:repmgr:repmgr
  # Restrict the permissions (libpq ignores a .pgpass with looser permissions)
  chmod 600 .pgpass
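
Since every node needs the same eight entries, the file can also be generated from the address list instead of typed by hand. A sketch — it writes to a temp file here rather than ~postgres/.pgpass:

```shell
pgpass=$(mktemp)   # stand-in for ~postgres/.pgpass

for addr in 192.168.16.220 192.168.16.221 192.168.16.222 192.168.16.223; do
  # .pgpass format is host:port:database:user:password --
  # one line for the repmgr database, one for replication connections.
  echo "$addr:5432:repmgr:repmgr:repmgr"      >> "$pgpass"
  echo "$addr:5432:replication:repmgr:repmgr" >> "$pgpass"
done
chmod 600 "$pgpass"   # libpq refuses to use the file with looser permissions

wc -l < "$pgpass"
```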

2.3 Building the repmgr Cluster

2.3.1 Registering the master Node

  # Make sure the master database is running
  # Register the master database with the cluster, then check the status
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf primary register"
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"

2.3.2 Registering the slave1 Node

  # On slave1: test connectivity and dry-run the clone of the master's data
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -h 192.168.16.220 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone --dry-run"
  # Remove any existing PostgreSQL data (the clone needs an empty data directory)
  rm -rf /opt/pg/pgdata/*
  # Clone the data from the primary
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -h 192.168.16.220 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone"
  # Start the database on slave1
  su - postgres
  pg_ctl start -w -D $PGDATA -l /opt/pg/pglog/startup.log
  # On slave1: register the slave1 database with the cluster, then check the status
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf standby register"
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"

2.3.3 Registering the slave2 Node

  # On slave2: test connectivity and dry-run the clone of the master's data
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -h 192.168.16.220 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone --dry-run"
  # Remove any existing PostgreSQL data (the clone needs an empty data directory)
  rm -rf /opt/pg/pgdata/*
  # Clone the data from the primary
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -h 192.168.16.220 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone"
  # Start the database on slave2
  su - postgres
  pg_ctl start -w -D $PGDATA -l /opt/pg/pglog/startup.log
  # On slave2: register the slave2 database with the cluster, then check the status
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf standby register"
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"

2.3.4 Registering the witness Node

  # On the witness node: register the witness database with the cluster, then check the status
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf -h 192.168.16.220 witness register"
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"

2.3.5 Starting the repmgrd Daemon

Run on all four nodes:

  # Start the daemon (enables automatic failover)
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgrd -f /etc/repmgr.conf -d -p /tmp/repmgrd.pid"
  # Stop the daemon
  REPMGRD_PID=`ps -ef | grep repmgrd|grep -v grep |awk '{print $2}'`
  kill -9 $REPMGRD_PID
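
Since repmgrd is started with `-p /tmp/repmgrd.pid`, the pid file is a more precise way to stop it than grepping ps, and plain SIGTERM lets it exit cleanly (kill -9 does not). A sketch of that pattern, demonstrated here with a placeholder `sleep` process standing in for repmgrd:

```shell
pid_file=$(mktemp)   # stand-in for /tmp/repmgrd.pid

# Stand-in for the daemon; repmgrd would be started with -d -p /tmp/repmgrd.pid.
sleep 60 &
echo $! > "$pid_file"

stop_daemon() {
  [ -s "$1" ] || { echo "no pid file: $1" >&2; return 1; }
  kill "$(cat "$1")" && rm -f "$1"   # SIGTERM, then clean up the pid file
}

stop_daemon "$pid_file" && echo "daemon stopped"
```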

2.4 Database Status

On the master node (expected state: in production):

pg_controldata |grep 'Database cluster state'

On the slave nodes (expected state: in archive recovery):

pg_controldata |grep 'Database cluster state'
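
This check can be scripted: strip the label from the state line and map it to a role. A sketch — the here-string sample stands in for real pg_controldata output, so no running cluster is needed:

```shell
# Sample line as pg_controldata prints it on a primary (illustrative stand-in).
sample='Database cluster state:               in production'

# Strip the label and the padding whitespace, keeping just the state text.
state=$(echo "$sample" | sed 's/^Database cluster state:[[:space:]]*//')

case "$state" in
  "in production")       role=primary ;;
  "in archive recovery") role=standby ;;
  *)                     role=unknown ;;
esac
echo "state: $state (role: $role)"
```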

2.5 Primary/Standby Failover

2.5.1 Verifying Primary-to-Standby Replication

  # On the primary
  psql (12.15)
  Type "help" for help.
  # Create a table
  postgres=# create table tb_1 (id int8,create_time timestamp(0) without time zone);
  CREATE TABLE
  # Insert a row
  postgres=# insert into tb_1 values (1,now());
  INSERT 0 1
  # Query the data
  postgres=# select * from tb_1;
   id |     create_time
  ----+---------------------
    1 | 2024-01-15 16:26:03
  (1 row)

  # On a standby
  psql (12.15)
  Type "help" for help.
  postgres=# select * from tb_1;
   id |     create_time
  ----+---------------------
    1 | 2024-01-15 16:26:03
  (1 row)

The standby returns the same row, so primary/standby replication is working.

2.5.2 Failover Verification

  # On the master node, stop the database to simulate a primary crash
  su - postgres
  pg_ctl stop -m fast -w -D $PGDATA

Watch the failover from slave1:

tail -fn100 /opt/pg/pglog/repmgr.log

  [2024-01-16 12:23:05] [WARNING] unable to ping "host=192.168.16.220 user=repmgr dbname=repmgr connect_timeout=2"
  [2024-01-16 12:23:05] [DETAIL] PQping() returned "PQPING_REJECT"
  [2024-01-16 12:23:05] [WARNING] unable to connect to upstream node "master" (ID: 1)
  [2024-01-16 12:23:05] [INFO] checking state of node "master" (ID: 1), 1 of 6 attempts
  [2024-01-16 12:23:05] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 12:23:05] [DETAIL] PQping() returned "PQPING_REJECT"
  [2024-01-16 12:23:05] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 12:23:15] [INFO] checking state of node "master" (ID: 1), 2 of 6 attempts
  [2024-01-16 12:23:15] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 12:23:15] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 12:23:15] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 12:23:25] [INFO] checking state of node "master" (ID: 1), 3 of 6 attempts
  [2024-01-16 12:23:25] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 12:23:25] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 12:23:25] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 12:23:35] [INFO] checking state of node "master" (ID: 1), 4 of 6 attempts
  [2024-01-16 12:23:35] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 12:23:35] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 12:23:35] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 12:23:45] [INFO] checking state of node "master" (ID: 1), 5 of 6 attempts
  [2024-01-16 12:23:45] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 12:23:45] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 12:23:45] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 12:23:55] [INFO] checking state of node "master" (ID: 1), 6 of 6 attempts
  [2024-01-16 12:23:55] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 12:23:55] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 12:23:55] [WARNING] unable to reconnect to node "master" (ID: 1) after 6 attempts
  [2024-01-16 12:23:55] [INFO] 2 active sibling nodes registered
  [2024-01-16 12:23:55] [INFO] 4 total nodes registered
  [2024-01-16 12:23:55] [INFO] primary node "master" (ID: 1) and this node have the same location ("myrepmgr")
  [2024-01-16 12:23:55] [INFO] local node's last receive lsn: 0/2A0000A0
  [2024-01-16 12:23:55] [INFO] checking state of sibling node "slave2" (ID: 3)
  [2024-01-16 12:23:55] [INFO] node "slave2" (ID: 3) reports its upstream is node 1, last seen 51 second(s) ago
  [2024-01-16 12:23:55] [INFO] standby node "slave2" (ID: 3) last saw primary node 51 second(s) ago
  [2024-01-16 12:23:55] [INFO] last receive LSN for sibling node "slave2" (ID: 3) is: 0/2A0000A0
  [2024-01-16 12:23:55] [INFO] node "slave2" (ID: 3) has same LSN as current candidate "slave1" (ID: 2)
  [2024-01-16 12:23:55] [INFO] checking state of sibling node "witness" (ID: 4)
  [2024-01-16 12:23:55] [INFO] node "witness" (ID: 4) reports its upstream is node 1, last seen 51 second(s) ago
  [2024-01-16 12:23:55] [INFO] witness node "witness" (ID: 4) last saw primary node 51 second(s) ago
  [2024-01-16 12:23:55] [INFO] visible nodes: 3; total nodes: 3; no nodes have seen the primary within the last 4 seconds
  [2024-01-16 12:23:55] [NOTICE] promotion candidate is "slave1" (ID: 2)
  [2024-01-16 12:23:55] [NOTICE] this node is the winner, will now promote itself and inform other nodes
  [2024-01-16 12:23:55] [INFO] promote_command is:
  "/opt/pg/pgsql/12.15/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
  [2024-01-16 12:23:55] [NOTICE] redirecting logging output to "/opt/pg/pglog/repmgr.log"
  [2024-01-16 12:23:55] [WARNING] 2 sibling nodes found, but option "--siblings-follow" not specified
  [2024-01-16 12:23:55] [DETAIL] these nodes will remain attached to the current primary:
  slave2 (node ID: 3)
  witness (node ID: 4, witness server)
  [2024-01-16 12:23:55] [NOTICE] promoting standby to primary
  [2024-01-16 12:23:55] [DETAIL] promoting server "slave1" (ID: 2) using pg_promote()
  [2024-01-16 12:23:55] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
  [2024-01-16 12:23:56] [NOTICE] STANDBY PROMOTE successful
  [2024-01-16 12:23:56] [DETAIL] server "slave1" (ID: 2) was successfully promoted to primary
  [2024-01-16 12:23:56] [INFO] checking state of node 2, 1 of 6 attempts
  [2024-01-16 12:23:56] [NOTICE] node 2 has recovered, reconnecting
  [2024-01-16 12:23:56] [INFO] connection to node 2 succeeded
  [2024-01-16 12:23:56] [INFO] original connection is still available
  [2024-01-16 12:23:56] [INFO] 2 followers to notify
  [2024-01-16 12:23:56] [NOTICE] notifying node "slave2" (ID: 3) to follow node 2
  [2024-01-16 12:23:56] [ERROR] unable to execute repmgr.notify_follow_primary()
  [2024-01-16 12:23:56] [DETAIL]
  FATAL: terminating connection due to administrator command
  server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
  [2024-01-16 12:23:56] [DETAIL] query text is:
  SELECT repmgr.notify_follow_primary(2)
  [2024-01-16 12:23:56] [NOTICE] notifying node "witness" (ID: 4) to follow node 2
  INFO: node 4 received notification to follow node 2
  [2024-01-16 12:23:56] [INFO] switching to primary monitoring mode
  [2024-01-16 12:23:56] [NOTICE] monitoring cluster primary "slave1" (ID: 2)
  [2024-01-16 12:24:02] [NOTICE] new standby "slave2" (ID: 3) has connected
  [2024-01-16 12:24:02] [NOTICE] new witness "witness" (ID: 4) has connected

Notes on the log:

1) After the primary went down, repmgr probed it 6 times; the primary still could not be reached.

2) The local node's last LSN received from the primary:

[2024-01-16 12:23:55] [INFO] local node's last receive lsn: 0/2A0000A0

3) slave2 has received up to the same LSN as the local node:

[2024-01-16 12:23:55] [INFO] last receive LSN for sibling node "slave2" (ID: 3) is: 0/2A0000A0

4) slave2 and the candidate slave1 are therefore at the same LSN:

[2024-01-16 12:23:55] [INFO] node "slave2" (ID: 3) has same LSN as current candidate "slave1" (ID: 2)

5) The witness node is then consulted to verify whether the primary is still alive:

[2024-01-16 12:23:55] [INFO] witness node "witness" (ID: 4) last saw primary node 51 second(s) ago

6) Final confirmation: 3 nodes are visible, and none of them has seen the primary within the last 4 seconds:

[2024-01-16 12:23:55] [INFO] visible nodes: 3; total nodes: 3; no nodes have seen the primary within the last 4 seconds

7) The candidate slave1 is promoted to primary:

[2024-01-16 12:23:55] [NOTICE] promotion candidate is "slave1" (ID: 2)

...

[2024-01-16 12:23:56] [DETAIL] server "slave1" (ID: 2) was successfully promoted to primary

8) slave2 and the witness reconnect to the new primary slave1:

[2024-01-16 12:24:02] [NOTICE] new standby "slave2" (ID: 3) has connected

[2024-01-16 12:24:02] [NOTICE] new witness "witness" (ID: 4) has connected

Check the cluster status from slave2:

su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"
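
The cluster show output is a pipe-separated table, so the current primary can be extracted in scripts. A sketch against a trimmed, illustrative sample of post-failover output (the sample text is a stand-in, not captured verbatim):

```shell
# Trimmed, illustrative "repmgr cluster show" output after the failover.
sample=' ID | Name    | Role    | Status    | Upstream
----+---------+---------+-----------+---------
 1  | master  | primary | - failed  |
 2  | slave1  | primary | * running |
 3  | slave2  | standby |   running | slave1
 4  | witness | witness | * running | slave1'

# The running primary is the row whose Role is "primary" and Status is "* running".
new_primary=$(echo "$sample" | awk -F'|' '$3 ~ /primary/ && $4 ~ /\* running/ {gsub(/ /,"",$2); print $2}')
echo "current primary: $new_primary"
```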

2.5.3 Restoring the Old master as a Standby

  # On the master node, after the original master has been repaired
  # Remove the old PostgreSQL data
  rm -rf /opt/pg/pgdata/*
  # Clone the data from the new primary (slave1)
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -h 192.168.16.221 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone"
  # Start the database on the master node
  su - postgres
  pg_ctl start -w -D $PGDATA -l /opt/pg/pglog/startup.log
  # Register master as a standby; it was registered before, so --force is needed to overwrite the old entry
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf standby register --force"
  # Check the latest cluster status
  su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"

2.5.4 How the witness Node Prevents Split-Brain After a Network Partition

Scenario: the primary (master) becomes unreachable from the standbys because of a network problem. If repmgrd on a standby cannot ping the primary, repmgr would normally conclude that the primary is down and start a failover. But the primary is in fact still running, merely unreachable; promoting a standby would leave two primaries in the architecture, i.e. split-brain. With a witness node in the architecture, repmgrd checks not only the standby-to-primary network path but also whether the witness can still see the primary. If the standbys cannot reach the primary but the witness can, repmgr concludes that the primary is alive and that the problem is the network, so no failover is triggered and split-brain is prevented.

Simulating the scenario

1) Check the current cluster status:

su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"

2) Simulate a network failure on the primary:

  # On the primary, add firewall rules that block the two standbys' repmgr traffic, simulating a network failure
  iptables -A INPUT -m iprange --src-range 192.168.16.221-192.168.16.222 -j DROP
  # Or add single rules
  iptables -A INPUT -s 192.168.16.221 -j DROP
  iptables -A INPUT -s 192.168.16.222 -j DROP

3) Watch the repmgr log on slave1:

tail -fn100 /opt/pg/pglog/repmgr.log

  [2024-01-16 16:10:06] [WARNING] unable to ping "host=192.168.16.220 user=repmgr dbname=repmgr connect_timeout=2"
  [2024-01-16 16:10:06] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 16:10:06] [WARNING] unable to connect to upstream node "master" (ID: 1)
  [2024-01-16 16:10:06] [INFO] checking state of node "master" (ID: 1), 1 of 6 attempts
  [2024-01-16 16:10:08] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 16:10:08] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 16:10:08] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 16:10:18] [INFO] checking state of node "master" (ID: 1), 2 of 6 attempts
  [2024-01-16 16:10:20] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 16:10:20] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 16:10:20] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 16:10:30] [INFO] checking state of node "master" (ID: 1), 3 of 6 attempts
  [2024-01-16 16:10:32] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 16:10:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 16:10:32] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 16:10:42] [INFO] checking state of node "master" (ID: 1), 4 of 6 attempts
  [2024-01-16 16:10:44] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 16:10:44] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 16:10:44] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 16:10:54] [INFO] checking state of node "master" (ID: 1), 5 of 6 attempts
  [2024-01-16 16:10:56] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 16:10:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 16:10:56] [INFO] sleeping up to 10 seconds until next reconnection attempt
  [2024-01-16 16:11:06] [INFO] checking state of node "master" (ID: 1), 6 of 6 attempts
  [2024-01-16 16:11:08] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 fallback_application_name=repmgr"
  [2024-01-16 16:11:08] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 16:11:08] [WARNING] unable to reconnect to node "master" (ID: 1) after 6 attempts
  [2024-01-16 16:11:08] [INFO] 2 active sibling nodes registered
  [2024-01-16 16:11:08] [INFO] 4 total nodes registered
  [2024-01-16 16:11:08] [INFO] primary node "master" (ID: 1) and this node have the same location ("myrepmgr")
  [2024-01-16 16:11:08] [INFO] local node's last receive lsn: 0/10006D68
  [2024-01-16 16:11:08] [INFO] checking state of sibling node "slave2" (ID: 3)
  [2024-01-16 16:11:08] [INFO] node "slave2" (ID: 3) reports its upstream is node 1, last seen 1 second(s) ago
  [2024-01-16 16:11:08] [NOTICE] standby node "slave2" (ID: 3) last saw primary node 1 second(s) ago, considering primary still visible
  [2024-01-16 16:11:08] [INFO] last receive LSN for sibling node "slave2" (ID: 3) is: 0/100073A0
  [2024-01-16 16:11:08] [INFO] node "slave2" (ID: 3) is ahead of current candidate "slave1" (ID: 2)
  [2024-01-16 16:11:08] [INFO] checking state of sibling node "witness" (ID: 4)
  [2024-01-16 16:11:08] [INFO] node "witness" (ID: 4) reports its upstream is node 1, last seen 1 second(s) ago
  [2024-01-16 16:11:08] [NOTICE] witness node "witness" (ID: 4) last saw primary node 1 second(s) ago, considering primary still visible
  [2024-01-16 16:11:08] [INFO] 2 nodes can see the primary
  [2024-01-16 16:11:08] [DETAIL] following nodes can see the primary:
  - node "witness" (ID: 4): 1 second(s) ago
  [2024-01-16 16:11:08] [NOTICE] cancelling failover as some nodes can still see the primary
  [2024-01-16 16:11:08] [NOTICE] election cancelled
  [2024-01-16 16:11:10] [WARNING] unable to ping "host=192.168.16.220 user=repmgr dbname=repmgr connect_timeout=2"
  [2024-01-16 16:11:10] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2024-01-16 16:11:10] [INFO] node "slave1" (ID: 2) monitoring upstream node "master" (ID: 1) in degraded state
  [2024-01-16 16:11:10] [DETAIL] waiting for upstream or another primary to reappear

Notes on the log:

i) Although the standbys cannot see the primary, the witness node still can:

[2024-01-16 16:11:08] [DETAIL] following nodes can see the primary:

- node "witness" (ID: 4): 1 second(s) ago

ii) The failover is cancelled because some nodes can still see the primary; this prevents the split-brain that a network problem on the primary would otherwise cause (the primary itself never went down):

[2024-01-16 16:11:08] [NOTICE] cancelling failover as some nodes can still see the primary

iii) The node then waits for the upstream primary to come back, or for another primary to appear:

[2024-01-16 16:11:10] [DETAIL] waiting for upstream or another primary to reappear

4) Check the cluster status from slave1:

su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"

5) Restore the network on the primary:

  # [Removing the rules]
  # 1) Delete the range rule
  iptables -D INPUT -m iprange --src-range 192.168.16.221-192.168.16.222 -j DROP
  # 2) Or delete the single rules
  iptables -D INPUT -s 192.168.16.221 -j DROP
  iptables -D INPUT -s 192.168.16.222 -j DROP
  # 3) Or delete by rule number
  iptables -L --line-numbers
  # then delete using the listed rule number
  iptables -D INPUT <id>

6) Check cluster recovery

Watch the log on a standby: tail -fn100 /opt/pg/pglog/repmgr.log

  [2024-01-16 16:36:56] [NOTICE] reconnected to upstream node "master" (ID: 1) after 330 seconds, resuming monitoring
  [2024-01-16 16:36:56] [INFO] node "slave1" (ID: 2) monitoring upstream node "master" (ID: 1) in normal state
  [2024-01-16 16:37:06] [INFO] node "slave1" (ID: 2) monitoring upstream node "master" (ID: 1) in normal state
  [2024-01-16 16:37:16] [INFO] node "slave1" (ID: 2) monitoring upstream node "master" (ID: 1) in normal state
  [2024-01-16 16:37:27] [INFO] node "slave1" (ID: 2) monitoring upstream node "master" (ID: 1) in normal state
  [2024-01-16 16:37:37] [INFO] node "slave1" (ID: 2) monitoring upstream node "master" (ID: 1) in normal state

The log shows the standby has reconnected to the primary database.

Check that the cluster status is back to normal:

su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"

2.5.5 Managing PostgreSQL with repmgr

Functional note: so far repmgr cannot start or stop the PostgreSQL instance in this setup, which is why that part was skipped earlier.

For repmgr to properly manage starting and stopping PostgreSQL, hand the database's startup process over to systemd; it can then be driven with systemctl.

Configuration as follows.

Run on the three database nodes (master, slave1, slave2):

  vi /usr/lib/systemd/system/postgres-12.service
  [Unit]
  Description=PostgreSQL 12 database server
  After=syslog.target network.target

  [Service]
  Type=forking
  TimeoutSec=120
  User=postgres
  Environment="PGHOME=/opt/pg/pgsql/12.15/"
  Environment="PGDATA=/opt/pg/pgdata"
  Environment="PGPORT=5432"
  Environment="LD_LIBRARY_PATH=/opt/pg/pgsql/12.15/lib:/usr/local/lib:/usr/local/lib64:/usr/lib64"
  ExecStart=/bin/bash -c '${PGHOME}/bin/pg_ctl start -w -D ${PGDATA} -l /opt/pg/pglog/startup.log'
  ExecStop=/bin/bash -c '${PGHOME}/bin/pg_ctl stop -m fast -w -D ${PGDATA}'
  ExecReload=/bin/bash -c '${PGHOME}/bin/pg_ctl reload -D ${PGDATA}'

  [Install]
  WantedBy=multi-user.target

Start and stop the database with systemctl:

  # Note: if the instance was started with pg_ctl, stop it first, then run:
  systemctl daemon-reload
  systemctl restart postgres-12

修改repmgr配置(每个节点都需要修改)

vi /etc/repmgr.conf

  1. #repmgr基本配置信息
  2. node_id=1
  3. node_name='master'
  4. conninfo='host=192.168.16.220 user=repmgr dbname=repmgr connect_timeout=2'
  5. data_directory='/opt/pg/pgdata'
  6. replication_user=repmgr
  7. replication_type=physical
  8. location=myrepmgr
  9. # repmgr logging
  10. log_level=INFO
  11. log_facility=STDERR
  12. log_file='/opt/pg/pglog/repmgr.log'
  13. log_status_interval=10
  14. # PostgreSQL binaries
  15. pg_bindir='/opt/pg/pgsql/12.15/bin'
  16. # Cluster failover settings
  17. failover='automatic'
  18. promote_command='/opt/pg/pgsql/12.15/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
  19. follow_command='/opt/pg/pgsql/12.15/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
  20. # PostgreSQL service control
  21. service_start_command='sudo systemctl start postgres-12'
  22. service_stop_command='sudo systemctl stop postgres-12'
  23. service_restart_command='sudo systemctl restart postgres-12'
  24. service_reload_command='sudo systemctl reload postgres-12'
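The `service_*_command` entries above invoke `sudo systemctl` as the postgres user, which only works if postgres can run those commands without a password. A sudoers sketch (file name and systemctl path are assumptions; validate with `visudo -cf` before installing):

```
# /etc/sudoers.d/postgres-repmgr (hypothetical file name)
postgres ALL=(root) NOPASSWD: /usr/bin/systemctl start postgres-12, \
    /usr/bin/systemctl stop postgres-12, \
    /usr/bin/systemctl restart postgres-12, \
    /usr/bin/systemctl reload postgres-12
```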

Inspect the service commands repmgr will use to control the PostgreSQL instance:

  1. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf node service --list-actions --action=stop"
  2. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf node service --list-actions --action=start"
  3. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf node service --list-actions --action=restart"
  4. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf node service --list-actions --action=reload"

2.5.6 Other administrative commands

  1. # Perform a switchover and check the result
  2. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf standby switchover --siblings-follow -U repmgr --verbose"
  3. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"
  4. # Make a standby follow the new primary
  5. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf standby follow"
  6. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"
  7. # Unregister a standby node
  8. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr standby unregister -f /etc/repmgr.conf"
  9. # Unregister an inactive primary node
  10. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr primary unregister -f /etc/repmgr.conf"
  11. # When the primary fails, manually promote a standby to primary
  12. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr standby promote -f /etc/repmgr.conf --siblings-follow"
  13. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf standby follow"
  14. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"
  15. # After the failed node is repaired, rejoin it to the cluster
  16. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr node rejoin -d 'host=slave2 user=repmgr dbname=repmgr' --force-rewind --verbose -f /etc/repmgr.conf"
  17. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"
  18. # Force re-registration as primary
  19. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf primary register --force"
  20. # Force re-registration as standby
  21. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf standby register --force"
  22. # Check connectivity between nodes
  23. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster matrix"
  24. # repmgrd status on each node
  25. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf service status --detail"
  26. # Cross-check connections between every pair of nodes
  27. su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster crosscheck"

| Command | Description |
| --- | --- |
| repmgr primary register | Initialise a repmgr installation and register the primary node |
| repmgr primary unregister | Unregister an inactive primary node |
| repmgr standby clone | Clone a PostgreSQL standby node from another PostgreSQL node |
| repmgr standby register | Add a standby's information to the repmgr metadata |
| repmgr standby unregister | Remove a standby's information from the repmgr metadata |
| repmgr standby promote | Promote a standby to primary |
| repmgr standby follow | Attach a running standby to a new upstream node |
| repmgr standby switchover | Promote a standby to primary and demote the existing primary to a standby |
| repmgr witness register | Add a witness node's information to the repmgr metadata |
| repmgr witness unregister | Remove a witness node's information from the repmgr metadata |
| repmgr node status | Show an overview of a node's basic information and replication status |
| repmgr node check | Perform health checks on a node from a replication perspective |
| repmgr node rejoin | Rejoin a dormant (stopped) node to the replication cluster |
| repmgr node service | Show or execute the system service command to stop/start/restart/reload/promote a node |
| repmgr cluster show | Display information about each registered node in the replication cluster |
| repmgr cluster matrix | Run repmgr cluster show on each node and summarize the output |
| repmgr cluster crosscheck | Cross-check connections between each combination of nodes |
| repmgr cluster event | Output a formatted list of cluster events |
| repmgr cluster cleanup | Purge monitoring history |
| repmgr service status | Display status information about repmgrd on every node in the cluster |
| repmgr service pause | Instruct all repmgrd instances in the cluster to pause failover operations |
| repmgr service unpause | Instruct all repmgrd instances in the cluster to resume failover operations |
| repmgr daemon start | Start the repmgrd daemon on the local node |
| repmgr daemon stop | Stop the repmgrd daemon on the local node |

2.5.7 Common errors

Reference blog post: https://www.cnblogs.com/Jeona/p/17378872.html

2.5.7.1 Error when performing a manual switchover

  1. #su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf standby switchover --siblings-follow -U repmgr --verbose"
  2. NOTICE: using provided configuration file "/etc/repmgr.conf"
  3. WARNING: following problems with command line parameters detected:
  4. database connection parameters not required when executing STANDBY SWITCHOVER
  5. NOTICE: executing switchover on node "master" (ID: 1)
  6. INFO: searching for primary node
  7. INFO: checking if node 2 is primary
  8. INFO: current primary node is 2
  9. WARNING: unable to connect to remote host "192.168.16.221" via SSH
  10. ERROR: unable to connect via SSH to host "192.168.16.221", user ""

Explanation: repmgr on the primary node cannot log in to the remote host 192.168.16.221 over SSH.

Solution

switchover requires passwordless SSH between the nodes, so set up key-based login for the postgres user:

  1. su - postgres
  2. ssh-keygen -t rsa
  3. for i in 192.168.16.220 192.168.16.221 192.168.16.222 192.168.16.223;do ssh-copy-id -i $i;done

Run the switchover again; it now succeeds:

  1. # su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf standby switchover --siblings-follow -U repmgr --verbose --force"
  2. NOTICE: using provided configuration file "/etc/repmgr.conf"
  3. WARNING: following problems with command line parameters detected:
  4. database connection parameters not required when executing STANDBY SWITCHOVER
  5. NOTICE: executing switchover on node "master" (ID: 1)
  6. INFO: searching for primary node
  7. INFO: checking if node 2 is primary
  8. INFO: current primary node is 2
  9. INFO: SSH connection to host "192.168.16.221" succeeded
  10. INFO: 2 active sibling nodes found
  11. INFO: all sibling nodes are reachable via SSH
  12. INFO: 0 pending archive files
  13. INFO: replication lag on this standby is 0 seconds
  14. NOTICE: attempting to pause repmgrd on 4 nodes
  15. NOTICE: local node "master" (ID: 1) will be promoted to primary; current primary "slave1" (ID: 2) will be demoted to standby
  16. NOTICE: stopping current primary node "slave1" (ID: 2)
  17. NOTICE: issuing CHECKPOINT on node "slave1" (ID: 2)
  18. DETAIL: executing server command "/opt/pg/pgsql/12.15/bin/pg_ctl -D '/opt/pg/pgdata' -W -m fast stop"
  19. INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
  20. INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
  21. NOTICE: current primary has been cleanly shut down at location 0/30000028
  22. NOTICE: promoting standby to primary
  23. DETAIL: promoting server "master" (ID: 1) using pg_promote()
  24. NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
  25. INFO: standby promoted to primary after 1 second(s)
  26. NOTICE: STANDBY PROMOTE successful
  27. DETAIL: server "master" (ID: 1) was successfully promoted to primary
  28. INFO: node "slave1" (ID: 2) is pingable
  29. INFO: node "slave1" (ID: 2) has attached to its upstream node
  30. NOTICE: node "master" (ID: 1) promoted to primary, node "slave1" (ID: 2) demoted to standby
  31. NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings
  32. INFO: node 4 received notification to follow node 1
  33. INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
  34. NOTICE: switchover was successful
  35. DETAIL: node "master" is now primary and node "slave1" is attached as standby
  36. NOTICE: STANDBY SWITCHOVER has completed successfully

Check the latest cluster status:

su - postgres -c "/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr.conf cluster show"

2.5.7.2 Connection refused

Error output:

  1. #/opt/pg/pgsql/12.15/bin/repmgr -f /etc/repmgr/14/repmgr.conf primary register --dry-run
  2. INFO: connecting to primary database...
  3. DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 port=5432 fallback_application_name=repmgr options=-csearch_path="
  4. ERROR: connection to database failed
  5. DETAIL:
  6. connection to server at "192.168.16.220", port 5432 failed: Connection refused
  7. Is the server running on that host and accepting TCP/IP connections?
  8. DETAIL: attempted to connect using:
  9. user=repmgr connect_timeout=2 dbname=repmgr host=192.168.16.220 port=5432 fallback_application_name=repmgr options=-csearch_path=

Solution:

Set listen_addresses = '*' in the postgresql.conf of the PostgreSQL instance, then restart it.
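To verify the effective `listen_addresses` setting without opening the file by hand, the last uncommented occurrence can be extracted with grep (in PostgreSQL, later settings win). A small sketch that demonstrates the check against a throwaway file; on a real node, point it at /opt/pg/pgdata/postgresql.conf:

```shell
#!/bin/sh
# check_listen_addresses: print the last uncommented listen_addresses
# line in the given configuration file.
check_listen_addresses() {
  grep -E '^[[:space:]]*listen_addresses' "$1" | tail -n 1
}

# Demo against a temporary file (commented-out lines are ignored).
tmp=$(mktemp)
printf "#listen_addresses = 'localhost'\nlisten_addresses = '*'\nport = 5432\n" > "$tmp"
check_listen_addresses "$tmp"   # prints: listen_addresses = '*'
rm -f "$tmp"
```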
