- repmgr -f /etc/repmgr.conf primary register
- repmgr -f /etc/repmgr.conf standby register
-
- repmgr -f /etc/repmgr.conf primary unregister -F --node-id=2
- repmgr -f /etc/repmgr.conf standby unregister
Run a check before cloning
repmgr -h 10.79.21.29 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone --dry-run
Actual run
- $repmgr -h 10.79.21.30 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone
- NOTICE: destination directory "/home/storage/pgsql/data" provided
- INFO: connecting to source node
- DETAIL: connection string is: host=10.79.21.30 user=repmgr dbname=repmgr
- DETAIL: current installation size is 115 MB
- INFO: replication slot usage not requested; no replication slot will be set up for this standby
- NOTICE: checking for available walsenders on the source node (2 required)
- NOTICE: checking replication connections can be made to the source server (2 required)
- INFO: checking and correcting permissions on existing directory "/home/storage/pgsql/data"
- NOTICE: starting backup (using pg_basebackup)...
- HINT: this may take some time; consider using the -c/--fast-checkpoint option
- INFO: executing:
- /usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup" -D /home/storage/pgsql/data -h 10.79.21.30 -p 5432 -U repmgr -X stream
- NOTICE: standby clone (using pg_basebackup) complete
- NOTICE: you can now start your PostgreSQL server
- HINT: for example: /usr/local/pgsql/bin/pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log start
- HINT: after starting the server, you need to register this standby with "repmgr standby register"
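If the primary server fails or needs to be removed from the replication cluster, a new primary must be designated so that the cluster keeps running normally. This is done with repmgr standby promote, which promotes the standby on the current server to primary.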
- $repmgr -f /etc/repmgr.conf cluster show
- ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
- ----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
- 1 | node1 | primary | * running | | default | 100 | 3 | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
- 2 | node2 | standby | running | node1 | default | 100 | 3 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop
- $repmgr -f /etc/repmgr.conf cluster show
- ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
- ----+-------+---------+---------------+----------+----------+----------+----------+------------------------------------------------------------------------
- 1 | node1 | primary | ? unreachable | ? | default | 100 | | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
- 2 | node2 | standby | running | ? node1 | default | 100 | 3 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
-
- WARNING: following issues were detected
- - unable to connect to node "node1" (ID: 1)
- - node "node1" (ID: 1) is registered as an active primary but is unreachable
- - unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
- - unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)
- HINT: execute with --verbose option to see connection error messages
repmgr -f /etc/repmgr.conf standby promote --log-level=debug --verbose
To see detailed log output, add --log-level=debug --verbose
- $repmgr -f /etc/repmgr.conf standby promote --log-level=debug --verbose
- NOTICE: using provided configuration file "/etc/repmgr.conf"
- DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
- DEBUG: set_config():
- SET synchronous_commit TO 'local'
- INFO: connected to standby, checking its state
- DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
- DEBUG: get_node_record():
- SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 2
- INFO: searching for primary node
- DEBUG: get_primary_connection():
- SELECT node_id, conninfo, CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority FROM repmgr.nodes WHERE active IS TRUE AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
- INFO: checking if node 1 is primary
- DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
- ERROR: connection to database failed
- DETAIL:
- could not connect to server: Connection refused
- Is the server running on host "10.79.21.30" and accepting
- TCP/IP connections on port 5432?
-
- DETAIL: attempted to connect using:
- user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path=
- INFO: checking if node 2 is primary
- DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
- DEBUG: set_config():
- SET synchronous_commit TO 'local'
- DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
- DEBUG: get_node_replication_stats():
- SELECT pg_catalog.current_setting('max_wal_senders')::INT AS max_wal_senders, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_stat_replication) AS attached_wal_receivers, current_setting('max_replication_slots')::INT AS max_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE slot_type='physical') AS total_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS TRUE AND slot_type='physical') AS active_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS FALSE AND slot_type='physical') AS inactive_replication_slots, pg_catalog.pg_is_in_recovery() AS in_recovery
- DEBUG: get_active_sibling_node_records():
- SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.upstream_node_id = 1 AND n.node_id != 2 AND n.active IS TRUE ORDER BY n.node_id
- DEBUG: clear_node_info_list() - closing open connections
- DEBUG: clear_node_info_list() - unlinking
- DEBUG: get_node_record():
- SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 2
- NOTICE: promoting standby to primary
- DETAIL: promoting server "node2" (ID: 2) using pg_promote()
- NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
- DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
- DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
- INFO: standby promoted to primary after 1 second(s)
- DEBUG: setting node 2 as primary and marking existing primary as failed
- DEBUG: begin_transaction()
- DEBUG: commit_transaction()
- NOTICE: STANDBY PROMOTE successful
- DETAIL: server "node2" (ID: 2) was successfully promoted to primary
- DEBUG: _create_event(): event is "standby_promote" for node 2
- DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
- DEBUG: _create_event():
- INSERT INTO repmgr.events ( node_id, event, successful, details ) VALUES ($1, $2, $3, $4) RETURNING event_timestamp
- DEBUG: _create_event(): Event timestamp is "2023-11-15 19:31:25.636843+08"
- DEBUG: clear_node_info_list() - closing open connections
- DEBUG: clear_node_info_list() - unlinking
- $repmgr -f /etc/repmgr.conf cluster show
- ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
- ----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
- 1 | node1 | primary | - failed | ? | default | 100 | | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
- 2 | node2 | primary | * running | | default | 100 | 4 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
-
- WARNING: following issues were detected
- - unable to connect to node "node1" (ID: 1)
-
- HINT: execute with --verbose option to see connection error messages
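The result of a promotion can also be checked from a script instead of reading the table by eye. The following is a minimal sketch, assuming the table layout shown in this post (columns separated by `|`, with ID, Name, Role and Status as the first four columns). `find_running_primary` is a hypothetical helper name and not part of repmgr.

```shell
# Hypothetical helper: read "repmgr cluster show" output on stdin and
# print the name of every node whose Role is "primary" and whose
# Status is "* running". Assumes the table layout shown in this post.
find_running_primary() {
    awk -F'|' '
        # Only table rows have nine |-separated columns; the header row
        # is skipped because its first column is not numeric.
        NF >= 9 {
            id = $1; name = $2; role = $3; status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", role)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (id ~ /^[0-9]+$/ && role == "primary" && status == "* running")
                print name
        }
    '
}
```

It could be used as `repmgr -f /etc/repmgr.conf cluster show | find_running_primary`. An empty result would mean that no node is currently reported as a running primary.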
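Scenario
After the existing primary of a replication cluster has failed or been removed, repmgr standby follow can be used to make an "orphaned" standby follow the new primary and catch up to its current state.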
repmgr -f /etc/repmgr.conf standby follow
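In some cases a standby needs to be promoted in a planned way, for example when maintenance has to be performed on the primary; the repmgr standby switchover command supports this kind of switchover.
repmgr standby switchover differs from other repmgr operations in that it also performs actions on other servers (the demotion candidate and, optionally, any other servers that are to follow the new primary). This means passwordless SSH access is required from the server where the command is executed to those servers.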
repmgr -f /etc/repmgr.conf cluster show
- $repmgr -f /etc/repmgr.conf cluster show
- ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
- ----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
- 1 | node1 | primary | * running | | default | 100 | 1 | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
- 2 | node2 | standby | running | node1 | default | 100 | 1 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
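A successful switchover depends on repmgr being able to shut down the current primary quickly and cleanly.
Make sure the promotion candidate has enough free walsenders available (PostgreSQL setting max_wal_senders) and, if replication slots are in use, at least one free slot for the demotion candidate (PostgreSQL setting max_replication_slots).
Make sure passwordless SSH connections can be made from the promotion candidate (standby) to the demotion candidate (current primary). If --siblings-follow is used, make sure passwordless SSH connections can also be made from the promotion candidate to all nodes attached to the demotion candidate (including the witness server, if one is in use).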
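The SSH requirement can be checked by hand before attempting a switchover. The sketch below reuses the same ssh options that repmgr prints in its debug output in this post (`-o Batchmode=yes -q -o ConnectTimeout=10 <host> /bin/true`). The function name `check_ssh_hosts` and the `SSH_BIN` override are assumptions added for illustration.

```shell
# Hypothetical helper: try a non-interactive SSH login to each host given
# as an argument and report whether it succeeded. BatchMode=yes makes ssh
# fail instead of prompting for a password.
# SSH_BIN is an illustrative override; it defaults to the real ssh client.
check_ssh_hosts() {
    failed=0
    for host in "$@"; do
        if "${SSH_BIN:-ssh}" -o BatchMode=yes -q -o ConnectTimeout=10 "$host" /bin/true 2>/dev/null; then
            echo "OK $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    # Non-zero when at least one host could not be reached without a password.
    return "$failed"
}
```

Run on the promotion candidate, for example as `check_ssh_hosts 10.79.21.30`, it should print `OK 10.79.21.30` when key-based login to the current primary works.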
Double-check which commands will be used to stop/start/restart the current primary
- repmgr -f /etc/repmgr.conf node service --list-actions --action=stop
- repmgr -f /etc/repmgr.conf node service --list-actions --action=start
- repmgr -f /etc/repmgr.conf node service --list-actions --action=restart
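Pre-execution check
Run repmgr standby switchover with the --dry-run option to perform a pre-execution check. This carries out all necessary checks, reports success or failure, and stops before the first real command (shutting down the current primary) is run.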
repmgr standby switchover -f /etc/repmgr.conf --dry-run --verbose --log-level=debug
- $repmgr standby switchover -f /etc/repmgr.conf --dry-run --verbose --log-level=debug
- NOTICE: using provided configuration file "/etc/repmgr.conf"
- DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
- DEBUG: set_config():
- SET synchronous_commit TO 'local'
- DEBUG: get_node_record():
- SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 2
- NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
- DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
- INFO: searching for primary node
- DEBUG: get_primary_connection():
- SELECT node_id, conninfo, CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority FROM repmgr.nodes WHERE active IS TRUE AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
- INFO: checking if node 1 is primary
- DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
- DEBUG: set_config():
- SET synchronous_commit TO 'local'
- DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
- INFO: current primary node is 1
- DEBUG: get_node_record():
- SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 1
- DEBUG: remote node name is "node1"
- DEBUG: test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /bin/true 2>/dev/null
- INFO: SSH connection to host "10.79.21.30" succeeded
- DEBUG: remote_command():
- ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug --version >/dev/null 2>&1 && echo "1" || echo "0"
- DEBUG: remote_command(): output returned was:
- 1
-
- DEBUG: remote_command():
- ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug --version 2>/dev/null
- DEBUG: remote_command(): output returned was:
- repmgr 5.3.3
-
- DEBUG: "repmgr" version on "10.79.21.30" is 50303
- DEBUG: remote_command():
- ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 test -f /etc/repmgr.conf && echo 1 || echo 0
- DEBUG: remote_command(): output returned was:
- 1
-
- DEBUG: remote_command():
- ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node check --data-directory-config --optformat -LINFO 2>/dev/null
- DEBUG: remote_command(): output returned was:
- --configured-data-directory=OK
-
- INFO: able to execute "repmgr" on remote host "10.79.21.30"
- DEBUG: remote_command():
- ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node check --replication-config-owner --optformat -LINFO 2>/dev/null
- DEBUG: remote_command(): output returned was:
- --replication-config-owner=OK
-
- DEBUG: get_node_replication_stats():
- SELECT pg_catalog.current_setting('max_wal_senders')::INT AS max_wal_senders, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_stat_replication) AS attached_wal_receivers, current_setting('max_replication_slots')::INT AS max_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE slot_type='physical') AS total_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS TRUE AND slot_type='physical') AS active_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS FALSE AND slot_type='physical') AS inactive_replication_slots, pg_catalog.pg_is_in_recovery() AS in_recovery
- DEBUG: get_active_sibling_node_records():
- SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.upstream_node_id = 1 AND n.node_id != 2 AND n.active IS TRUE ORDER BY n.node_id
- DEBUG: clear_node_info_list() - closing open connections
- DEBUG: clear_node_info_list() - unlinking
- INFO: 1 walsenders required, 10 available
- DEBUG: remote_command():
- ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node check --remote-node-id=2 --replication-connection
- DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
- DEBUG: remote_command(): output returned was:
- --connection=OK
-
- INFO: demotion candidate is able to make replication connection to promotion candidate
- DEBUG: guc_set():
- SELECT true FROM pg_catalog.pg_settings WHERE name = 'archive_mode' AND setting != 'off'
- DEBUG: remote_command():
- ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node check --terse -LERROR --archive-ready --optformat
- DEBUG: remote_command(): output returned was:
- --status=OK --files=0
-
- INFO: 0 pending archive files
- DEBUG: get_replication_lag_seconds():
- SELECT CASE WHEN (pg_catalog.pg_last_wal_receive_lsn() = pg_catalog.pg_last_wal_replay_lsn()) THEN 0 ELSE EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - pg_catalog.pg_last_xact_replay_timestamp()))::INT END AS lag_seconds
- DEBUG: lag is 0
- INFO: replication lag on this standby is 0 seconds
- DEBUG: get_all_node_records():
- SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n ORDER BY n.node_id
- DEBUG: clear_node_info_list() - closing open connections
- DEBUG: clear_node_info_list() - unlinking
- NOTICE: attempting to pause repmgrd on 2 nodes
- DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
- DEBUG: set_config():
- SET synchronous_commit TO 'local'
- DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
- DEBUG: set_config():
- SET synchronous_commit TO 'local'
- NOTICE: local node "node2" (ID: 2) would be promoted to primary; current primary "node1" (ID: 1) would be demoted to standby
- DEBUG: remote_command():
- ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node service --terse -LERROR --list-actions --action=stop
- DEBUG: remote_command(): output returned was:
- /usr/local/pgsql/bin/pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop
-
- INFO: following shutdown command would be run on node "node1":
- "/usr/local/pgsql/bin/pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop"
- INFO: parameter "shutdown_check_timeout" is set to 60 seconds
- DEBUG: clear_node_info_list() - closing open connections
- DEBUG: clear_node_info_list() - unlinking
- INFO: prerequisites for executing STANDBY SWITCHOVER are met
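The dry run and the real switchover can be chained so that the real command only runs when the checks pass. This is a minimal sketch that assumes a failed dry run exits with a non-zero status. `safe_switchover`, `REPMGR_BIN` and `REPMGR_CONF` are hypothetical names introduced here, not repmgr features.

```shell
# Illustrative defaults; both can be overridden from the environment.
REPMGR_BIN="${REPMGR_BIN:-repmgr}"
REPMGR_CONF="${REPMGR_CONF:-/etc/repmgr.conf}"

# Hypothetical wrapper: run the dry run first and only perform the real
# switchover when the dry run reports success (exit status 0).
safe_switchover() {
    if ! "$REPMGR_BIN" -f "$REPMGR_CONF" standby switchover --dry-run; then
        echo "dry run failed; switchover not attempted" >&2
        return 1
    fi
    "$REPMGR_BIN" -f "$REPMGR_CONF" standby switchover
}
```

As with the plain command, the wrapper would be called on the standby that is to become the new primary.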
repmgr -f /etc/repmgr.conf standby switchover
- $repmgr -f /etc/repmgr.conf standby switchover
- NOTICE: executing switchover on node "node2" (ID: 2)
- NOTICE: attempting to pause repmgrd on 2 nodes
- NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
- NOTICE: stopping current primary node "node1" (ID: 1)
- NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
- DETAIL: executing server command "/usr/local/pgsql/bin/pg_ctl -D '/home/storage/pgsql/data' -W -m fast stop"
- INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
- INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
- NOTICE: current primary has been cleanly shut down at location 0/10000028
- NOTICE: promoting standby to primary
- DETAIL: promoting server "node2" (ID: 2) using pg_promote()
- NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
- NOTICE: STANDBY PROMOTE successful
- DETAIL: server "node2" (ID: 2) was successfully promoted to primary
- NOTICE: node "node2" (ID: 2) promoted to primary, node "node1" (ID: 1) demoted to standby
- NOTICE: switchover was successful
- DETAIL: node "node2" is now primary and node "node1" is attached as standby
- NOTICE: STANDBY SWITCHOVER has completed successfully
- $repmgr -f /etc/repmgr.conf cluster show
- ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
- ----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
- 1 | node1 | standby | running | node2 | default | 100 | 1 | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
- 2 | node2 | primary | * running | | default | 100 | 2 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
Cause: the pg_bindir parameter is not set
Fix: add the pg_bindir parameter to the configuration file
- $repmgr -f /etc/repmgr.conf standby switchover
- NOTICE: executing switchover on node "node2" (ID: 2)
- ERROR: unable to execute "repmgr" on "10.79.21.30"
- HINT: check "pg_bindir" is set to the correct path in "repmgr.conf"; current value: (not set)
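Whether the parameter is present can be verified on every node before retrying. The sketch below is a simple grep-based check; `conf_has_param` is a hypothetical helper name, and the example value `/usr/local/pgsql/bin` is taken from the binary paths that appear in the logs in this post.

```shell
# Hypothetical helper: succeed if the given parameter is assigned on an
# uncommented line of the configuration file.
#   $1 = path to repmgr.conf, $2 = parameter name
# On this setup the expected line would look like:
#   pg_bindir='/usr/local/pgsql/bin'
conf_has_param() {
    grep -Eq "^[[:space:]]*$2[[:space:]]*=" "$1"
}
```

For example, `conf_has_param /etc/repmgr.conf pg_bindir || echo "pg_bindir is not set"` would print a warning on a node where the line is missing or commented out.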
- repmgr -f /etc/repmgr.conf standby switchover
- NOTICE: executing switchover on node "node2" (ID: 2)
- NOTICE: attempting to pause repmgrd on 2 nodes
- NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
- NOTICE: stopping current primary node "node1" (ID: 1)
- NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
- DETAIL: executing server command "pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop"
- INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
- INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
- ...
- INFO: checking for primary shutdown; 60 of 60 attempts ("shutdown_check_timeout")
- ERROR: shutdown of the primary server could not be confirmed
- HINT: check the primary server status before performing any further actions
Fix:
Change the command parameters to use absolute paths, for example:
service_start_command='/usr/local/pgsql/bin/pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log start'
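To find any remaining service commands that still rely on a relative path, the configuration file can be scanned. This is a rough sketch, assuming each `service_*_command` is written on a single line as in the example above; `find_relative_service_commands` is a hypothetical helper name. A command resolved through `PATH` is not necessarily wrong, so the output is only a list of candidates to review.

```shell
# Hypothetical helper: print the name of every service_*_command in the
# given repmgr.conf whose command does not start with an absolute path.
#   $1 = path to repmgr.conf
find_relative_service_commands() {
    awk -v sq="'" '
        /^[ \t]*service_[a-z_]+_command[ \t]*=/ {
            # Parameter name: text before the equals sign, trimmed.
            name = $0
            sub(/^[ \t]+/, "", name)
            sub(/[ \t]*=.*$/, "", name)
            # Parameter value: text after the equals sign.
            value = $0
            sub(/^[^=]*=[ \t]*/, "", value)
            # Drop one leading single or double quote, if present.
            first = substr(value, 1, 1)
            if (first == sq || first == "\"")
                value = substr(value, 2)
            if (substr(value, 1, 1) != "/")
                print name
        }
    ' "$1"
}
```

Calling `find_relative_service_commands /etc/repmgr.conf` prints nothing when every service command begins with `/`.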