• 售前

  • 售后

热门帖子
入门百科

postgres主备切换之文件触发方式详解

[复制链接]
山风点烟捶 显示全部楼层 发表于 2021-10-26 12:44:03 |阅读模式 打印 上一主题 下一主题
本文测试参考PostgresSQL实战一书。
本文档测试情况:

主库IP:192.168.40.130 主机名:postgres 端口:5442
备库IP: 192.168.40.131 主机名:postgreshot 端口:5442
PostgreSQL9.0版本流复制主备切换只能通过创建触发文件方式进行,这一末节将介绍这种主备切换方式,测试情况为一主一备异步流复制情况,postgres上的数据库为主库,postgreshot上的数据库为备库,文件触发方式的手工主备切换紧张步调如下:
1)配置备库recovery.conf文件trigger_file参数,设置激活备库的触发文件路径和名称。
2)关闭主库,建议使用-m fast模式关闭。
3)在备库上创建触发文件激活备库,如果recovery.conf变成recovery.done表现备库已经切换成主库。
4)这时需要将老的主库切换成备库,在老的主库的$PGDATA目录下创建recovery.conf文件(如果此目录下不存在recovery.conf文件,可以根据$PGHOME/share/recovery.conf.sample模板文件复制一个,如果此目录下存在recovery.done文件,需将recovery.done文件重定名为recovery.conf),配置和老的从库一样,只是primary_conninfo参数中的IP换成对端IP。
5)启动老的主库,这时观察主、备历程是否正常,如果正常表现主备切换成功。
1、起首在备库上配置recovery.conf,如下所示:
  1. [postgres@postgreshot pg11]$ cat recovery.conf | grep -v '^#'
  2. recovery_target_timeline = 'latest'
  3. standby_mode = on
  4. primary_conninfo = 'host=192.168.40.130 port=5442 user=replica application_name=pg1'  # e.g. 'host=localhost port=5432'
  5. trigger_file = '/home/postgres/pg11/trigger'
  6. [postgres@postgreshot pg11]$
复制代码
trigger_file可以配置成平凡文件或隐蔽文件,调解以上参数后需重启备库使配置参数生效。
2、关闭主库,如下所示:
  1. [postgres@postgres pg11]$ pg_ctl stop -m fast
  2. waiting for server to shut down.... done
  3. server stopped
  4. [postgres@postgres pg11]$
复制代码
3、在备库上创建触发文件激活备库,如下所示:
  1. [postgres@postgreshot pg11]$ ll recovery.conf
  2. -rwx------ 1 postgres postgres 5.9K Mar 26 18:47 recovery.conf
  3. [postgres@postgreshot pg11]$
  4. [postgres@postgreshot pg11]$ touch /home/postgres/pg11/trigger
  5. [postgres@postgreshot pg11]$ ll recovery*
  6. -rwx------ 1 postgres postgres 5.9K Mar 26 18:47 recovery.done
  7. [postgres@postgreshot pg11]$
复制代码
触发器文件名称和路径需和recovery.conf配置文件trigger_file保持一致,再次检察recovery文件时,发现后辍由原来的.conf变成了.done
检察备库数据库日记,如下所示:
  1. 2019-03-26 23:30:19.399 EDT [93162] LOG: replication terminated by primary server
  2. 2019-03-26 23:30:19.399 EDT [93162] DETAIL: End of WAL reached on timeline 3 at 0/50003D0.
  3. 2019-03-26 23:30:19.399 EDT [93162] FATAL: could not send end-of-streaming message to primary: no COPY in progress
  4. 2019-03-26 23:30:19.399 EDT [93158] LOG: invalid record length at 0/50003D0: wanted 24, got 0
  5. 2019-03-26 23:30:19.405 EDT [125172] FATAL: could not connect to the primary server: server closed the connection unexpectedly
  6.   This probably means the server terminated abnormally
  7.   before or while processing the request.
  8. 2019-03-26 23:30:24.410 EDT [125179] FATAL: could not connect to the primary server: could not connect to server: Connection refused
  9.   Is the server running on host "192.168.40.130" and accepting
  10.   TCP/IP connections on port 5442?
  11. 2019-03-26 23:31:49.505 EDT [93158] LOG: trigger file found: /home/postgres/pg11/trigger
  12. 2019-03-26 23:31:49.506 EDT [93158] LOG: redo done at 0/5000360
  13. 2019-03-26 23:31:49.506 EDT [93158] LOG: last completed transaction was at log time 2019-03-26 19:03:11.202845-04
  14. 2019-03-26 23:31:49.516 EDT [93158] LOG: selected new timeline ID: 4
  15. 2019-03-26 23:31:50.063 EDT [93158] LOG: archive recovery complete
  16. 2019-03-26 23:31:50.083 EDT [93157] LOG: database system is ready to accept connections
复制代码
根据备库以上信息,由于关闭了主库,起首日记显示毗连不上主库,接着显示发现了触发文件,之后显示恢复成功,数据库切换成读写模式。
这时根据pg_controldata输出进行验证,如下所示:
  1. [postgres@postgreshot ~]$ pg_controldata | grep cluster
  2. Database cluster state:  in production
  3. [postgres@postgreshot ~]$
复制代码
以上显示数据库脚色已经是主库脚色,在postgreshot上创建一张名为test_alived的表并插入数据,如下所示:
  1. postgres=# CREATE TABLE test_alived2(id int4);
  2. CREATE TABLE
  3. postgres=# INSERT INTO test_alived2 VALUES(1);
  4. INSERT 0 1
  5. postgres=#
复制代码
4、预备将老的主库切换成备库脚色,在老的主库上配置recovery.conf,如下所示:
  1. [postgres@postgres pg11]$ cat recovery.conf | grep -v '^#'
  2. recovery_target_timeline = 'latest'
  3. standby_mode = on
  4. primary_conninfo = 'host=192.168.40.131 port=5442 user=replica application_name=pg2'  # e.g. 'host=localhost port=5432'
  5. trigger_file = '/home/postgres/pg11/trigger'
  6. [postgres@postgres pg11]$
复制代码
以上配置和postgreshot上的recovery.done配置文件根本一致,只是primary_conninfo参数的host选项配置成对端主机IP。
之后在postgres主机用户家目录创建~/.pgpass文件,如下所示:
  1. [postgres@pghost1 ~]$ touch ~/.pgpass
  2. [postgres@pghost1 ~]$ chmod 600 ~/.pgpass
复制代码
并在~/.pgpass文件中插入以下内容:
  1. [postgres@postgres ~]$ cat .pgpass
  2. 192.168.40.130:5442:replication:replica:replica
  3. 192.168.40.131:5442:replication:replica:replica
  4. [postgres@postgres ~]
复制代码
之后启动postgres上的数据库,如下所示:
  1. [postgres@postgres ~]$ pg_ctl start
  2. waiting for server to start....2019-03-26 23:38:50.424 EDT [55380] LOG: listening on IPv4 address "0.0.0.0", port 5442
  3. 2019-03-26 23:38:50.424 EDT [55380] LOG: listening on IPv6 address "::", port 5442
  4. 2019-03-26 23:38:50.443 EDT [55380] LOG: listening on Unix socket "/tmp/.s.PGSQL.5442"
  5. 2019-03-26 23:38:50.465 EDT [55381] LOG: database system was shut down in recovery at 2019-03-26 23:38:20 EDT
  6. 2019-03-26 23:38:50.465 EDT [55381] LOG: entering standby mode
  7. 2019-03-26 23:38:50.483 EDT [55381] LOG: consistent recovery state reached at 0/50003D0
  8. 2019-03-26 23:38:50.483 EDT [55381] LOG: invalid record length at 0/50003D0: wanted 24, got 0
  9. 2019-03-26 23:38:50.483 EDT [55380] LOG: database system is ready to accept read only connections
  10. done
  11. server started
  12. [postgres@postgres ~]$ 2019-03-26 23:38:50.565 EDT [55385] LOG: fetching timeline history file for timeline 4 from primary server
  13. 2019-03-26 23:38:50.588 EDT [55385] LOG: started streaming WAL from primary at 0/5000000 on timeline 3
  14. 2019-03-26 23:38:50.589 EDT [55385] LOG: replication terminated by primary server
  15. 2019-03-26 23:38:50.589 EDT [55385] DETAIL: End of WAL reached on timeline 3 at 0/50003D0.
  16. 2019-03-26 23:38:50.592 EDT [55381] LOG: new target timeline is 4
  17. 2019-03-26 23:38:50.594 EDT [55385] LOG: restarted WAL streaming at 0/5000000 on timeline 4
  18. 2019-03-26 23:38:50.717 EDT [55381] LOG: redo starts at 0/50003D0
  19. [postgres@postgres ~]$ pg_controldata | grep cluster
  20. Database cluster state:  in archive recovery
  21. [postgres@postgres ~]$
  22. postgres=# select * from test_alived2;
  23. id
  24. ----
  25. 1
  26. (1 row)
  27. postgres=#
复制代码
同时,postgres上已经有了WAL接收历程,postgreshot上有了WAL发送历程,阐明老的主库已经成功切换成备库,以上是主备切换的所有步调。
为什么在步调2中需要干净地关闭主库?数据库关闭时起首做一次checkpoint,完成之后通知WAL发送历程要关闭了,WAL发送历程会将停止此次checkpoint的WAL日记流发送给备库的WAL接收历程,备节点接收到主库最后发送来的WAL日记流后应用WAL,从而达到了和主库一致的状态。
另一个需要注意的题目是如果主库主机非常宕机了,如果激活备库,备库的数据完全和主库一致吗?此情况为一主一备异步流复制情况,备库和主库是异步同步方式,存在延时,这时主库上已提交事务的WAL有可能还没来得及发送给备库,主库主机就已经宕机了,因此异步流复制备库可能存在事务丢失的风险。
主备切换之pg_ctl promote方式

上面介绍了以文件触发方式进行主备切换,PostgreSQL9.1版本开始支持pg_ctl promote触发方式,相比文件触发方式使用更方便,promote下令语法如下:
  1. pg_ctl promote [-D datadir]
复制代码
-D是指数据目录,如果不指定会使用情况变量$PGDATA设置的值。promote下令发出后,运行中的备库将制止恢复模式并切换成读写模式的主库。
pg_ctl promote主备切换步调和文件触发方式大体雷同,只是步调1中不需要配置recovery.conf配置文件中的trigger_file参数,而且步调3中换成以pg_ctl promote方式进行主备切换,如下:
1)关闭主库,建议使用-m fast模式关闭。
2)在备库上执行pg_ctl promote下令激活备库,如果recovery.conf变成recovery.done表现备库已切换成为主库。
3)这时需要将老的主库切换成备库,在老的主库的$PGDATA目录下创建recovery.conf文件(如果此目录下不存在recovery.conf文件,可以根据$PGHOME/share/recovery.conf.sample模板文件复制一个,如果此目录下存在recovery.done文件,需将recovery.done文件重定名为recovery.conf),配置和老的从库一样,只是primary_conninfo参数中的IP换成对端IP。
4)启动老的主库,这时观察主、备历程是否正常,如果正常表现主备切换成功。以上是pg_ctl promote主备切换的紧张步调,这一末节不进行演示了,下一末节介绍pg_rewind工具时会给出使用pg_ctl promote进行主备切换的示例
  1. pg_rewind
复制代码
pg_rewind是流复制维护时一个非常好的数据同步工具,在上一节介绍流复制主备切换内容中讲到了紧张有五个步调进行主备切换,此中步调2是在激活备库前先关闭主库,如果不做步调2会出现什么样的情况?下面我们举例进行演示,测试情况为一主一备异步流复制情况,postgres上的数据库为主库,postgreshot上的数据库为备库。
主备切换

--备节点 recovery.conf 配置: postgreshot 上使用
备库recovery.conf配置如下所示:
  1. [postgres@postgreshot pg11]$ cat recovery.conf | grep -v '^#'
  2. recovery_target_timeline = 'latest'
  3. standby_mode = on
  4. primary_conninfo = 'host=192.168.40.130 port=5442 user=replica application_name=pg1'  # e.g. 'host=localhost port=5432'
  5. trigger_file = '/home/postgres/pg11/trigger'
  6. [postgres@postgreshot pg11]$
复制代码
--激活备节点: postgreshot 上使用
检查流复制状态,确保正常后在备库主机上执行以下下令激活备库,如下所示
  1. [postgres@postgreshot pg11]$ pg_ctl promote -D $PGDATA
  2. waiting for server to promote.... done
  3. server promoted
  4. [postgres@postgreshot pg11]$
  5. [postgres@postgreshot pg11]$
复制代码
检察备库数据库日记,可以或许看到数据库正常打开接收外部毗连的信息,这阐明激活成功,检查postgreshot上的数据库脚色,如下所示:
  1. [postgres@postgreshot pg11]$ pg_controldata | grep cluster
  2. Database cluster state:  in production
  3. [postgres@postgreshot pg11]$
复制代码
从pg_controldata输出也可以看到postgreshot上的数据库已成为主库,阐明postgreshot上的数据库已经切换成主库,这时老的主库(postgres上的数据库)依然还在运行中,我们计划将postgres上的脚色转换成备库,先检察postgres上的数据库脚色,如下所示
  1. [postgres@postgres pg11]$ pg_controldata | grep cluster
  2. Database cluster state:  in production
  3. [postgres@postgres pg11]$
复制代码
--备节点激活后,创建一张测试表并插入数据
  1. postgres=# create table test_1(id int4);
  2. CREATE TABLE
  3. postgres=# insert into test_1(id) select n from generate_series(1,10) n;
  4. INSERT 0 10
  5. postgres=#
复制代码
--停原来主节点: postgres 上使用
  1. [postgres@postgres pg11]$ pg_controldata | grep cluster
  2. Database cluster state:  in production
  3. [postgres@postgres pg11]$ [postgres@postgres pg11]$ pg_ctl stop -m fast -D $PGDATA2019-03-27 01:10:46.714 EDT [64858] LOG: received fast shutdown requestwaiting for server to shut down....2019-03-27 01:10:46.716 EDT [64858] LOG: aborting any active transactions2019-03-27 01:10:46.717 EDT [64858] LOG: background worker "logical replication launcher" (PID 64865) exited with exit code 12019-03-27 01:10:46.718 EDT [64860] LOG: shutting down2019-03-27 01:10:46.731 EDT [64858] LOG: database system is shut down doneserver stopped[postgres@postgres pg11]$
复制代码
--pg_rewind: postgres 上使用
  1. [postgres@postgreshot pg11]$ pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.40.131 port=5442 user=replica password=replica'
  2. target server needs to use either data checksums or " = on"
  3. Failure, exiting
  4. [postgres@postgreshot pg11]$
复制代码
备注:数据库在 initdb 时需要开启 checksums 大概设置 "wal_log_hints = on", 接着设置主,备节点的 wal_log_hints 参数并重启数据库。
  1. [postgres@postgres pg11]$ pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.40.131 port=5442 user=replica password=replica'
  2. servers diverged at WAL location 0/70001E8 on timeline 5
  3. rewinding from last common checkpoint at 0/6000098 on timeline 5
  4. Done!
  5. [postgres@postgres pg11]$
  6. [postgres@postgres pg11]$
复制代码
备注:pg_rewind 成功。
--调解 recovery.conf 文件: postgres 使用
  1. [postgres@postgres pg11]$ mv recovery.done recovery.conf[postgres@postgres pg11]$ [postgres@postgres pg11]$ cat recovery.conf | grep -v '^#'
  2. recovery_target_timeline = 'latest'
  3. standby_mode = on
  4. primary_conninfo = 'host=192.168.40.131 port=5442 user=replica application_name=pg2'  # e.g. 'host=localhost port=5432'
  5. trigger_file = '/home/postgres/pg11/trigger'
  6. [postgres@postgres pg11]$
复制代码
--启动原主库, postgres 上使用
  1. [postgres@postgres pg11]$ pg_ctl start -D $PGDATA
  2. waiting for server to start....2019-03-27 01:14:48.028 EDT [66323] LOG: listening on IPv4 address "0.0.0.0", port 5442
  3. 2019-03-27 01:14:48.028 EDT [66323] LOG: listening on IPv6 address "::", port 5442
  4. 2019-03-27 01:14:48.031 EDT [66323] LOG: listening on Unix socket "/tmp/.s.PGSQL.5442"
  5. 2019-03-27 01:14:48.045 EDT [66324] LOG: database system was interrupted while in recovery at log time 2019-03-27 01:08:08 EDT
  6. 2019-03-27 01:14:48.045 EDT [66324] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
  7. 2019-03-27 01:14:48.084 EDT [66324] LOG: entering standby mode
  8. 2019-03-27 01:14:48.089 EDT [66324] LOG: redo starts at 0/6000060
  9. 2019-03-27 01:14:48.091 EDT [66324] LOG: invalid record length at 0/7024C98: wanted 24, got 0
  10. 2019-03-27 01:14:48.096 EDT [66331] LOG: started streaming WAL from primary at 0/7000000 on timeline 6
  11. 2019-03-27 01:14:48.109 EDT [66324] LOG: consistent recovery state reached at 0/7024CD0
  12. 2019-03-27 01:14:48.110 EDT [66323] LOG: database system is ready to accept read only connections
  13. done
  14. server started
  15. [postgres@postgres pg11]$
  16. [postgres@postgres pg11]$ pg_controldata | grep cluster
  17. Database cluster state:  in archive recovery
  18. [postgres@postgres pg11]$
复制代码
--数据验证, postgres 上使用
  1. [postgres@postgres pg11]$ p
  2. psql (11.1)
  3. Type "help" for help.
  4. postgres=# select count(*) from test_1;
  5. count
  6. -------
  7. 10
  8. (1 row)
  9. postgres=#
复制代码
备注:pg_rewind 成功,原主库现在是以备库脚色启动,而且数据表 test_1 也同步过来了。
pg_rewind 原理

The basic idea is to copy everything from the new cluster to the old cluster, except for the blocks that we know to be the same.
1)Scan the WAL log of the old cluster, starting from the last checkpoint before the point where the new cluster's timeline history forked off from the old cluster. For each WAL record, make a note of the data blocks that were touched. This yields a list of all the data blocks that were changed in the old cluster, after the new cluster forked off.
2)Copy all those changed blocks from the new cluster to the old cluster.
3)Copy all other files like clog, conf files etc. from the new cluster to old cluster. Everything except the relation files.
4) Apply the WAL from the new cluster, starting from the checkpoint created at failover. (Strictly speaking, pg_rewind doesn't apply the WAL, it just creates a backup label file indicating that when PostgreSQL is started, it will start replay from that checkpoint and apply all the required WAL.)
增补:postgres主备搭建时踩坑点
搭建pg主备流复制时的踩坑集锦

1: socket 路径题目 报错如下
你好! 这是你第一次使用 **Markdown编辑器** 所展示的接待页。如果你想学习怎样使用Markdown编辑器,过细阅读这篇文章,相识一下Markdown的根本语法知识。办理方法: 修改postgres.conf中unix_socket_permissions = ‘*' 路径修改为上述报错中的路径 重启即可
2:搭建主备时 备库的data目录肯定 肯定 肯定要使用主库基础备份出来的数据。 可采用pg_basebackup 的方式, 也可以采用tar包 打包 解包的方式 进行基础备份
如果备库不小心已经初始化过 请删除data目录下的* 并使用主库的基础备份重新启动
3:备库启动时报错 FATAL: no pg_hba.conf entry for replication connection from host “172.20.0.16”, user “repl” 之类的题目
例如 master:IP: *.1 standby:IP *.2 主备账号repl
那么在pg_hba.cnf中 单单指明 host replication repl *.2 md5 是不可的
还需在此条记载前面 添加 host all all *.2 md5
起主要能访问主库 才会资格使用repl账号进行同步的步调
以上为个人履历,盼望能给大家一个参考,也盼望大家多多支持脚本之家。如有错误或未考虑完全的地方,望不吝见教。

帖子地址: 

回复

使用道具 举报

分享
推广
火星云矿 | 预约S19Pro,享500抵1000!
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

草根技术分享(草根吧)是全球知名中文IT技术交流平台,创建于2021年,包含原创博客、精品问答、职业培训、技术社区、资源下载等产品服务,提供原创、优质、完整内容的专业IT技术开发社区。
  • 官方手机版

  • 微信公众号

  • 商务合作