Postgresql-9.6 使用pg_rewind主备切换恢复-ghan-ChinaUnix博客

雾里看花ghan.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

ghan

博客访问： 2827587
博文数量： 423
博客积分： 7770
博客等级：少将
技术积分： 4766
用户组：普通用户
注册时间： 2006-11-09 11:58

个人简介

Oracle/DB2/Postgresql/Mysql/Hadoop/Greenplum/Postgres-xl/Mongodb

文章分类

全部博文（423）

Postgres-XL（3）
Greenplum（7）
Bacula（0）
监控系统（1）

ZABBIX（0）
Nginx（4）
Unix（1）
Apache（6）
RHCE认证手记（5）
Postgresql（49）
MSQL server（3）
DB2（60）
Oracle（68）
Mysql（15）
Perl（1）
网络安全技术（19）
FreeBSD（2）
JAVA（11）
网络管理（11）
Linux+C/C++（3）
Kernel（8）
服务器组建（31）
Linux糸统管理（108）
未分配的博文（7）

文章存档

2019年（3）

2018年（6）

2017年（27）

2016年（23）

2015年（30）

2014年（16）

2013年（31）

2012年（73）

2011年（45）

2010年（14）

2009年（30）

2008年（30）

2007年（63）

2006年（32）

我的朋友

相关博文

Postgresql-9.6 使用pg_rewind主备切换恢复

分类： Mysql/postgreSQL

2016-07-19 12:28:00

当前环境：

主节点：108.88.3.116
备节点：108.88.3.180

一、主节点（116）环境配置及状态

[postgres@116 data]$ postgresql.conf

#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------

# - Settings -

wal_level = logical # minimal, replica, or logical
# (change requires restart)
#fsync = on # flush data to disk for crash safety
# (turning this off can cause
# unrecoverable data corruption)

#checkpoint_timeout = 5min # range 30s-1h
max_wal_size = 1GB
min_wal_size = 80MB

# - Sending Server(s) -

# Set these on the master and on any standby that will send replication data.

max_wal_senders = 5 # max number of walsender processes
# (change requires restart)
#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables

# - Standby Servers -

# These settings are ignored on a master server.

hot_standby = on # "on" allows queries during recovery
# (change requires restart)

# - Archiving -

archive_mode = on # enables archiving; off, on, or always
# (change requires restart)
archive_command = 'cp %p /archivelog/%f' # command to use to archive a logfile segment
# placeholders: %p = path of file to archive
# %f = file name only

[postgres@116 data]$ vi pg_hba.conf

# IPv4 local connections:
host all all 127.0.0.1/32 trust
host all all 108.88.3.0/24 trust //这里为了方便直接设置trust

[postgres@116 data]$ vi recovery.done //用于节点降为备节点的准备文件,在主节点不起作用，为了切换备节点时方便

#---------------------------------------------------------------------------
# STANDBY SERVER PARAMETERS
#---------------------------------------------------------------------------
#
# standby_mode
#
# When standby_mode is enabled, the PostgreSQL server will work as a
# standby. It will continuously wait for the additional XLOG records, using
# restore_command and/or primary_conninfo.
#
standby_mode = on
#
# primary_conninfo
#
# If set, the PostgreSQL server will try to connect to the primary using this
# connection string and receive XLOG records continuously.
#
primary_conninfo = 'host=108.88.3.180 port=5432 user=replication password=6220104' # e.g. 'host=localhost port=5432'
#
# If set, the PostgreSQL server will use the specified replication slot when
# connecting to the primary via streaming replication to control resource
# removal on the upstream node. This setting has no effect if primary_conninfo
# is not set.
#
#primary_slot_name = ''
#
# By default, a standby server keeps restoring XLOG records from the
# primary indefinitely. If you want to stop the standby mode, finish recovery
# and open the system in read/write mode, specify a path to a trigger file.
# The server will poll the trigger file path periodically and start as a
# primary server when it's found.
#
trigger_file = 'faile'

[postgres@node02 data]$ psql -d ghan
ghan=# \x
扩展显示已打开。
ghan=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+-----------------------------
pid | 8101
usesysid | 16396
usename | replication
application_name | walreceiver
client_addr | 108.88.3.180
client_hostname |
client_port | 51184
backend_start | 2016-07-11 13:10:58.32561+08
backend_xmin |
state | streaming
sent_location | 1/174A1240
write_location | 1/174A1240
flush_location | 1/174A1240
replay_location | 1/174A1240
sync_priority | 0
sync_state | async

ghan=#

二、备节点（180）环境配置及状态

[postgres@180 data]$ postgresql.conf
#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------

# - Settings -

wal_level = logical # minimal, replica, or logical
# (change requires restart)
#fsync = on # flush data to disk for crash safety
# (turning this off can cause
# unrecoverable data corruption)

#checkpoint_timeout = 5min # range 30s-1h
max_wal_size = 1GB
min_wal_size = 80MB

# - Sending Server(s) -

# Set these on the master and on any standby that will send replication data.

max_wal_senders = 5 # max number of walsender processes
# (change requires restart)
#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables

# - Standby Servers -

# These settings are ignored on a master server.

hot_standby = on # "on" allows queries during recovery
# (change requires restart)

# - Archiving -

archive_mode = on # enables archiving; off, on, or always
# (change requires restart)
archive_command = 'cp %p /archivelog/%f' # command to use to archive a logfile segment
# placeholders: %p = path of file to archive
# %f = file name only

[postgres@180 data]$ vi pg_hba.conf

# IPv4 local connections:
host all all 127.0.0.1/32 trust
host all all 108.88.3.0/24 trust //这里为了方便直接设置trust

[postgres@node01 data]$ vi recovery.conf //由于正处于备份节点，故所以为recovery.conf

#---------------------------------------------------------------------------
# STANDBY SERVER PARAMETERS
#---------------------------------------------------------------------------
#
# standby_mode
#
# When standby_mode is enabled, the PostgreSQL server will work as a
# standby. It will continuously wait for the additional XLOG records, using
# restore_command and/or primary_conninfo.
#
standby_mode = on
#
# primary_conninfo
#
# If set, the PostgreSQL server will try to connect to the primary using this
# connection string and receive XLOG records continuously.
#
primary_conninfo = 'host=108.88.3.116 port=5432 user=replication password=6220104' # e.g. 'host=localhost port=5432'
#
# If set, the PostgreSQL server will use the specified replication slot when
# connecting to the primary via streaming replication to control resource
# removal on the upstream node. This setting has no effect if primary_conninfo
# is not set.
#
#primary_slot_name = ''
#
# By default, a standby server keeps restoring XLOG records from the
# primary indefinitely. If you want to stop the standby mode, finish recovery
# and open the system in read/write mode, specify a path to a trigger file.
# The server will poll the trigger file path periodically and start as a
# primary server when it's found.
#
trigger_file = 'faile'

三、备节点（180）提升为主节点

[postgres@180 data]$ touch faile
[postgres@180 data]$ cd ../
[postgres@180 data]$ tail -f 999.log
LOG: trigger file found: faile
FATAL: terminating walreceiver process due to administrator command
LOG: invalid record length at 1/174A1320: wanted 24, got 0
LOG: redo done at 1/174A12E8
LOG: last completed transaction was at log time 2016-07-11 13:09:25.476144+08
LOG: selected new timeline ID: 9
LOG: archive recovery complete
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

四、新主节点（180）增加数据（表及数据）

[postgres@node01 data]$ psql -d ghan
WARNING: password file "/home/postgres/.pgpass" has group or world access; permissions should be u=rw (0600) or less
psql (9.6beta2)
Type "help" for help.

ghan=# create table t11(id int);
CREATE TABLE
ghan=#
insert into t11 values(generate_series(1,122179));
INSERT 0 122179
ghan=#
insert into t11 values(generate_series(1,122179));
INSERT 0 122179
ghan=#
insert into t11 values(generate_series(1,122179));
INSERT 0 122179
ghan=#
insert into t11 values(generate_series(1,122179));
INSERT 0 122179
ghan=# select count(*) from t11
;
count
--------
488716
(1 row)

ghan=#
insert into t8 values(generate_series(1,122179));
INSERT 0 122179
ghan=# select count(*) from t11
;
count
--------
488716
(1 row)

ghan=# select count(*) from t8;
count
--------
382375
(1 row)

ghan=# checkpoint;

五、旧主节点（116）降为备节点

[postgres@116 data]$ pg_ctl -D data/ stop -m fast
waiting for server to shut down.... done
server stopped
[postgres@116 data]$

[postgres@116 data]$ pg_rewind --target-pgdata data --source-server='host=108.88.3.180 port=5432 user=postgres dbname=ghan'
servers diverged at WAL position 1/174A1320 on timeline 8
rewinding from last common checkpoint at 1/174A1278 on timeline 8
Done!
[postgres@116 data]$

[postgres@116 data]$ mv recovery.done recovery.conf
[postgres@116 data]$ vi recovery.conf

primary_conninfo = 'host=108.88.3.180 port=5432 user=replication password=6220104' # e.g. 'host=localhost port=5432'

[postgres@116 data]$ pg_ctl -D data start -l ok.log //如果长时间没有做切换，部份新备机的pg_xlog会找不到部份文件而无法启动，这时需要时主节点下/archivelg下归档复制到新pg_xlog
server starting
[postgres@node02 data]$ tail -f ok.log
LOG: database system was interrupted while in recovery at log time 2016-07-11 13:40:26 CST
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: redo starts at 1/174A1240
LOG: invalid record length at 1/1A33B220: wanted 24, got 0
LOG: started streaming WAL from primary at 1/1A000000 on timeline 9
LOG: consistent recovery state reached at 1/1A33B258
LOG: database system is ready to accept read only connections

[postgres@116 data]$ psql -d ghan
psql (9.4.4, 服务器 9.6beta2)
警告：psql 主版本9.4,服务器主版本为9.6.
一些psql功能可能无法工作.
输入 "help" 来获取帮助信息.

ghan=# select count(*) from t8;
count
--------
382375
(1 行记录)

ghan=# select count(*) from t11;
count
--------
488716
(1 行记录)

ghan=#

六、主备节点状态

备节点状态：

postgres 8271 1 0 13:48 pts/0 00:00:00 /usr/local/pg9.6/bin/postgres -D data
postgres 8272 8271 0 13:48 ? 00:00:00 postgres: startup process recovering 00000009000000010000001A
postgres 8274 8271 0 13:48 ? 00:00:00 postgres: checkpointer process
postgres 8275 8271 0 13:48 ? 00:00:00 postgres: writer process
postgres 8276 8271 0 13:48 ? 00:00:00 postgres: wal receiver process streaming 1/1A33E648
postgres 8277 8271 0 13:48 ? 00:00:00 postgres: stats collector process
postgres 8290 7839 0 13:50 pts/0 00:00:00 ps -ef
[postgres@116 data]$

主节点

postgres 20924 1 0 13:09 pts/0 00:00:00 /usr/local/pg9.6/bin/postgres -D data
postgres 20930 20924 0 13:10 ? 00:00:00 postgres: checkpointer process
postgres 20931 20924 0 13:10 ? 00:00:00 postgres: writer process
postgres 20932 20924 0 13:10 ? 00:00:00 postgres: stats collector process
postgres 21009 20924 0 13:40 ? 00:00:00 postgres: wal writer process
postgres 21010 20924 0 13:40 ? 00:00:00 postgres: autovacuum launcher process
postgres 21011 20924 0 13:40 ? 00:00:00 postgres: archiver process last was 000000090000000100000019
postgres 21044 20924 0 13:47 ? 00:00:00 postgres: wal sender process replication 108.88.3.116(36878) streaming 1/1A33E648
postgres 21055 20731 0 13:49 pts/0 00:00:00 ps -ef
[postgres@180 archivelog]$

ghan=# \x
Expanded display is on.
ghan=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid | 21044
usesysid | 16396
usename | replication
application_name | walreceiver
client_addr | 108.88.3.116
client_hostname |
client_port | 36878
backend_start | 2016-07-11 13:47:55.931392+08
backend_xmin |
state | streaming
sent_location | 1/1A33E728
write_location | 1/1A33E728
flush_location | 1/1A33E728
replay_location | 1/1A33E728
sync_priority | 0
sync_state | async

ghan=#

错误解决：

pg_rewind—使一个PostgreSQL数据目录与另一个数据目录（该目录从第一个PostgreSQL数据目录创建而来）一致。

描述

pg_rewind是一个在集群的时间线参数偏离之后，用于使一个PostgreSQL集群与另一个相同集群的拷贝同步的工具。一个典型的场景是在故障转移之后，
让一个老的主服务器重新在线作为一个standby跟随新主服务器。

其结果相当于使用源数据目录替换目标数据目录。所有的文件都被拷贝，包括配置文件。与做一个基础备份或者像rsync这样的工具相比，
pg_rewind的优势是pg_rewind不需要读取所有集群中没有更改的文件。当数据库很大，并且只有一小部分不同的集群之间，使它的速度快得多。

pg_rewind检查源集群与目标集群的时间线历史来检测它们产生分歧的点，并希望在目标集群的pg_xlog目录找到WAL回到分歧点的所有方式。
在典型的故障转移场景：目标集群在分歧之后立即被关闭，那是没有问题的，但是，如果目标集群在分歧之后运行了很长一段时间，老的WAL文件可能不存在了。
在这种情况下，它们可以手动从WAL归档复制到pg_xlog目录。目前不支持从一个WAL归档中自动获取丢失的文件。

在运行pg_rewind之后，当目标服务器第一次被启动，它将进入恢复模式并重放从分歧点之后源服务器产生的所有WAL。
当pg_rewind被运行时，如果一些 WAL在源服务器上不再可用，因此不能用pg_rewind回话复制，当目标服务器被启动时时可以的。
这可以通过在目标数据目录创建一个带有合适的restore_command命令的recovery.conf文件来实现。

阅读(4611) | 评论(0) | 转发(0) |

上一篇：postgresql无法建UFT8编码数据库解决方法

下一篇：postgresql安装pg_stat_statements 模块监控运行状态

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6