当前环境:
主节点:108.88.3.116
备节点:108.88.3.180
一、主节点(116)环境配置及状态
[postgres@116 data]$ postgresql.conf
#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------
# - Settings -
wal_level = logical # minimal, replica, or logical
# (change requires restart)
#fsync = on # flush data to disk for crash safety
# (turning this off can cause
# unrecoverable data corruption)
#checkpoint_timeout = 5min # range 30s-1h
max_wal_size = 1GB
min_wal_size = 80MB
# - Sending Server(s) -
# Set these on the master and on any standby that will send replication data.
max_wal_senders = 5 # max number of walsender processes
# (change requires restart)
#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables
# - Standby Servers -
# These settings are ignored on a master server.
hot_standby = on # "on" allows queries during recovery
# (change requires restart)
# - Archiving -
archive_mode = on # enables archiving; off, on, or always
# (change requires restart)
archive_command = 'cp %p /archivelog/%f' # command to use to archive a logfile segment
# placeholders: %p = path of file to archive
# %f = file name only
[postgres@116 data]$ vi pg_hba.conf
# IPv4 local connections:
host all all 127.0.0.1/32 trust
host all all 108.88.3.0/24 trust //这里为了方便直接设置trust
[postgres@116 data]$ vi recovery.done //用于节点降为备节点的准备文件,在主节点不起作用,为了切换备节点时方便
#---------------------------------------------------------------------------
# STANDBY SERVER PARAMETERS
#---------------------------------------------------------------------------
#
# standby_mode
#
# When standby_mode is enabled, the PostgreSQL server will work as a
# standby. It will continuously wait for the additional XLOG records, using
# restore_command and/or primary_conninfo.
#
standby_mode = on
#
# primary_conninfo
#
# If set, the PostgreSQL server will try to connect to the primary using this
# connection string and receive XLOG records continuously.
#
primary_conninfo = 'host=108.88.3.180 port=5432 user=replication password=6220104' # e.g. 'host=localhost port=5432'
#
# If set, the PostgreSQL server will use the specified replication slot when
# connecting to the primary via streaming replication to control resource
# removal on the upstream node. This setting has no effect if primary_conninfo
# is not set.
#
#primary_slot_name = ''
#
# By default, a standby server keeps restoring XLOG records from the
# primary indefinitely. If you want to stop the standby mode, finish recovery
# and open the system in read/write mode, specify a path to a trigger file.
# The server will poll the trigger file path periodically and start as a
# primary server when it's found.
#
trigger_file = 'faile'
[postgres@node02 data]$ psql -d ghan
ghan=# \x
扩展显示已打开。
ghan=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+-----------------------------
pid | 8101
usesysid | 16396
usename | replication
application_name | walreceiver
client_addr | 108.88.3.180
client_hostname |
client_port | 51184
backend_start | 2016-07-11 13:10:58.32561+08
backend_xmin |
state | streaming
sent_location | 1/174A1240
write_location | 1/174A1240
flush_location | 1/174A1240
replay_location | 1/174A1240
sync_priority | 0
sync_state | async
ghan=#
二、备节点(180)环境配置及状态
[postgres@180 data]$ postgresql.conf
#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------
# - Settings -
wal_level = logical # minimal, replica, or logical
# (change requires restart)
#fsync = on # flush data to disk for crash safety
# (turning this off can cause
# unrecoverable data corruption)
#checkpoint_timeout = 5min # range 30s-1h
max_wal_size = 1GB
min_wal_size = 80MB
# - Sending Server(s) -
# Set these on the master and on any standby that will send replication data.
max_wal_senders = 5 # max number of walsender processes
# (change requires restart)
#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables
# - Standby Servers -
# These settings are ignored on a master server.
hot_standby = on # "on" allows queries during recovery
# (change requires restart)
# - Archiving -
archive_mode = on # enables archiving; off, on, or always
# (change requires restart)
archive_command = 'cp %p /archivelog/%f' # command to use to archive a logfile segment
# placeholders: %p = path of file to archive
# %f = file name only
[postgres@180 data]$ vi pg_hba.conf
# IPv4 local connections:
host all all 127.0.0.1/32 trust
host all all 108.88.3.0/24 trust //这里为了方便直接设置trust
[postgres@node01 data]$ vi recovery.conf //由于正处于备份节点,故所以为recovery.conf
#---------------------------------------------------------------------------
# STANDBY SERVER PARAMETERS
#---------------------------------------------------------------------------
#
# standby_mode
#
# When standby_mode is enabled, the PostgreSQL server will work as a
# standby. It will continuously wait for the additional XLOG records, using
# restore_command and/or primary_conninfo.
#
standby_mode = on
#
# primary_conninfo
#
# If set, the PostgreSQL server will try to connect to the primary using this
# connection string and receive XLOG records continuously.
#
primary_conninfo = 'host=108.88.3.116 port=5432 user=replication password=6220104' # e.g. 'host=localhost port=5432'
#
# If set, the PostgreSQL server will use the specified replication slot when
# connecting to the primary via streaming replication to control resource
# removal on the upstream node. This setting has no effect if primary_conninfo
# is not set.
#
#primary_slot_name = ''
#
# By default, a standby server keeps restoring XLOG records from the
# primary indefinitely. If you want to stop the standby mode, finish recovery
# and open the system in read/write mode, specify a path to a trigger file.
# The server will poll the trigger file path periodically and start as a
# primary server when it's found.
#
trigger_file = 'faile'
三、备节点(180)提升为主节点
[postgres@180 data]$ touch faile
[postgres@180 data]$ cd ../
[postgres@180 data]$ tail -f 999.log
LOG: trigger file found: faile
FATAL: terminating walreceiver process due to administrator command
LOG: invalid record length at 1/174A1320: wanted 24, got 0
LOG: redo done at 1/174A12E8
LOG: last completed transaction was at log time 2016-07-11 13:09:25.476144+08
LOG: selected new timeline ID: 9
LOG: archive recovery complete
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
四、新主节点(180)增加数据(表及数据)
[postgres@node01 data]$ psql -d ghan
WARNING: password file "/home/postgres/.pgpass" has group or world access; permissions should be u=rw (0600) or less
psql (9.6beta2)
Type "help" for help.
ghan=# create table t11(id int);
CREATE TABLE
ghan=#
insert into t11 values(generate_series(1,122179));
INSERT 0 122179
ghan=#
insert into t11 values(generate_series(1,122179));
INSERT 0 122179
ghan=#
insert into t11 values(generate_series(1,122179));
INSERT 0 122179
ghan=#
insert into t11 values(generate_series(1,122179));
INSERT 0 122179
ghan=# select count(*) from t11
;
count
--------
488716
(1 row)
ghan=#
insert into t8 values(generate_series(1,122179));
INSERT 0 122179
ghan=# select count(*) from t11
;
count
--------
488716
(1 row)
ghan=# select count(*) from t8;
count
--------
382375
(1 row)
ghan=# checkpoint;
五、旧主节点(116)降为备节点
[postgres@116 data]$ pg_ctl -D data/ stop -m fast
waiting for server to shut down.... done
server stopped
[postgres@116 data]$
[postgres@116 data]$ pg_rewind --target-pgdata data --source-server='host=108.88.3.180 port=5432 user=postgres dbname=ghan'
servers diverged at WAL position 1/174A1320 on timeline 8
rewinding from last common checkpoint at 1/174A1278 on timeline 8
Done!
[postgres@116 data]$
[postgres@116 data]$ mv recovery.done recovery.conf
[postgres@116 data]$ vi recovery.conf
primary_conninfo = 'host=108.88.3.180 port=5432 user=replication password=6220104' # e.g. 'host=localhost port=5432'
[postgres@116 data]$ pg_ctl -D data start -l ok.log //如果长时间没有做切换,部份新备机的pg_xlog会找不到部份文件而无法启动,这时需要时主节点下/archivelg下归档复制到新pg_xlog
server starting
[postgres@node02 data]$ tail -f ok.log
LOG: database system was interrupted while in recovery at log time 2016-07-11 13:40:26 CST
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: redo starts at 1/174A1240
LOG: invalid record length at 1/1A33B220: wanted 24, got 0
LOG: started streaming WAL from primary at 1/1A000000 on timeline 9
LOG: consistent recovery state reached at 1/1A33B258
LOG: database system is ready to accept read only connections
[postgres@116 data]$ psql -d ghan
psql (9.4.4, 服务器 9.6beta2)
警告:psql 主版本9.4,服务器主版本为9.6.
一些psql功能可能无法工作.
输入 "help" 来获取帮助信息.
ghan=# select count(*) from t8;
count
--------
382375
(1 行记录)
ghan=# select count(*) from t11;
count
--------
488716
(1 行记录)
ghan=#
六、主备节点状态
备节点状态:
postgres 8271 1 0 13:48 pts/0 00:00:00 /usr/local/pg9.6/bin/postgres -D data
postgres 8272 8271 0 13:48 ? 00:00:00 postgres: startup process recovering 00000009000000010000001A
postgres 8274 8271 0 13:48 ? 00:00:00 postgres: checkpointer process
postgres 8275 8271 0 13:48 ? 00:00:00 postgres: writer process
postgres 8276 8271 0 13:48 ? 00:00:00 postgres: wal receiver process streaming 1/1A33E648
postgres 8277 8271 0 13:48 ? 00:00:00 postgres: stats collector process
postgres 8290 7839 0 13:50 pts/0 00:00:00 ps -ef
[postgres@116 data]$
主节点
postgres 20924 1 0 13:09 pts/0 00:00:00 /usr/local/pg9.6/bin/postgres -D data
postgres 20930 20924 0 13:10 ? 00:00:00 postgres: checkpointer process
postgres 20931 20924 0 13:10 ? 00:00:00 postgres: writer process
postgres 20932 20924 0 13:10 ? 00:00:00 postgres: stats collector process
postgres 21009 20924 0 13:40 ? 00:00:00 postgres: wal writer process
postgres 21010 20924 0 13:40 ? 00:00:00 postgres: autovacuum launcher process
postgres 21011 20924 0 13:40 ? 00:00:00 postgres: archiver process last was 000000090000000100000019
postgres 21044 20924 0 13:47 ? 00:00:00 postgres: wal sender process replication 108.88.3.116(36878) streaming 1/1A33E648
postgres 21055 20731 0 13:49 pts/0 00:00:00 ps -ef
[postgres@180 archivelog]$
ghan=# \x
Expanded display is on.
ghan=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid | 21044
usesysid | 16396
usename | replication
application_name | walreceiver
client_addr | 108.88.3.116
client_hostname |
client_port | 36878
backend_start | 2016-07-11 13:47:55.931392+08
backend_xmin |
state | streaming
sent_location | 1/1A33E728
write_location | 1/1A33E728
flush_location | 1/1A33E728
replay_location | 1/1A33E728
sync_priority | 0
sync_state | async
ghan=#
错误解决:
pg_rewind—使一个PostgreSQL数据目录与另一个数据目录(该目录从第一个PostgreSQL数据目录创建而来)一致。
描述
pg_rewind是一个在集群的时间线参数偏离之后,用于使一个PostgreSQL集群与另一个相同集群的拷贝同步的工具。一个典型的场景是在故障转移之后,
让一个老的主服务器重新在线作为一个standby跟随新主服务器。
其结果相当于使用源数据目录替换目标数据目录。所有的文件都被拷贝,包括配置文件。与做一个基础备份或者像rsync这样的工具相比,
pg_rewind的优势是pg_rewind不需要读取所有集群中没有更改的文件。当数据库很大,并且只有一小部分不同的集群之间,使它的速度快得多。
pg_rewind检查源集群与目标集群的时间线历史来检测它们产生分歧的点,并希望在目标集群的pg_xlog目录找到WAL回到分歧点的所有方式。
在典型的故障转移场景:目标集群在分歧之后立即被关闭,那是没有问题的,但是,如果目标集群在分歧之后运行了很长一段时间,老的WAL文件可能不存在了。
在这种情况下,它们可以手动从WAL归档复制到pg_xlog目录。目前不支持从一个WAL归档中自动获取丢失的文件。
在运行pg_rewind之后,当目标服务器第一次被启动,它将进入恢复模式并重放从分歧点之后源服务器产生的所有WAL。
当pg_rewind被运行时,如果一些 WAL在源服务器上不再可用,因此不能用pg_rewind回话复制,当目标服务器被启动时时可以的。
这可以通过在目标数据目录创建一个带有合适的restore_command命令的recovery.conf文件来实现。
阅读(4542) | 评论(0) | 转发(0) |