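Replica set data synchronization mechanism
After an instance starts, a replica set member first copies the data through an initial sync, and then keeps synchronizing with the primary by continuously applying the oplog. The following log reflects this process: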
Wed Jul 17 13:18:39.473 [rsSync] replSet initial sync pending
Wed Jul 17 13:18:39.473 [rsSync] replSet syncing to: 192.168.69.44:10000
Wed Jul 17 13:18:39.476 [rsSync] build index local.me { _id: 1 }
Wed Jul 17 13:18:39.479 [rsSync] build index done. scanned 0 total records. 0.002 secs
Wed Jul 17 13:18:39.481 [rsSync] build index local.replset.minvalid { _id: 1 }
Wed Jul 17 13:18:39.482 [rsSync] build index done. scanned 0 total records. 0 secs
Wed Jul 17 13:18:39.482 [rsSync] replSet initial sync drop all databases
Wed Jul 17 13:18:39.482 [rsSync] dropAllDatabasesExceptLocal 1
Wed Jul 17 13:18:39.483 [rsSync] replSet initial sync clone all databases
Wed Jul 17 13:18:39.483 [rsSync] replSet initial sync cloning db: test
Wed Jul 17 13:18:39.485 [FileAllocator] allocating new datafile /mongodb/sh1/data/test.ns, filling with zeroes...0.001 secs
Wed Jul 17 13:18:39.497 [rsSync] build index test.test { _id: 1 }
Wed Jul 17 13:18:39.498 [rsSync] fastBuildIndex dupsToDrop:0
Wed Jul 17 13:18:39.498 [rsSync] build index done. scanned 1 total records. 0 secs
Wed Jul 17 13:18:39.498 [rsSync] replSet initial sync data copy, starting syncup
Wed Jul 17 13:18:39.498 [rsSync] oplog sync 1 of 3
Wed Jul 17 13:18:39.499 [rsSync] oplog sync 2 of 3
Wed Jul 17 13:18:39.499 [rsSync] replSet initial sync building indexes
Wed Jul 17 13:18:39.499 [rsSync] replSet initial sync cloning indexes for : test
Wed Jul 17 13:18:39.500 [rsSync] oplog sync 3 of 3
Wed Jul 17 13:18:39.500 [rsSync] replSet initial sync finishing up
Wed Jul 17 13:18:39.527 [rsSync] replSet set minValid=51e6290b:1
Wed Jul 17 13:18:39.527 [rsSync] replSet RECOVERING
Wed Jul 17 13:18:39.527 [rsSync] replSet initial sync done
Wed Jul 17 13:18:40.384 [rsBackgroundSync] replSet syncing to: 192.168.69.44:10000
Wed Jul 17 13:18:40.527 [rsSyncNotifier] replset setting oplog notifier to 192.168.69.44:10000
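oplog: Replication across machines is implemented through the oplog. The primary writes a record of every operation to the oplog, and each record describes an insert, update, or delete on a document. A secondary copies the oplog and replays it to stay in sync with the primary. The oplog is a capped collection, so old entries are overwritten. If a secondary falls behind the primary by more than the oplog can hold, it is considered a stale node and has to perform a full resync from the primary, so set oplogSize in advance according to your actual workload.
In a replica set the oplog lives in the local.oplog.rs collection. Its size can be set at startup with --oplogSize; on 64-bit Linux and Windows the default oplog size is 5% of the free disk space.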
sh1:PRIMARY> db.printReplicationInfo()
configured oplog size: 23446.183007812502MB --configured oplog size
log length start to end: 23055secs (6.4hrs) --time span covered by the entries in the oplog
oplog first event time: Wed Jul 17 2013 09:48:00 GMT+0800 (CST) --time of the oldest entry
oplog last event time: Wed Jul 17 2013 16:12:15 GMT+0800 (CST) --time of the most recent entry
now: Wed Jul 17 2013 16:12:18 GMT+0800 (CST)
The following command presents the same information in a more readable form:
sh1:PRIMARY> db.getReplicationInfo()
{
"logSizeMB" : 23446.183007812502,
"usedMB" : 11.22,
"timeDiff" : 23521,
"timeDiffHours" : 6.53,
"tFirst" : "Wed Jul 17 2013 09:48:00 GMT+0800 (CST)",
"tLast" : "Wed Jul 17 2013 16:20:01 GMT+0800 (CST)",
"now" : "Wed Jul 17 2013 16:20:20 GMT+0800 (CST)"
}
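The timeDiff and timeDiffHours figures can be cross-checked by hand. The following is a minimal sketch in plain JavaScript that does not connect to MongoDB; the helper name oplogWindow is made up for illustration, and its inputs are the tFirst and tLast values copied from the sample output.

```javascript
// Hypothetical helper (not a MongoDB API): derive the oplog time window
// from the times of its first and last entries.
function oplogWindow(tFirst, tLast) {
  const seconds = Math.round((tLast.getTime() - tFirst.getTime()) / 1000);
  // Round the hours to two decimals, as the shell output does.
  const hours = Math.round((seconds / 3600) * 100) / 100;
  return { timeDiff: seconds, timeDiffHours: hours };
}

// tFirst and tLast from the sample output (GMT+0800).
const win = oplogWindow(
  new Date("2013-07-17T09:48:00+08:00"),
  new Date("2013-07-17T16:20:01+08:00")
);
console.log(win); // { timeDiff: 23521, timeDiffHours: 6.53 }
```

The window is the practical limit on how long a secondary can be offline and still catch up from the oplog alone.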
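Nodes synchronize data in two ways: initial sync (init sync) and continuous ("keep") replication.
Init sync happens in the following two situations:
1. A secondary joins for the first time: it performs an initial sync and copies all the data over.
2. A secondary is removed and later re-added: because its data has fallen behind, it resyncs by applying the newer oplog entries. If the node has fallen behind by more than the oplog size, that is, the entries it needs have already been overwritten, it starts a full copy of all the data. So when there is already a large amount of data, keep in mind the network load that a full copy causes when adding a new node, and avoid doing this during peak hours.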
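The decision described in the two cases can be sketched as a small function. This is a simplified model in plain JavaScript, not MongoDB's actual implementation: the names compareTs and chooseSyncMode are hypothetical, and timestamps use the { t, i } shape that the oplog entries in this article show.

```javascript
// Compare two oplog timestamps of the form { t: unixSeconds, i: counter }.
function compareTs(a, b) {
  return a.t !== b.t ? a.t - b.t : a.i - b.i;
}

// Simplified model (an assumption of this sketch): a member can catch up
// from the oplog only if the last entry it applied is not older than the
// oldest entry still held in the sync source's oplog.
function chooseSyncMode(lastApplied, sourceOplogFirst) {
  if (lastApplied === null) {
    return "initial sync"; // brand-new member with no data
  }
  if (compareTs(lastApplied, sourceOplogFirst) < 0) {
    return "full resync"; // needed entries were already overwritten
  }
  return "catch up from oplog";
}

const oldestOnSource = { t: 1374025680, i: 1 }; // illustrative value
console.log(chooseSyncMode(null, oldestOnSource));
console.log(chooseSyncMode({ t: 1374000000, i: 1 }, oldestOnSource));
console.log(chooseSyncMode({ t: 1374154241, i: 1 }, oldestOnSource));
```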
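Keep-mode replication:
This replication mode is continuous; it is how the primary and the secondaries synchronize data during normal operation.
After init sync, replication between nodes is real-time. Normally secondaries replicate from the primary, but a secondary may choose its sync source based on network latency, which can produce a chained replication topology.
If one secondary replicates from another, the two must have the same buildIndexes setting (the default is already true). Secondaries do not replicate from delayed or hidden members.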
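The selection rules just described can be illustrated with a rough sketch. It is plain JavaScript and only models the three rules stated here (no hidden members, no delayed members, matching buildIndexes, prefer lower latency); the real server logic considers more than this. The field names hidden, delayed and pingMs, the ping values, and the hosts ending in .47 and .48 are all invented for the example.

```javascript
// Hypothetical filter: which members may `self` sync from, nearest first.
function eligibleSyncSources(self, members) {
  return members
    .filter((m) => m.host !== self.host)                  // never sync from yourself
    .filter((m) => !m.hidden && !m.delayed)               // skip hidden and delayed members
    .filter((m) => m.buildIndexes === self.buildIndexes)  // buildIndexes must match
    .sort((a, b) => a.pingMs - b.pingMs);                 // prefer lower network latency
}

const self = { host: "192.168.69.46:10000", buildIndexes: true };
const members = [
  { host: "192.168.69.44:10000", hidden: false, delayed: false, buildIndexes: true, pingMs: 5 },
  { host: "192.168.69.45:10000", hidden: false, delayed: false, buildIndexes: true, pingMs: 1 },
  { host: "192.168.69.47:10000", hidden: true, delayed: false, buildIndexes: true, pingMs: 0 },
  { host: "192.168.69.48:10000", hidden: false, delayed: true, buildIndexes: true, pingMs: 0 },
];

const sources = eligibleSyncSources(self, members).map((m) => m.host);
console.log(sources); // [ '192.168.69.45:10000', '192.168.69.44:10000' ]
```

With these made-up latencies the nearest eligible member is another secondary rather than the primary, which is how a chained topology can arise.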
Changing the sync target manually:
sh0:SECONDARY> db.adminCommand({replSetSyncFrom:"192.168.69.40:10000"})
{
"syncFromRequested" : "192.168.69.40:10000",
"prevSyncTarget" : "192.168.69.40:10000",
"ok" : 1
}
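The synchronization process is as follows:
Copy data from the sync source (the source does not have to be the primary) and apply the oplog
Build the corresponding indexes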
sh1:PRIMARY> db.printSlaveReplicationInfo()
source: 192.168.69.45:10000
syncedTo: Wed Jul 17 2013 16:52:29 GMT+0800 (CST)
= 32 secs ago (0.01hrs)
source: 192.168.69.46:10000
syncedTo: Wed Jul 17 2013 16:52:29 GMT+0800 (CST)
= 32 secs ago (0.01hrs)
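The "secs ago" figure is simply the distance between syncedTo and the current time. A minimal sketch in plain JavaScript follows; the helper name replicationLag is hypothetical, and because the output does not print the current time, `now` is assumed here to be 32 seconds after syncedTo.

```javascript
// Hypothetical helper: how far behind a secondary is, in seconds and hours.
function replicationLag(syncedTo, now) {
  const secs = Math.round((now.getTime() - syncedTo.getTime()) / 1000);
  // Two-decimal hours, matching the "(0.01hrs)" style of the shell output.
  return { secs: secs, hrs: Math.round((secs / 3600) * 100) / 100 };
}

const syncedTo = new Date("2013-07-17T16:52:29+08:00");
const now = new Date(syncedTo.getTime() + 32 * 1000); // assumption, see lead-in
const lag = replicationLag(syncedTo, now);
console.log(lag); // { secs: 32, hrs: 0.01 }
```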
You can use a debug command to inspect the last oplog entry that a secondary has synced:
sh1:SECONDARY> rs.debug.getLastOpWritten()
{
"ts" : {
"t" : 1374154241,
"i" : 1
},
"h" : NumberLong("7643447760057625437"),
"v" : 2,
"op" : "i",
"ns" : "test.test",
"o" : {
"_id" : ObjectId("51e7ee01b5ed8bdc9436becf"),
"name" : 5258
}
}
You can view the oplog with db.oplog.rs.find().sort({$natural:-1}) in the local database:
{ "ts" : { "t" : 1374154241, "i" : 1 }, "h" : NumberLong("7643447760057625437"), "v" : 2, "op" : "i", "ns" : "test.test", "o" : { "_id" : ObjectId("51e7ee01b5ed8bdc9436becf"), "name" : 5258 } }
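oplog structure:
An entry consists of 7 parts: { ts: {}, h: {}, v: {}, op: {}, ns: {}, o: {}, o2: {} }
ts: an 8-byte timestamp, made up of a 4-byte unix timestamp plus a 4-byte incrementing counter
h: a hash value (2.0+), used to verify the continuity of the oplog
v: the version of the oplog entry format (2 in the entry shown above)
op: a 1-byte operation type
ns: the namespace the operation applies to
o: the document that corresponds to the operation
o2: present only for update operations, where o2 holds the match criteria (the _id of the target document) and o holds the actual changes
op operation types:
"i": insert operation
"u": update operation
"d": delete operation
"c": db cmd operation
"db": declares the current database (ns is set to the database name + '.')
"n": no op, an empty operation written periodically to keep the oplog current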
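To make the field layout concrete, here is a small sketch in plain JavaScript that reads the ts, op and ns fields of the sample entry shown earlier. The names OP_TYPES, describeOplogEntry and packTs are invented for this example, and the shell-only types NumberLong and ObjectId are replaced by strings so that the snippet runs outside the mongo shell.

```javascript
// Operation codes as listed in this article.
const OP_TYPES = {
  i: "insert",
  u: "update",
  d: "delete",
  c: "db command",
  db: "database declaration",
  n: "no-op",
};

// Summarize an entry: when it happened (UTC), its counter, its type, its namespace.
function describeOplogEntry(entry) {
  const when = new Date(entry.ts.t * 1000).toISOString();
  const kind = OP_TYPES[entry.op] || "unknown";
  return when + " #" + entry.ts.i + " " + kind + " on " + entry.ns;
}

// Pack ts into one 64-bit value: seconds in the high 32 bits, counter in the low 32.
function packTs(ts) {
  return (BigInt(ts.t) << 32n) | BigInt(ts.i);
}

const entry = {
  ts: { t: 1374154241, i: 1 },
  h: "7643447760057625437", // NumberLong in the shell
  v: 2,
  op: "i",
  ns: "test.test",
  o: { _id: "51e7ee01b5ed8bdc9436becf", name: 5258 }, // ObjectId in the shell
};

console.log(describeOplogEntry(entry)); // 2013-07-18T13:30:41.000Z #1 insert on test.test
console.log(packTs(entry.ts) >> 32n);   // 1374154241n
```

Note that the first four bytes of the sample _id (51e7ee01 in hex) equal 1374154241, the same second as ts.t, since an ObjectId begins with its creation time.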
See also:
http://www.cnblogs.com/daizhj/archive/2011/06/27/mongodb_sourcecode_oplog.html