1. 引言
曾经有篇流传较广的文章
Don’t Assume PostgreSQL is Slow 展示了PostgreSQL生成
序列的速度不亚于redis的INCR
s。而在此之前我就曾做过相关的测试(参考
PostgreSQL的序列的性能验证),发现PG
生成序列的速度远高于同类的关系数据库。
根据PostgreSQL的序列的性能验证 中测试结果,在没有启用序列cache的情况下,PG的每次调用nextval('seq1')的额外时间消耗大概是0.3us,也就是333w/s,所以即使做批量数据加载也不用担心序列拖后腿;而Oracle的nocache序列生成速度大概只有5w/s,当Oracle序列 cache了50以上时,速度才开始接近pg。
这个结果很惊人,但细一想,PG快得有点离谱。为什么这么说?因为当时测试的select nextval('seq1')在4核虚机上达到了7w/s的qps,而那个测试环境估计支撑不了这么高的iops,所以猜测PG一定对序列做了某种优化而不是每次刷盘。
2. 代码分析
关键代码见src/backend/commands/sequence.c的nextval_internal()函数,有个叫SEQ_LOG_VALS的常量,控制PG每产生32个序列值才记一次WAL。这相当于PG对序列做了全局缓存,而PG的create sequence语法上的cache是指每个进程(也就是连接)的本地cache。由于全局缓存优化的已经足够好了,所以一般不需要再启用本地cache。
src/backend/commands/sequence.c
-
fetch = cache = seq->cache_value;
-
log = seq->log_cnt;
-
...
-
/*
-
* Decide whether we should emit a WAL log record. If so, force up the
-
* fetch count to grab SEQ_LOG_VALS more values than we actually need to
-
* cache. (These will then be usable without logging.)
-
*
-
* If this is the first nextval after a checkpoint, we must force a new
-
* WAL record to be written anyway, else replay starting from the
-
* checkpoint would fail to advance the sequence past the logged values.
-
* In this case we may as well fetch extra values.
-
*/
-
if (log < fetch || !seq->is_called)//此处fetch值为1.每次调nextval()log_cnt会递减,减到0时设置logit标志位
-
{
-
/* forced log to satisfy local demand for values */
-
fetch = log = fetch + SEQ_LOG_VALS;
-
logit = true;
-
}
-
else
-
{
-
XLogRecPtr redoptr = GetRedoRecPtr();
-
-
if (PageGetLSN(page) <= redoptr)
-
{
-
/* last update of seq was before checkpoint */
-
fetch = log = fetch + SEQ_LOG_VALS;
-
logit = true;
-
}
-
}
-
...
-
if (logit && RelationNeedsWAL(seqrel))
-
{
-
xl_seq_rec xlrec;
-
XLogRecPtr recptr;
-
-
/*
-
* We don't log the current state of the tuple, but rather the state
-
* as it would appear after "log" more fetches. This lets us skip
-
* that many future WAL records, at the cost that we lose those
-
* sequence values if we crash.
-
*/
-
XLogBeginInsert();
-
XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
-
-
/* set values that will be saved in xlog */
-
seq->last_value = next;//WAL中记录的last_value是下一轮的序列值,所以pg crash再通过WAL恢复后,新产生的序列会跳过几个值
-
seq->is_called = true;
-
seq->log_cnt = 0;
-
-
xlrec.node = seqrel->rd_node;
-
-
XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
-
XLogRegisterData((char *) seqtuple.t_data, seqtuple.t_len);
-
-
recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
-
-
PageSetLSN(page, recptr);
-
}
3. 实测验证
3.1 WAL写入时机
PG通过内部log_cnt计数器控制是否要记录序列更新的WAL,新建的序列,计数器初始值为0。
-
postgres=# create sequence seq1;
-
CREATE SEQUENCE
-
postgres=# \d seq1
-
Sequence "public.seq1"
-
Column | Type | Value
-
---------------+---------+---------------------
-
sequence_name | name | seq1
-
last_value | bigint | 1
-
start_value | bigint | 1
-
increment_by | bigint | 1
-
max_value | bigint | 9223372036854775807
-
min_value | bigint | 1
-
cache_value | bigint | 1
-
log_cnt | bigint | 0
-
is_cycled | boolean | f
-
is_called | boolean | f
取得第一个序列值后,log_cnt变成32。
-
postgres=# select nextval('seq1');
-
nextval
-
---------
-
1
-
(1 row)
-
-
postgres=# \d seq1
-
Sequence "public.seq1"
-
Column | Type | Value
-
---------------+---------+---------------------
-
sequence_name | name | seq1
-
last_value | bigint | 1
-
start_value | bigint | 1
-
increment_by | bigint | 1
-
max_value | bigint | 9223372036854775807
-
min_value | bigint | 1
-
cache_value | bigint | 1
-
log_cnt | bigint | 32
-
is_cycled | boolean | f
-
is_called | boolean | t
这个过程中,通过strace监视"wal writer process"进程,可以发现发生了WAL写入和刷盘。
-
[root@localhost ~]# strace -efsync,write,fdatasync -p 2997
-
Process 2997 attached
-
-
-
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3209, si_uid=1001} ---
-
write(11, "\0", 1) = 1
-
write(3, "\207\320\5\0\1\0\0\0\0\300\273\t\0\0\0\0I\10\0\0\0\0\0\0\n\0\0004-\0\0\0"..., 8192) = 8192
-
fdatasync(3) = 0
以后每次获取序列值,log_cnt会减1,但只有log_cnt减到0,并从0重新跳到32的时候,strace中才能看到WAL写入。
-
postgres=# select nextval('seq1');
-
nextval
-
---------
-
2
-
(1 row)
-
-
postgres=# \d seq1
-
Sequence "public.seq1"
-
Column | Type | Value
-
---------------+---------+---------------------
-
sequence_name | name | seq1
-
last_value | bigint | 2
-
start_value | bigint | 1
-
increment_by | bigint | 1
-
max_value | bigint | 9223372036854775807
-
min_value | bigint | 1
-
cache_value | bigint | 1
-
log_cnt | bigint | 31
-
is_cycled | boolean | f
-
is_called | boolean | t
3.2 PG crash后的序列值
PG在记录序列的WAL时,记录的是当前值+32。所以如果PG crash再恢复后,将跳过一部分从未使用的序列值。这样做避免了产生重复序列的可能,但不能保证序列的连续,这是优化WAL写入而付出的必要代价。
下面是使用kill -9杀PG进程的情况。
-
postgres=# select nextval('seq1');
-
nextval
-
---------
-
20
-
(1 row)
-
-
postgres=# \d seq1
-
Sequence "public.seq1"
-
Column | Type | Value
-
---------------+---------+---------------------
-
sequence_name | name | seq1
-
last_value | bigint | 20
-
start_value | bigint | 1
-
increment_by | bigint | 1
-
max_value | bigint | 9223372036854775807
-
min_value | bigint | 1
-
cache_value | bigint | 1
-
log_cnt | bigint | 28
-
is_cycled | boolean | f
-
is_called | boolean | t
-
-
postgres=# select nextval('seq1');
-
server closed the connection unexpectedly
-
This probably means the server terminated abnormally
-
before or while processing the request.
-
The connection to the server was lost. Attempting reset: Succeeded.
-
postgres=# select nextval('seq1');
-
nextval
-
---------
-
49
-
(1 row)
不过,正常关闭或重启PG是不会出现这种问题的。
阅读(4669) | 评论(0) | 转发(0) |