PostgreSQL的序列奇快无比的秘密-skykiker-ChinaUnix博客

博客访问： 2970464
博文数量： 199
博客积分： 1400
博客等级：上尉
技术积分： 4126
用户组：普通用户
注册时间： 2008-07-06 19:06

个人简介

半个PostgreSQL DBA，热衷于数据库相关的技术。我的ppt分享https://pan.baidu.com/s/1eRQsdAa https://github.com/chenhuajun https://chenhuajun.github.io

文章分类

全部博文（199）

其他（1）
citus（10）
greenlpum（1）
安全（1）
Pacemaker（3）
MySQL（21）
Symfoware（2）

Native（1）
分布式（0）
C（1）
Solaris（1）
Linux（11）
C#（3）
故障案例（5）
NoSQL（4）
云计算（1）
Windows（3）
Database（13）
PostgreSQL（101）

安装配置（1）

HA（3）

doc（6）

Npgsql（1）

psqlODBC（2）
嵌入式开发（8）
Java开发（2）
生活随笔（3）
未分配的博文（4）

文章存档

2020年（5）

2019年（1）

2018年（12）

2017年（23）

2016年（43）

2015年（51）

2014年（27）

2013年（21）

2011年（1）

2010年（4）

2009年（5）

2008年（6）

我的朋友

1. 引言

曾经有篇流传较广的文章Don’t Assume PostgreSQL is Slow 展示了PostgreSQL生成序列的速度不亚于redis的INCRs。而在此之前我就曾做过相关的测试（参考PostgreSQL的序列的性能验证），发现PG生成序列的速度远高于同类的关系数据库。根据PostgreSQL的序列的性能验证中测试结果，在没有启用序列cache的情况下，PG的每次调用nextval('seq1')的额外时间消耗大概是0.3us，也就是333w/s，所以即使做批量数据加载也不用担心序列拖后腿；而Oracle的nocache序列生成速度大概只有5w/s,当Oracle序列 cache了50以上时，速度才开始接近pg。
这个结果很惊人，但细一想，PG快得有点离谱。为什么这么说？因为当时测试的select nextval('seq1')在4核虚机上达到了7w/s的qps，而那个测试环境估计支撑不了这么高的iops，所以猜测PG一定对序列做了某种优化而不是每次刷盘。

2. 代码分析

关键代码见src/backend/commands/sequence.c的nextval_internal()函数，有个叫SEQ_LOG_VALS的常量，控制PG每产生32个序列值才记一次WAL。这相当于PG对序列做了全局缓存，而PG的create sequence语法上的cache是指每个进程（也就是连接）的本地cache。由于全局缓存优化的已经足够好了，所以一般不需要再启用本地cache。

src/backend/commands/sequence.c

点击(此处)折叠或打开

fetch = cache = seq->cache_value;
log = seq->log_cnt;
...
/*
* Decide whether we should emit a WAL log record. If so, force up the
* fetch count to grab SEQ_LOG_VALS more values than we actually need to
* cache. (These will then be usable without logging.)
*
* If this is the first nextval after a checkpoint, we must force a new
* WAL record to be written anyway, else replay starting from the
* checkpoint would fail to advance the sequence past the logged values.
* In this case we may as well fetch extra values.
*/
if (log < fetch || !seq->is_called)//此处fetch值为1.每次调nextval()log_cnt会递减,减到0时设置logit标志位
{
/* forced log to satisfy local demand for values */
fetch = log = fetch + SEQ_LOG_VALS;
logit = true;
}
else
{
XLogRecPtr redoptr = GetRedoRecPtr();
if (PageGetLSN(page) <= redoptr)
{
/* last update of seq was before checkpoint */
fetch = log = fetch + SEQ_LOG_VALS;
logit = true;
}
}
...
if (logit && RelationNeedsWAL(seqrel))
{
xl_seq_rec xlrec;
XLogRecPtr recptr;
/*
* We don't log the current state of the tuple, but rather the state
* as it would appear after "log" more fetches. This lets us skip
* that many future WAL records, at the cost that we lose those
* sequence values if we crash.
*/
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
/* set values that will be saved in xlog */
seq->last_value = next;//WAL中记录的last_value是下一轮的序列值，所以pg crash再通过WAL恢复后，新产生的序列会跳过几个值
seq->is_called = true;
seq->log_cnt = 0;
xlrec.node = seqrel->rd_node;
XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
XLogRegisterData((char *) seqtuple.t_data, seqtuple.t_len);
recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
PageSetLSN(page, recptr);
}

3. 实测验证

3.1 WAL写入时机

PG通过内部log_cnt计数器控制是否要记录序列更新的WAL，新建的序列，计数器初始值为0。

postgres=# create sequence seq1;
CREATE SEQUENCE
postgres=# \d seq1
Sequence "public.seq1"
Column | Type | Value
---------------+---------+---------------------
sequence_name | name | seq1
last_value | bigint | 1
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 0
is_cycled | boolean | f
is_called | boolean | f

取得第一个序列值后，log_cnt变成32。

postgres=# select nextval('seq1');
nextval
---------
1
(1 row)
postgres=# \d seq1
Sequence "public.seq1"
Column | Type | Value
---------------+---------+---------------------
sequence_name | name | seq1
last_value | bigint | 1
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 32
is_cycled | boolean | f
is_called | boolean | t

这个过程中，通过strace监视"wal writer process"进程,可以发现发生了WAL写入和刷盘。

[root@localhost ~]# strace -efsync,write,fdatasync -p 2997
Process 2997 attached
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3209, si_uid=1001} ---
write(11, "\0", 1) = 1
write(3, "\207\320\5\0\1\0\0\0\0\300\273\t\0\0\0\0I\10\0\0\0\0\0\0\n\0\0004-\0\0\0"..., 8192) = 8192
fdatasync(3) = 0

以后每次获取序列值，log_cnt会减1，但只有log_cnt减到0，并从0重新跳到32的时候，strace中才能看到WAL写入。

postgres=# select nextval('seq1');
nextval
---------
2
(1 row)
postgres=# \d seq1
Sequence "public.seq1"
Column | Type | Value
---------------+---------+---------------------
sequence_name | name | seq1
last_value | bigint | 2
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 31
is_cycled | boolean | f
is_called | boolean | t

3.2 PG crash后的序列值

PG在记录序列的WAL时，记录的是当前值+32。所以如果PG crash再恢复后，将跳过一部分从未使用的序列值。这样做避免了产生重复序列的可能，但不能保证序列的连续，这是优化WAL写入而付出的必要代价。
下面是使用kill -9杀PG进程的情况。

点击(此处)折叠或打开

postgres=# select nextval('seq1');
nextval
---------
20
(1 row)
postgres=# \d seq1
Sequence "public.seq1"
Column | Type | Value
---------------+---------+---------------------
sequence_name | name | seq1
last_value | bigint | 20
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 28
is_cycled | boolean | f
is_called | boolean | t
postgres=# select nextval('seq1');
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
postgres=# select nextval('seq1');
nextval
---------
49
(1 row)

不过，正常关闭或重启PG是不会出现这种问题的。

阅读(4715) | 评论(0) | 转发(0) |

上一篇：为什么Mysql explain extended中的filtered列值总是100%

下一篇：动态修改LVS real server配置的影响验证

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6