About PostgreSQL's LC_CTYPE - skykiker - ChinaUnix Blog

About PostgreSQL's LC_CTYPE

Category: MySQL/PostgreSQL

2014-12-08 12:52:51

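LC_CTYPE defines a locale's character classification: which characters are letters, which are digits, how upper and lower case relate, and so on. PostgreSQL supports locale-dependent behavior, and underneath it calls interfaces provided by the operating system, such as isupper() for testing whether a character is uppercase. One would therefore expect character classification in PostgreSQL to match the OS, but testing shows some differences.

According to the PostgreSQL manual, LC_CTYPE affects the following features:

upper, lower, initcap
Case-insensitive pattern matching
Regular expression matching that uses character classes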

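The tests below were run on CentOS 6.5 with PostgreSQL 9.3.
PostgreSQL's LC_CTYPE value can be set at initdb or createdb time, or through a collation (effectively a combination of LC_COLLATE and LC_CTYPE) in a table definition or an SQL expression. The tests below set it per expression.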

With LC_CTYPE set to C, fullwidth Latin letters are not recognized.

postgres=# select upper('ａ' collate "C" );
ａ
postgres=# select lower('Ａ' collate "C" );
Ａ
postgres=# select initcap('ａａａ' collate "C" );
ａａａ
postgres=# select 'ａ' ilike 'Ａ' collate "C";
f

With LC_CTYPE set to zh_CN, fullwidth Latin letters are recognized.

postgres=# select upper('ａ' collate "zh_CN");
Ａ
postgres=# select lower('Ａ' collate "zh_CN");
ａ
postgres=# select initcap('ａａａ' collate "zh_CN");
Ａａａ
postgres=# select 'ａ' ilike 'Ａ' collate "zh_CN";
t

Regular expression character classes, however, do not recognize fullwidth Latin letters regardless of the locale.

postgres=# select 'Ａ' collate "C" ~ '[[:upper:]]';
f
postgres=# select '１' collate "C" ~ '[[:alnum:]]';
f
postgres=# select 'Ａ' collate "zh_CN" ~ '[[:upper:]]';
f
postgres=# select '１' collate "zh_CN" ~ '[[:alnum:]]';
f

The OS, however, does support them.

[chenhj@hanode1 ~]$ export LC_ALL=C
[chenhj@hanode1 ~]$ echo 'Ａ' |grep '[[:upper:]]';
[chenhj@hanode1 ~]$ echo '１' |grep '[[:alnum:]]';
[chenhj@hanode1 ~]$ export LC_ALL=zh_CN.utf8
[chenhj@hanode1 ~]$ echo 'Ａ' |grep '[[:upper:]]';
Ａ
[chenhj@hanode1 ~]$ echo '１' |grep '[[:alnum:]]';
１
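
The grep check above goes through the C library's locale-aware classification. As a rough sketch of the same kind of check in C, the helper below switches LC_CTYPE temporarily and calls iswupper(). The name sketch_iswupper_in_locale is hypothetical and is not part of PostgreSQL or libc. The sketch assumes a platform such as glibc where wchar_t values are Unicode code points, and the result for any locale other than C depends on which locales are installed.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wctype.h>

/*
 * Hypothetical helper: returns 1 if wc is uppercase under the named
 * LC_CTYPE locale, 0 if it is not, and -1 if the locale cannot be set
 * or memory runs out. The previous LC_CTYPE setting is restored.
 */
static int
sketch_iswupper_in_locale(const char *locale_name, wint_t wc)
{
    const char *current = setlocale(LC_CTYPE, NULL);
    char       *saved = NULL;
    int         result;

    /* Copy the current name; later setlocale() calls may overwrite it. */
    if (current != NULL)
    {
        saved = malloc(strlen(current) + 1);
        if (saved == NULL)
            return -1;
        strcpy(saved, current);
    }

    if (setlocale(LC_CTYPE, locale_name) == NULL)
    {
        free(saved);
        return -1;
    }

    result = (iswupper(wc) != 0);

    if (saved != NULL)
    {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
    return result;
}
```

Calling sketch_iswupper_in_locale("zh_CN.utf8", 0xFF21) would correspond to the grep test for 'Ａ', provided that locale is installed on the machine.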

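Why is that?
A look at PostgreSQL's regular expression code gives the answer. For performance, PostgreSQL precomputes character class membership and caches it. The cache covers only a limited range: at most the characters whose pg_wchar value is between 0 and 0x7FF (with UTF-8 encoding, the pg_wchar value corresponds to the Unicode code point). Every other character is treated as not matching.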

src/backend/regex/regc_pg_locale.c

pg_ctype_get_cache(pg_wc_probefunc probefunc)
{
    ...
        case PG_REGEX_LOCALE_WIDE:
        case PG_REGEX_LOCALE_WIDE_L:
            max_chr = (pg_wchar) 0x7FF;
    ...
    for (cur_chr = 0; cur_chr <= max_chr; cur_chr++)
    {
        if ((*probefunc) (cur_chr))
            nmatches++;
        else if (nmatches > 0)
        {
            if (!store_match(pcc, cur_chr - nmatches, nmatches))
                goto out_of_memory;
            nmatches = 0;
        }
    }
    ...
}
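
The effect of that bounded loop can be illustrated with a much simplified sketch. This is not PostgreSQL's implementation: the real code stores matching ranges through store_match, while the version below uses a plain boolean array, and every name prefixed with sketch_ is hypothetical. The probe function is also a stand-in that hard-codes ASCII and fullwidth uppercase letters instead of asking the OS locale.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed cache bound, mirroring the 0x7FF limit in the excerpt. */
#define SKETCH_MAX_CACHED_CHR 0x7FFu

/* Precomputed class membership for code points 0..SKETCH_MAX_CACHED_CHR. */
typedef struct
{
    bool member[SKETCH_MAX_CACHED_CHR + 1];
} sketch_class_cache;

/*
 * Stand-in probe: ASCII A-Z and fullwidth A-Z (U+FF21..U+FF3A) count
 * as uppercase. A real probe would consult the locale.
 */
static bool
sketch_probe_upper(uint32_t c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 0xFF21u && c <= 0xFF3Au);
}

/* Run the probe once per cached code point, as the bounded loop does. */
static void
sketch_build_cache(sketch_class_cache *cache, bool (*probe) (uint32_t))
{
    uint32_t    c;

    for (c = 0; c <= SKETCH_MAX_CACHED_CHR; c++)
        cache->member[c] = probe(c);
}

/* Anything beyond the cached range is reported as not in the class. */
static bool
sketch_in_class(const sketch_class_cache *cache, uint32_t c)
{
    if (c > SKETCH_MAX_CACHED_CHR)
        return false;
    return cache->member[c];
}
```

With this sketch, the probe itself accepts U+FF21, yet a lookup through the cache rejects it because the code point lies past the cached range, which is the same shape as the mismatch seen in the regular expression tests.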

src/backend/utils/mb/wchar.c

/*
 * convert UTF8 string to pg_wchar (UCS-4)
 * caller must allocate enough space for "to", including a trailing zero
 * len: length of from.
 * "from" not necessarily null terminated.
 */
static int
pg_utf2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
{
    int         cnt = 0;
    uint32      c1,
                c2,
                c3,
                c4;

    while (len > 0 && *from)
    {
        if ((*from & 0x80) == 0)
        {
            *to = *from++;
            len--;
        }
        else if ((*from & 0xe0) == 0xc0)
        {
            if (len < 2)
                break;          /* drop trailing incomplete char */
            c1 = *from++ & 0x1f;
            c2 = *from++ & 0x3f;
            *to = (c1 << 6) | c2;
            len -= 2;
        }
        else if ((*from & 0xf0) == 0xe0)
        {
            if (len < 3)
                break;          /* drop trailing incomplete char */
            c1 = *from++ & 0x0f;
            c2 = *from++ & 0x3f;
            c3 = *from++ & 0x3f;
            *to = (c1 << 12) | (c2 << 6) | c3;
            len -= 3;
        }
        else if ((*from & 0xf8) == 0xf0)
        {
            if (len < 4)
                break;          /* drop trailing incomplete char */
            c1 = *from++ & 0x07;
            c2 = *from++ & 0x3f;
            c3 = *from++ & 0x3f;
            c4 = *from++ & 0x3f;
            *to = (c1 << 18) | (c2 << 12) | (c3 << 6) | c4;
            len -= 4;
        }
        else
        {
            /* treat a bogus char as length 1; not ours to raise error */
            *to = *from++;
            len--;
        }
        to++;
        cnt++;
    }
    *to = 0;
    return cnt;
}
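
To see what this conversion means for a fullwidth letter, here is a small standalone sketch that decodes only the first UTF-8 sequence of a buffer and compares the result against the 0x7FF bound. It follows the same bit masks as the function above but is not PostgreSQL code; the sketch_ names and the SKETCH_INVALID_CODEPOINT sentinel are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID_CODEPOINT 0xFFFFFFFFu
#define SKETCH_CACHE_LIMIT       0x7FFu

/*
 * Decode the first UTF-8 sequence in buf (len bytes available).
 * Returns SKETCH_INVALID_CODEPOINT for an empty or truncated input
 * or an unrecognized lead byte. Continuation bytes are not validated.
 */
static uint32_t
sketch_utf8_first_codepoint(const unsigned char *buf, size_t len)
{
    if (len == 0)
        return SKETCH_INVALID_CODEPOINT;

    if ((buf[0] & 0x80) == 0)           /* 1 byte: 0xxxxxxx */
        return buf[0];

    if ((buf[0] & 0xe0) == 0xc0)        /* 2 bytes: 110xxxxx 10xxxxxx */
    {
        if (len < 2)
            return SKETCH_INVALID_CODEPOINT;
        return ((uint32_t) (buf[0] & 0x1f) << 6) |
               (uint32_t) (buf[1] & 0x3f);
    }

    if ((buf[0] & 0xf0) == 0xe0)        /* 3 bytes: 1110xxxx ... */
    {
        if (len < 3)
            return SKETCH_INVALID_CODEPOINT;
        return ((uint32_t) (buf[0] & 0x0f) << 12) |
               ((uint32_t) (buf[1] & 0x3f) << 6) |
               (uint32_t) (buf[2] & 0x3f);
    }

    if ((buf[0] & 0xf8) == 0xf0)        /* 4 bytes: 11110xxx ... */
    {
        if (len < 4)
            return SKETCH_INVALID_CODEPOINT;
        return ((uint32_t) (buf[0] & 0x07) << 18) |
               ((uint32_t) (buf[1] & 0x3f) << 12) |
               ((uint32_t) (buf[2] & 0x3f) << 6) |
               (uint32_t) (buf[3] & 0x3f);
    }

    return SKETCH_INVALID_CODEPOINT;
}

/* True when a valid code point lies beyond the assumed cache bound. */
static int
sketch_is_beyond_cache(uint32_t cp)
{
    return cp != SKETCH_INVALID_CODEPOINT && cp > SKETCH_CACHE_LIMIT;
}
```

The fullwidth 'Ａ' is encoded as the bytes EF BC A1, which decode to 0xFF21 and fall beyond the bound, while the two-byte sequence C3 85 decodes to 0xC5 and stays inside it.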

The Unicode code points of fullwidth Latin letters, however, are above 0x7FF.
/usr/share/i18n/locales/i18n

...
upper /
...
% HALFWIDTH AND FULLWIDTH FORMS/
<UFF21>..<UFF3A>

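Trying a character within 0x7FF confirms that such characters are supported.
One example is the character 0xC5 (U+00C5, 'Å', if the value is read as a code point).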

That said, would anyone actually use such odd-looking characters in Chinese text?
