A few notes on character set and collation in MySQL - laoliulaoliu - ChinaUnix Blog

A few notes on character set and collation in MySQL

Category: Mysql/postgreSQL

2015-06-04 13:34:42

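Source: http://zhongwei-leg.iteye.com/blog/899227

Whenever you create a table in MySQL you run into the concepts of character set and collation, and I had never understood them very well.

Things have been quiet these past two days, so I sorted out my notes.

First, what are character set and collation?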

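&. character set: a set of characters together with the encoding used to store them.

The familiar utf-8, GB2312, and GB18030 are mutually independent character sets, each with its own mapping from characters to bytes. utf-8 and GB18030 can encode every Unicode character, while GB2312 covers only a subset (mainly simplified Chinese).

So how should we understand the difference between Unicode and utf-8 or GB2312?
By analogy: there is an apple in front of you. English calls it "apple", and Chinese calls it "苹果".
The apple itself corresponds to Unicode, while utf-8 and GB2312 are like the different names that different languages give it. In the end they all describe the same thing.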
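The analogy can be made concrete with a small Python sketch (Python is only used here for illustration, and the variable names are my own). It encodes the same text with two character sets and shows that the bytes differ while the decoded text is identical:

```python
# One abstract string, two character sets: the bytes differ, the text does not.
text = "苹果"  # "apple" in Chinese

utf8_bytes = text.encode("utf-8")   # 3 bytes per character for this text
gb_bytes = text.encode("gb2312")    # 2 bytes per character for this text

print(utf8_bytes.hex(), len(utf8_bytes))
print(gb_bytes.hex(), len(gb_bytes))

# Decoding each byte sequence with its own character set gives back the same text.
assert utf8_bytes.decode("utf-8") == gb_bytes.decode("gb2312") == text
```

Decoding bytes with the wrong character set is what produces garbled text, which is why the character set of a column and of a client connection need to agree.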

&. collation: the comparison rules.

It specifies how data is sorted and how strings are compared. (This may sound abstract; it is explained in detail below.)

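The relationship between character set and collation

Software internationalization is the clear trend, and Unicode is the best choice for it. For performance reasons, though, latin1 is still the better option in some cases.

In the MySQL 5.0/5.1 releases that the references at the end of this post document, there are two character sets that support Unicode (utf8mb4 was added later, in MySQL 5.5.3):

1. ucs2: uses 16 bits to represent each Unicode character.

2. utf8: uses 1 to 3 bytes to represent each Unicode character, so it can only store characters in the Basic Multilingual Plane.

Which one to choose depends on the situation. For example, utf8 needs only one byte for an ASCII character, so when most user data is English or other text dominated by ASCII letters, utf8 saves storage space in the database. SQL Server reportedly uses ucs2, though I have my doubts.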
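As a rough illustration of the storage trade-off, the sketch below compares encoded sizes in Python. The helper `storage_bytes` is hypothetical, and UTF-16 without a byte-order mark stands in for ucs2, which is an assumption that only holds for characters in the Basic Multilingual Plane:

```python
def storage_bytes(text):
    """Return (utf8_size, ucs2_size) in bytes for text made of BMP characters."""
    # For BMP characters, UTF-16 without a byte-order mark yields the same
    # 2-bytes-per-character layout as ucs2, so it is used here as a stand-in.
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

print(storage_bytes("hello"))  # ASCII only: utf8 is half the size
print(storage_bytes("中文"))    # CJK characters: ucs2 is smaller
```

These are sizes of the encoded text only. The space MySQL actually uses also depends on the column type and storage engine.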

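Each character set has a number of collations. To list them, type the following in the mysql console:

mysql> show collation;

This prints one row per collation, including the collation name and the character set it belongs to.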

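Collation names follow one of two patterns:

1. {character set}_{language or other}_{ci or cs}

2. {character set}_bin

For example:

utf8_danish_ci

ci is short for case insensitive and cs for case sensitive, so the suffix states whether comparisons distinguish upper and lower case.

Oddly, none of the collations for the utf8 character set is cs. The only option that distinguishes case is utf8_bin.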
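The naming rule above can be sketched as a small parser. `parse_collation_name` is a hypothetical helper written for this post, not a MySQL API, and it assumes names that follow one of the two patterns listed:

```python
def parse_collation_name(name):
    """Split a collation name into (character set, language, suffix)."""
    parts = name.split("_")
    charset, suffix = parts[0], parts[-1]
    if suffix not in ("ci", "cs", "bin"):
        raise ValueError(f"unrecognized collation suffix in {name!r}")
    # Pattern 2 ({character set}_bin) has no language part.
    language = "_".join(parts[1:-1]) or None
    return charset, language, suffix

print(parse_collation_name("utf8_danish_ci"))
print(parse_collation_name("utf8_bin"))
```

Names that do not fit either pattern are rejected with a ValueError rather than guessed at.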

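So what is the difference between utf8_general_ci, utf8_unicode_ci, and utf8_danish_ci? Why does each of them exist?

Collations of the same character set differ in the accuracy of sorting and string comparison (the same two characters may sort differently in different languages) and in performance.

For example:

utf8_general_ci is less accurate than utf8_unicode_ci when sorting, although English users should see no difference. In return, it is slightly faster at sorting and comparison. For instance, utf8_general_ci does not support the German equivalence

ß = ss

which utf8_unicode_ci does.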
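The difference between a comparison that maps one character to one character and a comparison that supports expansions can be imitated in Python. This is only a loose analogy built on Python's own case rules and not MySQL's algorithm, and the two function names are my own:

```python
def equal_one_to_one(a, b):
    # str.lower() leaves "ß" as a single character, so no expansion happens.
    return a.lower() == b.lower()

def equal_with_expansion(a, b):
    # str.casefold() applies full Unicode case folding, which maps "ß" to "ss".
    return a.casefold() == b.casefold()

print(equal_one_to_one("Straße", "STRASSE"))
print(equal_with_expansion("Straße", "STRASSE"))
```

Only the second comparison treats the two spellings as equal. Handling expansions like this is part of why the more accurate collation costs more.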

utf8_danish_ci, compared with utf8_unicode_ci, adds the special sorting rules of Danish.
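A language-specific collation is essentially a different sort key. As a simplified sketch, assuming only that the Danish alphabet places æ, ø, å after z (real Danish collation has further rules), a custom key changes the order compared with plain code point sorting. `DANISH_ALPHABET` and `danish_key` are names invented for this example:

```python
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}

def danish_key(word):
    # Characters outside the alphabet sort after all known letters.
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]

words = ["ål", "zebra", "æble", "abe", "øl"]
print(sorted(words))                  # plain code point order
print(sorted(words, key=danish_key))  # simplified Danish order
```

By code point, "ål" comes before "æble" and "øl". With the Danish key it comes last.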

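Addendum:

1. When the table's character set is latin1 and a column is of type nvarchar, the column's character set automatically becomes utf8.

This shows that the database character set, table character set, and column character set override one another level by level.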
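The level-by-level override can be pictured as a lookup that prefers the most specific setting. The function below is a hypothetical model of that idea, not how MySQL is implemented:

```python
def effective_charset(server, database=None, table=None, column=None):
    """Return the most specific character set that was explicitly given."""
    for charset in (column, table, database):
        if charset is not None:
            return charset
    return server

# A latin1 table whose column was declared with a utf8 character set.
print(effective_charset("latin1", database="latin1", table="latin1", column="utf8"))
# A column with no explicit setting inherits from the table.
print(effective_charset("latin1", database="utf8", table="latin1"))
```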

2. How to make a comparison case sensitive under a ci collation:

The recommended form is

mysql> select * from pet where name = binary 'whistler';

This keeps the index on the column usable, whereas

mysql> select * from pet where binary name = 'whistler';

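prevents the index from being used, because the binary operator is applied to the column itself rather than to the constant.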
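What the binary comparison changes can be imitated in Python: a ci comparison ignores case, while a binary comparison looks at the encoded bytes. The two helpers are illustrative only, and they say nothing about index usage, which depends on the MySQL optimizer:

```python
def ci_equal(a, b):
    # Case-insensitive comparison, similar in spirit to a ci collation.
    return a.casefold() == b.casefold()

def binary_equal(a, b, encoding="utf-8"):
    # Binary comparison: equal only if the encoded bytes are identical.
    return a.encode(encoding) == b.encode(encoding)

print(ci_equal("Whistler", "whistler"))
print(binary_equal("Whistler", "whistler"))
```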

References:

1. What is the best collation to use for mysql with php.

2. Unicode Character Sets

http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

3. Show Collation Syntax

http://dev.mysql.com/doc/refman/5.1/en/show-collation.html

4. The Binary Operator

http://dev.mysql.com/doc/refman/5.1/en/charset-binary-op.html
