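As we all know, in Unicode UCS-2 is the internal code (a fixed-width 16-bit value per character), while UTF-8 is one way of serializing it as bytes. Every byte has 8 bits, and in UTF-8 the top two bits of each byte are especially important. Those two bits give four possible patterns: 00xxxxxx, 01xxxxxx, 10xxxxxx and 11xxxxxx.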
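According to the UTF-8 standard:
(1) Every byte that starts with 0 is compatible with ASCII. In other words, 0xxxxxxx needs no conversion; it is the ordinary ASCII code.
(2) Every byte that starts with 10 is never the first byte of a Unicode character; it always continues the byte before it. For example, 10110101 cannot be interpreted on its own and must be interpreted together with the preceding byte. If that byte also starts with 10, keep going backward.
(3) Every byte that starts with 11 is the first byte of a Unicode character and is followed by one or more bytes that start with 10. 110xxxxx (two 1s to the left of the leftmost 0) is followed by one 10xxxxxx byte; 1110xxxx (three 1s to the left of the leftmost 0) is followed by two 10xxxxxx bytes; and so on, up to 1111110x.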
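Rule (2) can be sketched in a few lines of Java. This is only an illustration: the class name Utf8Backtrack and its two methods are invented for this example, and the input is assumed to be well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Backtrack {
    // True for continuation bytes, i.e. bytes of the form 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Walks left from index i until a byte that is not 10xxxxxx is found.
    // Assumption: utf8 holds well-formed UTF-8, so a lead byte exists.
    static int startOfCharacter(byte[] utf8, int i) {
        while (i > 0 && isContinuation(utf8[i])) {
            i--;
        }
        return i;
    }

    public static void main(String[] args) {
        // "A" followed by U+4E2D encodes as 41 E4 B8 AD.
        byte[] bytes = "A\u4E2D".getBytes(StandardCharsets.UTF_8);
        // Starting from the last byte (AD), the scan stops at the lead byte E4.
        System.out.println(startOfCharacter(bytes, 3));
    }
}
```

Because a continuation byte can never be mistaken for a lead byte, a reader that lands in the middle of a stream can always resynchronize this way.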
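The full table is as follows:
1 byte   0xxxxxxx
2 bytes  110xxxxx 10xxxxxx
3 bytes  1110xxxx 10xxxxxx 10xxxxxx
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
(Clearly, for a byte that starts with 11, the number of 1s to the left of the leftmost 0 equals the number of bytes in the UTF-8 representation of that UCS character.)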
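The remark about counting the leading 1s translates directly into code. The following is a minimal sketch; the class and method names are made up for this example, and the return values 0 and -1 are just a convention chosen here.

```java
final class Utf8LeadByte {
    // Returns the sequence length announced by a byte:
    //   1..6 for a lead byte, 0 for a continuation byte (10xxxxxx),
    //   -1 for 0xFE and 0xFF, which fit no row of the table.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;
        int ones = 0;
        // Count the 1 bits to the left of the leftmost 0.
        for (int mask = 0x80; (v & mask) != 0; mask >>= 1) {
            ones++;
        }
        if (ones == 0) return 1;   // 0xxxxxxx: ASCII
        if (ones == 1) return 0;   // 10xxxxxx: not a lead byte
        if (ones <= 6) return ones;
        return -1;
    }

    public static void main(String[] args) {
        int[] samples = {0x2A, 0xC4, 0xEA, 0xAA};
        for (int s : samples) {
            System.out.println(Integer.toHexString(s) + " -> " + sequenceLength((byte) s));
        }
    }
}
```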
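With 6 bytes, the scheme above can represent up to 2^31 characters. In practice only UCS-4 is that large (and the current UTF-8 standard, RFC 3629, stops at 4 bytes because Unicode ends at U+10FFFF). UCS-2 has only 2^16 characters, so three bytes are enough. In other words, only the following formats are needed:
1 byte   0xxxxxxx
2 bytes  110xxxxx 10xxxxxx
3 bytes  1110xxxx 10xxxxxx 10xxxxxx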
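The numbers 2^31 and 2^16 follow from counting the x bits in each row. Here is a small sketch of that arithmetic (the class name Utf8Capacity is hypothetical): the lead byte of an n-byte sequence carries 7 - n payload bits, and each of the n - 1 continuation bytes carries 6.

```java
final class Utf8Capacity {
    // Number of payload (x) bits in an n-byte sequence, for n = 1..6.
    static int payloadBits(int n) {
        if (n < 1 || n > 6) {
            throw new IllegalArgumentException("n must be 1..6");
        }
        if (n == 1) {
            return 7; // 0xxxxxxx
        }
        // Lead byte: n ones, one zero, then 7 - n payload bits.
        // Each continuation byte: 10 followed by 6 payload bits.
        return (7 - n) + 6 * (n - 1);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 6; n++) {
            System.out.println(n + " byte(s): " + payloadBits(n) + " payload bits");
        }
    }
}
```

Three bytes give exactly 16 payload bits, which is why UCS-2 never needs a fourth byte, and six bytes give 31.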
You can try the following program to look at each byte of the UTF-8 encoding.
package com.ray.utf8;

import java.nio.charset.StandardCharsets;

public class UTF8Tester {

    // Formats the low 8 bits of n as an 8-digit binary string.
    private static String toBin(int n) {
        StringBuilder b = new StringBuilder();
        if (n < 0) n += 256; // map a signed byte (-128..-1) to 128..255
        for (int i = 7; i >= 0; i--) {
            b.append(((n >> i) & 1) == 1 ? '1' : '0');
        }
        return b.toString();
    }

    private static final String HEX = "0123456789ABCDEF";

    // Formats a byte value as two upper-case hex digits.
    private static String toHex(int n) {
        StringBuilder b = new StringBuilder();
        if (n < 0) n += 256;
        b.append(HEX.charAt(n >> 4));
        b.append(HEX.charAt(n & 0x0F));
        return b.toString();
    }

    // Prints one char as UCS-2 (hex, binary) and as UTF-8 (hex, binary).
    private static void printUTF8(char ch) {
        String unicode = toHex(ch >> 8) + toHex(ch & 0xFF);
        String unicodeBin = toBin(ch >> 8) + ' ' + toBin(ch & 0xFF);

        byte[] b = String.valueOf(ch).getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        StringBuilder bin = new StringBuilder();
        for (int i = 0; i < b.length; i++) {
            hex.append(toHex(b[i])).append(' ');
            bin.append(toBin(b[i])).append(' ');
        }
        System.out.println(String.format("U+%s %s : %-8s : %s",
                unicode, unicodeBin, hex.toString().trim(), bin.toString().trim()));
    }

    public static void main(String[] args) {
        printUTF8('\u002A');
        printUTF8('\u012A');
        printUTF8('\u012B');
        printUTF8('\u052C');
        printUTF8('\u013C');
        printUTF8('\uAA2A');
        printUTF8('\uFDFD');
    }

}
Output:
U+002A 00000000 00101010 : 2A       : 00101010
U+012A 00000001 00101010 : C4 AA    : 11000100 10101010
U+012B 00000001 00101011 : C4 AB    : 11000100 10101011
U+052C 00000101 00101100 : D4 AC    : 11010100 10101100
U+013C 00000001 00111100 : C4 BC    : 11000100 10111100
U+AA2A 10101010 00101010 : EA A8 AA : 11101010 10101000 10101010
U+FDFD 11111101 11111101 : EF B7 BD : 11101111 10110111 10111101
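Converting between UCS-2 and UTF-8 involves nothing but bit operations. Unlike GBK, no code table lookup is needed, so the conversion is very efficient.
First, UTF-8 to UCS-2:
(1) A byte that starts with 0 is simply padded with one zero byte in front to make two bytes (0xxxxxxx ==> 00000000 0xxxxxxx);
(2) For a byte that starts with 110 (110xxxxx), take the 10xxxxxx byte that follows it. Put five zeros at the left of the high byte, then place the 11 payload bits to the right of them (110xxxxx 10yyyyyy ==> 00000xxx xxyyyyyy);
(3) For a byte that starts with 1110 (1110xxxx), take the two 10xxxxxx bytes that follow it. Count the payload bits: there are exactly 16 of them, and they form the two bytes (1110xxxx 10yyyyyy 10zzzzzz ==> xxxxyyyy yyzzzzzz).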
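As a worked example of the three decoding rules, here is a minimal sketch that decodes a single sequence with masks and shifts. The class Utf8DecodeSketch and the method decodeAt are names invented for this illustration; it assumes well-formed input that stays within UCS-2 and does no error recovery.

```java
final class Utf8DecodeSketch {
    // Decodes the 1-, 2- or 3-byte sequence that starts at utf8[i].
    // Assumption: the input is well-formed and the sequence is complete.
    static char decodeAt(byte[] utf8, int i) {
        int b1 = utf8[i] & 0xFF;
        if (b1 < 0x80) {
            // 0xxxxxxx ==> 00000000 0xxxxxxx
            return (char) b1;
        }
        if ((b1 & 0xE0) == 0xC0) {
            // 110xxxxx 10yyyyyy ==> 00000xxx xxyyyyyy
            int b2 = utf8[i + 1] & 0x3F;
            return (char) (((b1 & 0x1F) << 6) | b2);
        }
        if ((b1 & 0xF0) == 0xE0) {
            // 1110xxxx 10yyyyyy 10zzzzzz ==> xxxxyyyy yyzzzzzz
            int b2 = utf8[i + 1] & 0x3F;
            int b3 = utf8[i + 2] & 0x3F;
            return (char) (((b1 & 0x0F) << 12) | (b2 << 6) | b3);
        }
        throw new IllegalArgumentException("not a 1-3 byte lead byte at index " + i);
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD};
        // E4 B8 AD carries the bits 0100 111000 101101, i.e. U+4E2D.
        System.out.println(Integer.toHexString(decodeAt(sample, 0)));
    }
}
```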
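Next, UCS-2 to UTF-8:
(1) A value no greater than 0x007F (00000000 01111111) is converted directly to a single byte, which is plain ASCII;
(2) A value no greater than 0x07FF (00000111 11111111) is converted to two bytes, with the rightmost 11 bits distributed into 110xxxxx 10yyyyyy, that is, 00000aaa bbbbbbbb ==> 110aaabb 10bbbbbb;
(3) Everything else is converted to three bytes, again by distributing the 16 bits across them, that is, aaaaaaaa bbbbbbbb ==> 1110aaaa 10aaaabb 10bbbbbb.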
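The three encoding rules can be checked against the JDK's own encoder with a short sketch like the one below. Utf8EncodeSketch and encode are hypothetical names used only here. Surrogate code units (U+D800 to U+DFFF) are skipped in the comparison, because the JDK treats them as halves of a supplementary character rather than as UCS-2 values.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8EncodeSketch {
    // Encodes one UCS-2 value into 1, 2 or 3 UTF-8 bytes.
    static byte[] encode(char ch) {
        if (ch <= 0x007F) {
            return new byte[] {(byte) ch};
        }
        if (ch <= 0x07FF) {
            // 00000aaa bbbbbbbb ==> 110aaabb 10bbbbbb
            return new byte[] {
                (byte) (0xC0 | (ch >> 6)),
                (byte) (0x80 | (ch & 0x3F))
            };
        }
        // aaaaaaaa bbbbbbbb ==> 1110aaaa 10aaaabb 10bbbbbb
        return new byte[] {
            (byte) (0xE0 | (ch >> 12)),
            (byte) (0x80 | ((ch >> 6) & 0x3F)),
            (byte) (0x80 | (ch & 0x3F))
        };
    }

    public static void main(String[] args) {
        // Compare every non-surrogate UCS-2 value with the JDK encoder.
        for (int c = 0; c <= 0xFFFF; c++) {
            char ch = (char) c;
            if (Character.isSurrogate(ch)) {
                continue;
            }
            byte[] expected = String.valueOf(ch).getBytes(StandardCharsets.UTF_8);
            if (!Arrays.equals(expected, encode(ch))) {
                throw new IllegalStateException("mismatch at U+" + Integer.toHexString(c));
            }
        }
        System.out.println("all non-surrogate values match");
    }
}
```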
Here is the code that implements the conversion:
package com.ray.utf8;

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Utils {

    // Bit masks named after their binary pattern. Values above 127 are
    // written as "n - 256" so that they fit into a signed byte.
    private final static byte B_10000000 = 128 - 256;
    private final static byte B_11000000 = 192 - 256;
    private final static byte B_11100000 = 224 - 256;
    private final static byte B_11110000 = 240 - 256;
    private final static byte B_00011100 = 28;
    private final static byte B_00000011 = 3;
    private final static byte B_00111111 = 63;
    private final static byte B_00001111 = 15;
    private final static byte B_00111100 = 60;

    // Decodes UTF-8 bytes into UCS-2 chars. Sequences of four or more bytes
    // (outside UCS-2) and malformed sequences are skipped.
    public static char[] toUCS2(byte[] utf8Bytes) {
        CharList charList = new CharList();
        byte b2 = 0, b3 = 0;
        int ub1 = 0, ub2 = 0;

        for (int i = 0; i < utf8Bytes.length; i++) {
            byte b = utf8Bytes[i];
            if (isNotHead(b)) {
                // Stray continuation byte (10xxxxxx): skip it.
                continue;
            } else if (b >= 0) {
                // 0xxxxxxx: plain ASCII. ">= 0" so that NUL is handled here too.
                charList.add((char) b);
            } else if ((b & B_11110000) == B_11110000) {
                // 11110xxx or longer: not representable in UCS-2, skip it.
                continue;
            } else if ((b & B_11100000) == B_11100000) {
                // 1110xxxx: two continuation bytes must follow.
                if (i + 1 >= utf8Bytes.length) break; // truncated input
                b2 = utf8Bytes[i+1];
                if (!isNotHead(b2)) continue;
                i++;
                if (i + 1 >= utf8Bytes.length) break; // truncated input
                b3 = utf8Bytes[i+1];
                if (!isNotHead(b3)) continue;
                i++;
                // 1110xxxx 10yyyyyy 10zzzzzz ==> xxxxyyyy yyzzzzzz
                ub1 = ((b & B_00001111) << 4) + ((b2 & B_00111100) >> 2);
                ub2 = ((b2 & B_00000011) << 6) + ((b3 & B_00111111));
                charList.add(makeChar(ub1, ub2));
            } else {
                // 110xxxxx: one continuation byte must follow.
                if (i + 1 >= utf8Bytes.length) break; // truncated input
                b2 = utf8Bytes[i+1];
                if (!isNotHead(b2)) continue;
                i++;
                // 110xxxxx 10yyyyyy ==> 00000xxx xxyyyyyy
                ub1 = (b & B_00011100) >> 2;
                ub2 = ((b & B_00000011) << 6) + (b2 & B_00111111);
                charList.add(makeChar(ub1, ub2));
            }
        }

        return charList.toArray();
    }

    // True for continuation bytes of the form 10xxxxxx.
    private static boolean isNotHead(byte b) {
        return (b & B_11000000) == B_10000000;
    }

    // Combines a high byte and a low byte (both 0..255) into one char.
    private static char makeChar(int b1, int b2) {
        return (char) ((b1 << 8) + b2);
    }

    // Encodes UCS-2 chars as UTF-8. Surrogate code units are encoded like any
    // other char, so supplementary characters do not come out as standard UTF-8.
    public static byte[] fromUCS2(char[] ucs2Array) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        for (int i = 0; i < ucs2Array.length; i++) {
            char ch = ucs2Array[i];
            if (ch <= 0x007F) {
                baos.write(ch);
            } else if (ch <= 0x07FF) {
                // 00000aaa bbbbbbbb ==> 110aaabb 10bbbbbb
                int ub1 = ch >> 8;
                int ub2 = ch & 0xFF;
                int b1 = B_11000000 + (ub1 << 2) + (ub2 >> 6);
                int b2 = B_10000000 + (ub2 & B_00111111);
                baos.write(b1); // write(int) keeps only the low 8 bits
                baos.write(b2);
            } else {
                // aaaaaaaa bbbbbbbb ==> 1110aaaa 10aaaabb 10bbbbbb
                int ub1 = ch >> 8;
                int ub2 = ch & 0xFF;
                int b1 = B_11100000 + (ub1 >> 4);
                int b2 = B_10000000 + ((ub1 & B_00001111) << 2) + (ub2 >> 6);
                int b3 = B_10000000 + (ub2 & B_00111111);
                baos.write(b1);
                baos.write(b2);
                baos.write(b3);
            }
        }
        return baos.toByteArray();
    }

    // Minimal growable char buffer.
    private static class CharList {
        // Allocated up front so that toArray() also works for empty input.
        private char[] data = new char[16];
        private int used = 0;
        public void add(char c) {
            if (used >= data.length) {
                data = Arrays.copyOf(data, data.length * 2);
            }
            data[used++] = c;
        }
        public char[] toArray() {
            return Arrays.copyOf(data, used);
        }
    }

    // Checks that toUCS2 reverses the JDK's UTF-8 encoding of s.
    private static void assert1(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        char[] c = toUCS2(b);
        if (!s.equals(new String(c))) {
            throw new RuntimeException("Can not pass assert1 for: " + s);
        }
    }

    // Checks that fromUCS2 produces the same bytes as the JDK's UTF-8 encoder.
    private static void assert2(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        byte[] b2 = fromUCS2(s.toCharArray());
        if (!Arrays.equals(b, b2)) {
            throw new RuntimeException("Can not pass assert2 for: " + s);
        }
    }

    public static void main(String[] args) {
        assert1("test");
        assert1("中文测试");
        assert1("A中V文c测d试E");
        assert1("\u052CA\u052CBc测");
        assert1("a\u0000b"); // NUL must decode as a one-byte character
        assert1("");         // empty input must not fail

        assert2("test");
        assert2("中文测试");
        assert2("A中V文c测d试E");
        assert2("\u052CA\u052CBc测\u007F\u07FF");

        System.out.println("pass");
    }

}