http://www.programdevelop.com/3829432/
Converting Between Unicode and GBK
Tags: encoding, c
GB2312
The rule: a byte no larger than 127 keeps its original ASCII meaning, but two bytes larger than 127 in a row together represent one character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) runs from 0xA1 to 0xFE. This leaves room for roughly 7,000 simplified Chinese characters. Mathematical symbols, Roman and Greek letters, and Japanese kana were encoded as well, and even the digits, punctuation, and letters that ASCII already had were all given new two-byte codes. These are the so-called "full-width" characters; the original ones below 128 are called "half-width" characters.
The Chinese found this scheme quite good and called it "GB2312". GB2312 is a Chinese extension of ASCII.
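The byte ranges above can be turned into a small validity check. This is only a sketch: the function name gb2312_char_len and its return convention (1 for ASCII, 2 for a two-byte code, 0 for anything else) are assumptions made for illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the byte length of the character starting at s[0] under the
 * GB2312 rule described above: 1 for an ASCII byte, 2 for a valid
 * two-byte code, 0 if the bytes do not form a valid character. */
int gb2312_char_len(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                       /* plain ASCII, half-width */
    if (n < 2)
        return 0;                       /* truncated two-byte code */
    if (s[0] >= 0xA1 && s[0] <= 0xF7 &&
        s[1] >= 0xA1 && s[1] <= 0xFE)
        return 2;                       /* high byte, then low byte */
    return 0;
}
```

For example, the byte pair 0xB0 0xA1 falls inside both ranges and is accepted, while 0x81 0x40 is rejected because its bytes lie outside the GB2312 ranges.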
GBK
But there are a great many Chinese characters, and it soon turned out that many personal names could not be typed with this set, which is especially troublesome when the name belongs to a state leader. So the code points that GB2312 had left unused were pressed into service.
That was still not enough, so the requirement that the low byte be larger than 127 was simply dropped: a first byte larger than 127 always marks the start of a Chinese character, whether or not the byte that follows lies in the extended range. The resulting encoding scheme is the "GBK" standard. GBK contains everything in GB2312 and adds nearly 20,000 new characters (including traditional Chinese characters) and symbols.
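The "first byte decides" rule makes it easy to walk a GBK byte string one character at a time. The sketch below counts characters that way. The function name gbk_count_chars is hypothetical, and the exact byte ranges it checks (lead byte 0x81 to 0xFE, trail byte 0x40 to 0xFE except 0x7F) are the commonly documented GBK ranges, added here as an assumption rather than taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the characters in the n bytes at s, treating any byte below
 * 0x80 as one ASCII character and any other byte as the lead byte of a
 * two-byte GBK code. Returns -1 on a malformed or truncated code. */
long gbk_count_chars(const unsigned char *s, size_t n)
{
    size_t i = 0;
    long count = 0;

    assert(s != NULL);
    while (i < n) {
        if (s[i] < 0x80) {
            i += 1;                     /* single-byte ASCII */
        } else {
            /* Lead byte 0x81-0xFE, and a trail byte must follow. */
            if (s[i] < 0x81 || s[i] > 0xFE || i + 1 >= n)
                return -1;
            /* Trail byte 0x40-0xFE, except 0x7F. */
            if (s[i + 1] < 0x40 || s[i + 1] == 0x7F || s[i + 1] > 0xFE)
                return -1;
            i += 2;
        }
        count++;
    }
    return count;
}
```

Note that the trail byte may now be below 0x80, so a byte such as 0x40 can be either the ASCII character '@' or the second half of a Chinese character, depending on what precedes it.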
Note: for background on Unicode, see the previous blog post.
Converting between Unicode and GBK
3.1 GBK -> Unicode
Unicode and GBK are two completely different character encoding schemes with no direct relationship between them, so the most direct and efficient way to convert between them is a lookup table.
GBK-to-Unicode mapping tables can be downloaded from the Internet.
Once the mapping table is downloaded, it can be represented as a two-dimensional array in which tab_GBK_to_UCS2[i][0] is the GBK code and tab_GBK_to_UCS2[i][1] is the corresponding Unicode value.
static const unsigned short tab_GBK_to_UCS2[][2] =
{
    {0x8140, 0x4E02},
    {0x8141, 0x4E04},
    {0x8142, 0x4E05},
    {0x8143, 0x4E06},
    {0x8144, 0x4E0F},
    ... ...
    {0x817F, 0x0001},
    ... ...
};
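As a rough illustration of how such a table is read, the sketch below uses only the first five rows of the table above. The names demo_tab, DEMO_NUMOF, and demo_lookup are made up for this example. Because the rows are in code order with one row per code value, the offset from the first code is the row index.

```c
#include <assert.h>
#include <stddef.h>

/* First five rows of the table above; a real table continues through
 * every code value up to the last GBK code. */
static const unsigned short demo_tab[][2] = {
    {0x8140, 0x4E02},
    {0x8141, 0x4E04},
    {0x8142, 0x4E05},
    {0x8143, 0x4E06},
    {0x8144, 0x4E0F},
};
#define DEMO_NUMOF (sizeof demo_tab / sizeof demo_tab[0])

/* Stores the Unicode value for gbk in *ucs and returns 1, or returns 0
 * if gbk lies outside the range covered by the table. */
int demo_lookup(unsigned short gbk, unsigned short *ucs)
{
    assert(ucs != NULL);
    if (gbk < demo_tab[0][0] || gbk > demo_tab[DEMO_NUMOF - 1][0])
        return 0;
    /* One row per code value, so the offset is the array index. */
    *ucs = demo_tab[gbk - demo_tab[0][0]][1];
    return 1;
}
```

The lookup costs one subtraction and one array access, which is why direct indexing is attractive here.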
But there is a problem: GBK codes are not contiguous, and some code values are meaningless, such as 0x817F.
So that the array can still be indexed directly, these values are put into the array as well, with a Unicode value that cannot collide with any real mapping; here that value is 0x0001. For any GBK code value we can then find the corresponding Unicode value directly by array index. At first I also considered using a map for the GBK to Unicode conversion. The only considerations are space and speed. For speed, the array needs no comment; a tree map gives log2(n) lookups; a hash map with a well-chosen hash function can reach constant time. For space, contiguous data would be ideal, but since the data is not contiguous, making it contiguous wastes some room: by my estimate the space utilization is about 69%. With a map, each node needs some extra space, so the estimated utilization comes out similar, about 67%.
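For comparison, here is a minimal sketch of the log2(n) alternative mentioned above, using the standard bsearch function on a sorted table that simply omits meaningless codes instead of storing placeholders. The names sparse_tab, cmp_gbk, and sparse_lookup are hypothetical; the first five rows come from the table above, and the last row (0xB0A1, U+554A) is added only to show a gap in the code values.

```c
#include <assert.h>
#include <stdlib.h>

/* Sorted by GBK code; meaningless codes are simply absent. */
static const unsigned short sparse_tab[][2] = {
    {0x8140, 0x4E02},
    {0x8141, 0x4E04},
    {0x8142, 0x4E05},
    {0x8143, 0x4E06},
    {0x8144, 0x4E0F},
    {0xB0A1, 0x554A},
};

/* Compares the searched GBK code with the first column of one row. */
static int cmp_gbk(const void *key, const void *elem)
{
    unsigned short k = *(const unsigned short *)key;
    unsigned short e = ((const unsigned short *)elem)[0];
    return (k > e) - (k < e);
}

/* Stores the Unicode value for gbk in *ucs and returns 1, or returns 0
 * if gbk has no row in the table. */
int sparse_lookup(unsigned short gbk, unsigned short *ucs)
{
    const unsigned short *hit;

    assert(ucs != NULL);
    hit = bsearch(&gbk, sparse_tab,
                  sizeof sparse_tab / sizeof sparse_tab[0],
                  sizeof sparse_tab[0], cmp_gbk);
    if (hit == NULL)
        return 0;
    *ucs = hit[1];
    return 1;
}
```

This trades the constant-time index for a binary search, but the table holds no wasted rows.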
The following function converts one character from GBK encoding to Unicode (UCS-2 and UCS-4) encoding.
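#include <assert.h>
#include <stddef.h>

/* Converts one GBK character to Unicode.
   gbk holds the first input byte in its high 8 bits and the next input
   byte in its low 8 bits. Returns the number of input bytes consumed
   (1 or 2), or 0 if the code cannot be converted. */
int enc_GBK_to_unicode_one(unsigned short gbk,
                           unsigned long *ucs)
{
    /* Extract the high byte with a shift rather than a pointer cast, so
       the result does not depend on the machine's byte order. */
    unsigned char hibyte = (unsigned char) (gbk >> 8);

    assert(ucs != NULL);

    if ( hibyte < 0x80 )
    {
        /* Single-byte ASCII: the Unicode value equals the byte value. */
        *ucs = hibyte;
        return 1;
    }
    else
    {
        /* Reject codes outside the range covered by the table. */
        if ( gbk < tab_GBK_to_UCS2[0][0] ||
             gbk > tab_GBK_to_UCS2[NUMOF_TAB_GBK_TO_UCS2 - 1][0] )
        {
            return 0;
        }

        /* One row per code value, so the offset is the row index. */
        *ucs = tab_GBK_to_UCS2[gbk - tab_GBK_to_UCS2[0][0]][1];

        /* 0x0001 marks a meaningless GBK code such as 0x817F. */
        if ( *ucs == 0x0001 )
        {
            return 0;
        }
    }

    return 2;
}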
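Because the function reports how many input bytes it consumed, a caller can walk a whole byte string with it. The sketch below shows one way this might look. To stay self-contained it uses a five-row excerpt of the table and a local copy of the conversion logic; the names tab_excerpt, excerpt_to_unicode_one, and decode_gbk are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative five-row excerpt; the real table covers all of GBK. */
static const unsigned short tab_excerpt[][2] = {
    {0x8140, 0x4E02},
    {0x8141, 0x4E04},
    {0x8142, 0x4E05},
    {0x8143, 0x4E06},
    {0x8144, 0x4E0F},
};
#define NUMOF_EXCERPT (sizeof tab_excerpt / sizeof tab_excerpt[0])

/* Same contract as enc_GBK_to_unicode_one: returns the number of bytes
 * consumed (1 or 2), or 0 if the code cannot be converted. */
static int excerpt_to_unicode_one(unsigned short gbk, unsigned long *ucs)
{
    unsigned char hibyte = (unsigned char)(gbk >> 8);

    assert(ucs != NULL);
    if (hibyte < 0x80) {
        *ucs = hibyte;
        return 1;
    }
    if (gbk < tab_excerpt[0][0] || gbk > tab_excerpt[NUMOF_EXCERPT - 1][0])
        return 0;
    *ucs = tab_excerpt[gbk - tab_excerpt[0][0]][1];
    return 2;
}

/* Decodes up to cap code points from the n bytes at s into out and
 * returns how many were written. Stops early at the first byte
 * sequence that cannot be converted. */
size_t decode_gbk(const unsigned char *s, size_t n,
                  unsigned long *out, size_t cap)
{
    size_t i = 0, count = 0;

    assert(s != NULL && out != NULL);
    while (i < n && count < cap) {
        /* The current byte goes in the high 8 bits, the next byte
         * (if there is one) in the low 8 bits. */
        unsigned short pair = (unsigned short)(s[i] << 8);
        int used;

        if (i + 1 < n)
            pair |= s[i + 1];
        used = excerpt_to_unicode_one(pair, &out[count]);
        if (used == 0)
            break;
        i += (size_t)used;
        count++;
    }
    return count;
}
```

The caller always passes two bytes; for an ASCII character the function ignores the second one and reports that only one byte was used.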
3.2 Unicode -> GBK
To convert Unicode to GBK you could reuse the array-table structure above, but the Unicode values that correspond to GBK codes are spread over too wide a range, so an array would waste a great deal of space: only about 30% of it would be used. That leaves little choice but to use a map.
A hash map is a good choice.
/* Builds the Unicode -> GBK hash table from the GBK -> Unicode array.
   Returns 1 on success, 0 on failure. */
static int enc_stc_unicode_to_GBK_init()
{
    assert(tab_UCS2_to_GBK == NULL);

    int i;
    void *ret;

    tab_UCS2_to_GBK = Table_new(21791, enc_stc_unicode_to_GBK_cmp,
                                enc_stc_unicode_to_GBK_hash);
    if (tab_UCS2_to_GBK == TABLE_ERROR)
        return 0;

    for (i = 0; i < NUMOF_TAB_GBK_TO_UCS2; i++)
    {
        /* Skip the placeholder rows for meaningless GBK codes. */
        if (tab_GBK_to_UCS2[i][1] == 0x0001)
            continue;

        /* Key is the Unicode value, value is the GBK code. */
        unsigned long k = (unsigned long) tab_GBK_to_UCS2[i][1];
        unsigned long v = (unsigned long) tab_GBK_to_UCS2[i][0];
        ret = Table_put(tab_UCS2_to_GBK, (void *) k, (void *) v);
        if (ret != TABLE_OK)
            return 0;
    }

    return 1;
}

/* Converts one Unicode character to GBK.
   Returns the number of bytes in the result (1 or 2), or 0 if the
   character has no GBK code. */
int enc_unicode_to_GBK_one(unsigned long ucs, unsigned short *gbk)
{
    assert(gbk != NULL);

    if (ucs < 0x80)
    {
        *gbk = (unsigned short) ucs;
        return 1;
    }

    /* Build the hash table on first use. */
    if (tab_UCS2_to_GBK == NULL)
        if (enc_stc_unicode_to_GBK_init() == 0)
            return 0;

    void *pvalue;

    pvalue = Table_get(tab_UCS2_to_GBK, (void *) ucs);
    if (pvalue == TABLE_NO_KEY)
        return 0;

    *gbk = (unsigned short) (unsigned long) pvalue;

    return 2;
}
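The Table_new, Table_put, and Table_get calls above come from a hash table library that is not shown here. As a self-contained illustration of the same idea, the sketch below builds the reverse mapping with a small open-addressing hash table. Everything in it (fwd_tab, REV_SLOTS, rev_hash, rev_init, rev_unicode_to_gbk_one, and the choice of linear probing) is an assumption made for this example, and the five-row forward table stands in for the full one.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative five-row excerpt of the GBK -> Unicode table. */
static const unsigned short fwd_tab[][2] = {
    {0x8140, 0x4E02},
    {0x8141, 0x4E04},
    {0x8142, 0x4E05},
    {0x8143, 0x4E06},
    {0x8144, 0x4E0F},
};
#define NUMOF_FWD (sizeof fwd_tab / sizeof fwd_tab[0])

/* Open-addressing table keyed by Unicode value. The slot count is a
 * power of two larger than the number of entries, so probing always
 * reaches an empty slot. A slot with ucs == 0 is empty. */
#define REV_SLOTS 16
static struct { unsigned short ucs, gbk; } rev_tab[REV_SLOTS];
static int rev_ready;

static size_t rev_hash(unsigned short ucs)
{
    return (size_t)(ucs * 31u) & (REV_SLOTS - 1);
}

static void rev_init(void)
{
    size_t i, slot;

    for (i = 0; i < NUMOF_FWD; i++) {
        if (fwd_tab[i][1] == 0x0001)
            continue;                   /* placeholder row, no mapping */
        slot = rev_hash(fwd_tab[i][1]);
        while (rev_tab[slot].ucs != 0)
            slot = (slot + 1) & (REV_SLOTS - 1);    /* linear probing */
        rev_tab[slot].ucs = fwd_tab[i][1];
        rev_tab[slot].gbk = fwd_tab[i][0];
    }
    rev_ready = 1;
}

/* Same contract as enc_unicode_to_GBK_one: returns 1 for ASCII, 2 for
 * a two-byte GBK code, 0 if ucs has no mapping in the table. */
int rev_unicode_to_gbk_one(unsigned long ucs, unsigned short *gbk)
{
    size_t slot;

    assert(gbk != NULL);
    if (ucs < 0x80) {
        *gbk = (unsigned short)ucs;
        return 1;
    }
    if (ucs > 0xFFFF)
        return 0;                       /* table holds UCS-2 values only */
    if (!rev_ready)
        rev_init();                     /* build the table on first use */
    slot = rev_hash((unsigned short)ucs);
    while (rev_tab[slot].ucs != 0) {
        if (rev_tab[slot].ucs == ucs) {
            *gbk = rev_tab[slot].gbk;
            return 2;
        }
        slot = (slot + 1) & (REV_SLOTS - 1);
    }
    return 0;
}
```

A real table would need far more slots than this one; the 21791 passed to Table_new above is the kind of size hint such a table would use.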