Unicode GBK referrals - jeffasdasd - ChinaUnix blog

Unicode GBK referrals

Category: Embedded

2014-09-01 12:48:15

Source: http://www.programdevelop.com/3829432/ (<< Unicode GBK referrals >>)
Tags: encoding, c

GB2312

The rule: a byte of 127 or below keeps its original ASCII meaning, but two bytes above 127 taken together represent one character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) from 0xA1 to 0xFE. This leaves room for about 7,000 simplified Chinese characters. Mathematical symbols, Roman and Greek letters, and Japanese kana were encoded as well, and even the digits, punctuation marks, and letters already present in ASCII were all re-encoded as two-byte codes. These are the so-called "full-width" characters, while the original characters at 127 or below are called "half-width" characters.

The Chinese found this scheme quite good and named it "GB2312". GB2312 is a Chinese extension of ASCII.

GBK

But there are too many Chinese characters. It soon turned out that many people's names could not be typed with GB2312, including the names of some state leaders. So the code positions that GB2312 had left unused had to be found and put to use.

Even that was not enough, so the requirement that the low byte be a code above 127 was simply dropped: as long as the first byte is greater than 127, it always marks the start of a Chinese character, whether or not what follows belongs to the extended character set. The resulting extended encoding scheme is known as the "GBK" standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.
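The "first byte greater than 127 starts a two-byte character" rule is enough to walk a GBK byte buffer one character at a time. A minimal sketch, assuming a hypothetical helper name `gbk_count_chars` and no validation of the second byte:

```c
#include <assert.h>
#include <stddef.h>

/* Counts characters in a GBK byte buffer using the rule from the text:
 * a byte below 0x80 is one single-byte character; a byte of 0x80 or
 * above starts a two-byte character. Returns (size_t)-1 if the buffer
 * ends in the middle of a two-byte character. The second byte is not
 * validated. */
size_t gbk_count_chars(const unsigned char *s, size_t n)
{
    size_t i = 0, count = 0;

    assert(s != NULL);
    while (i < n) {
        if (s[i] < 0x80) {
            i += 1;
        } else {
            if (i + 1 >= n)
                return (size_t)-1;   /* truncated two-byte character */
            i += 2;
        }
        count++;
    }
    return count;
}
```

Note that the second byte of a GBK character may itself be below 0x80 (0x8140 ends in 0x40, the ASCII code of '@'), which is why a buffer has to be scanned from the start rather than from an arbitrary offset.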

Note: for the basics of Unicode, see the previous blog post.

Converting between Unicode and GBK

3.1 GBK -> Unicode

Unicode and GBK are two entirely different character encoding schemes with no direct relationship between them. The most direct and efficient way to convert between them is a lookup table.

GBK and Unicode mapping tables can be downloaded from the Internet:

Once downloaded, the mapping table can be represented as a two-dimensional array: tab_GBK_to_UCS2[i][0] holds the GBK code and tab_GBK_to_UCS2[i][1] holds the corresponding Unicode value.

    [Cpp]

    static const unsigned short tab_GBK_to_UCS2[][2] =
    {
        /* GBK     Unicode     character */
        {0x8140, 0x4E02}, // 丂
        {0x8141, 0x4E04}, // 丄
        {0x8142, 0x4E05}, // 丅
        {0x8143, 0x4E06}, // 丆
        {0x8144, 0x4E0F}, // 丏
        ... ...
        {0x817F, 0x0001}, // XXXXX
        ... ...
    };

There is a problem, though: GBK codes are not contiguous, and some code values are meaningless, such as 0x817F. To make direct array indexing possible, these values are put into the array as well, paired with a Unicode value that conflicts with nothing else, here 0x0001. That way, for any GBK code value, the corresponding Unicode value can be found directly by array index.

Initially I also considered using a map for the GBK to Unicode conversion. The choice comes down to speed and space. On speed, the array needs no discussion; a tree map takes O(log2 n); a hash map with a well-chosen hash function can reach constant time. On space, an array is ideal if the data is contiguous. Since it is not, some space has to be wasted to make it contiguous; by my estimate the space utilization is about 69%. With a map, every node needs some extra space, so the estimated utilization is also only about 67%.
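For comparison, the O(log2 n) option can also be had without a tree: keep only the assigned codes in an array sorted by GBK code and search it with the standard `bsearch`. This is a different technique from the dense array the article settles on, shown here only as a sketch. The five rows are taken from the table above, and `sparse_GBK_to_UCS2`, `cmp_gbk`, and `sparse_lookup` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sparse table: assigned codes only, sorted by GBK code.
 * A real table would hold every assigned GBK code. */
static const unsigned short sparse_GBK_to_UCS2[][2] = {
    {0x8140, 0x4E02},
    {0x8141, 0x4E04},
    {0x8142, 0x4E05},
    {0x8143, 0x4E06},
    {0x8144, 0x4E0F},
};

#define NUMOF_SPARSE \
    (sizeof(sparse_GBK_to_UCS2) / sizeof(sparse_GBK_to_UCS2[0]))

/* bsearch comparison: key points to a GBK code, elem to one table row. */
static int cmp_gbk(const void *key, const void *elem)
{
    unsigned short k = *(const unsigned short *)key;
    unsigned short e = ((const unsigned short *)elem)[0];
    return (k > e) - (k < e);
}

/* Binary search for gbk. Returns 1 and stores the Unicode value in *ucs
 * on success; returns 0 if the code is not in the table. */
int sparse_lookup(unsigned short gbk, unsigned long *ucs)
{
    const unsigned short *row;

    assert(ucs != NULL);
    row = bsearch(&gbk, sparse_GBK_to_UCS2, NUMOF_SPARSE,
                  sizeof(sparse_GBK_to_UCS2[0]), cmp_gbk);
    if (row == NULL)
        return 0;
    *ucs = row[1];
    return 1;
}
```

The sparse table wastes no rows on placeholders, at the cost of a logarithmic search instead of a single index operation.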

The following function converts one character from GBK encoding to Unicode (UCS-2 and UCS-4) encoding.

    [Cpp]

    #include <assert.h>
    #include <stddef.h>

    /*****************************************************************************
     * Converts one character from GBK encoding to Unicode (UCS-2 and UCS-4).
     *
     * Parameters:
     *    gbk         GBK code value of the character
     *    ucs         points to the output buffer that receives the Unicode
     *                code value; its type is unsigned long.
     *
     * Returns:
     *    1. On success, the number of bytes the character occupies in GBK:
     *         1 for an ASCII character, 2 for a non-ASCII (Chinese) character.
     *    2. On failure, 0.
     *
     * Notes:
     *     1. Both GBK and Unicode are subject to byte order;
     *        byte order is either big-endian or little-endian;
     *        Intel processors are little-endian, which is what this code
     *        assumes (low byte stored at the low address).
     *     2. NUMOF_TAB_GBK_TO_UCS2 is the number of rows in tab_GBK_to_UCS2;
     *        its definition is not shown here.
     ****************************************************************************/
    int enc_GBK_to_unicode_one(unsigned short gbk,
            unsigned long *ucs)
    {
        assert(ucs != NULL);

        /* On a little-endian host the high byte of gbk is at p + 1. */
        unsigned char *p = (unsigned char *) &gbk;
        unsigned char *phibyte = p + 1;

        if ( *phibyte < 0x80 )
        {
            /* The high byte is ASCII: a single-byte character. */
            *ucs = *phibyte;
            return 1;
        }
        else
        {
            /* Reject codes outside the range covered by the table. */
            if ( gbk < tab_GBK_to_UCS2[0][0] ||
                    gbk > tab_GBK_to_UCS2[NUMOF_TAB_GBK_TO_UCS2 - 1][0] )
            {
                return 0;
            }

            /* The table is dense, so the offset from the first code is the
             * index. Unassigned codes such as 0x817F yield 0x0001. */
            *ucs = tab_GBK_to_UCS2[gbk - tab_GBK_to_UCS2[0][0]][1];
        }

        return 2;
    }
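Because enc_GBK_to_unicode_one reads the first byte of the character from the high 8 bits of `gbk` and returns the number of bytes consumed, it can drive a loop over a byte buffer. The sketch below shows one way to do that. It is self-contained, so it uses a toy five-row table (`toy_tab`, copied from the first rows of the table above) and its own variant of the converter (`toy_GBK_to_unicode_one`). That variant takes the high byte with a shift, so host byte order does not matter, and it treats the 0x0001 placeholder as a failure. All `toy_` names and `GBK_PLACEHOLDER` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy dense table: the first five rows of the article's table. */
static const unsigned short toy_tab[][2] = {
    {0x8140, 0x4E02}, {0x8141, 0x4E04}, {0x8142, 0x4E05},
    {0x8143, 0x4E06}, {0x8144, 0x4E0F},
};
#define NUMOF_TOY (sizeof(toy_tab) / sizeof(toy_tab[0]))
#define GBK_PLACEHOLDER 0x0001  /* marks unassigned codes in a dense table */

/* Same contract as enc_GBK_to_unicode_one, but byte-order independent,
 * and placeholder entries are reported as failure (none occur in the
 * toy table; the check matters for a full table). */
static int toy_GBK_to_unicode_one(unsigned short gbk, unsigned long *ucs)
{
    unsigned char hibyte = (unsigned char)(gbk >> 8);

    assert(ucs != NULL);
    if (hibyte < 0x80) {
        *ucs = hibyte;
        return 1;
    }
    if (gbk < toy_tab[0][0] || gbk > toy_tab[NUMOF_TOY - 1][0])
        return 0;
    *ucs = toy_tab[gbk - toy_tab[0][0]][1];
    return (*ucs == GBK_PLACEHOLDER) ? 0 : 2;
}

/* Converts a GBK byte buffer to code points. Returns the number of code
 * points written to out, or (size_t)-1 on an invalid or truncated
 * sequence or when out is too small. */
size_t toy_GBK_to_unicode_str(const unsigned char *s, size_t n,
                              unsigned long *out, size_t out_cap)
{
    size_t i = 0, count = 0;

    assert(s != NULL && out != NULL);
    while (i < n) {
        /* First byte goes into the high 8 bits, second (if any) low. */
        unsigned short gbk = (unsigned short)(s[i] << 8);
        int used;

        if (i + 1 < n)
            gbk |= s[i + 1];
        if (count >= out_cap)
            return (size_t)-1;
        used = toy_GBK_to_unicode_one(gbk, &out[count]);
        if (used == 0 || i + (size_t)used > n)
            return (size_t)-1;
        i += (size_t)used;
        count++;
    }
    return count;
}
```

With the bytes 0x41 0x81 0x44 as input, the loop produces two code points: 0x41 for the ASCII letter and 0x4E0F for the two-byte character 0x8144.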

3.2 Unicode -> GBK

The array-based table structure above could also be used to convert Unicode to GBK, but the Unicode values that correspond to GBK codes span too wide a range. That would waste a great deal of space, with a space utilization of only about 30%. The only practical option is a map.

Implementing it as a hash map is a good choice.
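As a self-contained illustration of the hash-map idea, here is a minimal sketch of an open-addressing table keyed by Unicode value and built by inverting a GBK to Unicode table. It does not use the Table_* API that the article's own code relies on. The names `toy_fwd` (five rows taken from the GBK to Unicode table above), `SLOTS`, `hash_ucs`, `inv_init`, and `toy_unicode_to_GBK_one` are all hypothetical, and the hash function is an arbitrary simple choice.

```c
#include <assert.h>
#include <stddef.h>

/* Toy forward table: the first five rows of the article's table. */
static const unsigned short toy_fwd[][2] = {
    {0x8140, 0x4E02}, {0x8141, 0x4E04}, {0x8142, 0x4E05},
    {0x8143, 0x4E06}, {0x8144, 0x4E0F},
};
#define NUMOF_FWD (sizeof(toy_fwd) / sizeof(toy_fwd[0]))
#define SLOTS 16u  /* power of two, larger than the number of entries */

/* One slot: key is the Unicode value, value is the GBK code.
 * key == 0 marks an empty slot (U+0000 is never stored). */
struct slot { unsigned short key, value; };
static struct slot inv_map[SLOTS];
static int inv_ready = 0;

static unsigned hash_ucs(unsigned long ucs)
{
    return (unsigned)(ucs ^ (ucs >> 4)) & (SLOTS - 1);
}

/* Fills inv_map from toy_fwd, skipping 0x0001 placeholder rows. */
static void inv_init(void)
{
    size_t i;

    for (i = 0; i < NUMOF_FWD; i++) {
        unsigned h;

        if (toy_fwd[i][1] == 0x0001)
            continue;
        h = hash_ucs(toy_fwd[i][1]);
        while (inv_map[h].key != 0)        /* linear probing */
            h = (h + 1) & (SLOTS - 1);
        inv_map[h].key = toy_fwd[i][1];
        inv_map[h].value = toy_fwd[i][0];
    }
    inv_ready = 1;
}

/* Returns 1 for ASCII, 2 for a two-byte GBK code, 0 if ucs has no
 * mapping in the toy table. */
int toy_unicode_to_GBK_one(unsigned long ucs, unsigned short *gbk)
{
    unsigned h;

    assert(gbk != NULL);
    if (ucs < 0x80) {
        *gbk = (unsigned short)ucs;
        return 1;
    }
    if (ucs > 0xFFFF)
        return 0;                          /* table holds UCS-2 values only */
    if (!inv_ready)
        inv_init();
    h = hash_ucs(ucs);
    while (inv_map[h].key != 0) {
        if (inv_map[h].key == ucs) {
            *gbk = inv_map[h].value;
            return 2;
        }
        h = (h + 1) & (SLOTS - 1);
    }
    return 0;
}
```

Looking up 0x4E0F yields 0x8144, while a value with no entry, such as 0x4E03, reaches an empty slot and returns 0. The table is kept well under full so that probing always terminates at an empty slot.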

    [Cpp]

    #include <assert.h>
    #include <stddef.h>

    /* tab_UCS2_to_GBK, the Table_* functions and constants, and the
     * enc_stc_unicode_to_GBK_cmp / enc_stc_unicode_to_GBK_hash callbacks
     * are declared elsewhere; their definitions are not shown here. */

    /*==========================================================================*
     * @Description:
     *    Initializes tab_UCS2_to_GBK, the mapping table from Unicode (key)
     *    to GBK (value).
     *
     * @Returns:
     *    1 on success;
     *    0 on failure.
     *
     *==========================================================================*/
    static int enc_stc_unicode_to_GBK_init(void)
    {
        assert(tab_UCS2_to_GBK == NULL);

        int i;
        void *ret;

        tab_UCS2_to_GBK = Table_new(21791, enc_stc_unicode_to_GBK_cmp,
                enc_stc_unicode_to_GBK_hash);
        if ( tab_UCS2_to_GBK == TABLE_ERROR )
            return 0;

        for ( i = 0; i < NUMOF_TAB_GBK_TO_UCS2; i++ )
        {
            /* Skip placeholder rows for unassigned GBK codes. */
            if ( tab_GBK_to_UCS2[i][1] == 0x0001 )
                continue;

            unsigned long k = (unsigned long)tab_GBK_to_UCS2[i][1];
            unsigned long v = (unsigned long)tab_GBK_to_UCS2[i][0];
            ret = Table_put(tab_UCS2_to_GBK, (void*)k, (void*)v);
            if ( ret != TABLE_OK )
                return 0;
        }

        return 1;
    }

    /*****************************************************************************
     * Converts one character from Unicode (UCS-2 and UCS-4) to GBK encoding.
     *
     * Parameters:
     *    ucs         Unicode code value of the character
     *    gbk         points to the output buffer that receives the GBK
     *                code value
     *
     * Returns:
     *    1. On success, the number of bytes the character occupies in GBK:
     *         1 for an ASCII character, 2 for a non-ASCII (Chinese) character.
     *    2. On failure, 0.
     *
     * Notes:
     *     1. Both GBK and Unicode are subject to byte order;
     *        byte order is either big-endian or little-endian;
     *        Intel processors are little-endian, which is what this code
     *        assumes (low byte stored at the low address).
     ****************************************************************************/
    int enc_unicode_to_GBK_one(unsigned long ucs, unsigned short *gbk)
    {
        assert(gbk != NULL);

        if ( ucs < 0x80 )
        {
            *gbk = (unsigned short)ucs;
            return 1;
        }

        /* Build the reverse table on first use. */
        if ( tab_UCS2_to_GBK == NULL )
            if ( enc_stc_unicode_to_GBK_init() == 0 )
                return 0;

        void *pvalue;
        pvalue = Table_get(tab_UCS2_to_GBK, (void*)ucs);
        if ( pvalue == TABLE_NO_KEY )
            return 0;

        /* The GBK code was stored as a pointer-sized integer. */
        *gbk = (unsigned short)(unsigned long)pvalue;

        return 2;
    }
