Category: Embedded

2014-09-01 12:48:15

http://www.programdevelop.com/3829432/
Converting Between Unicode and GBK
Tags: encoding, c 

GB2312

The rule: a byte below 128 keeps its original ASCII meaning, but two consecutive bytes greater than 127 together represent a single character. The first byte (the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) runs from 0xA1 to 0xFE. These combinations give room for about 7,000 simplified Chinese characters. Mathematical symbols, Roman and Greek letters, and Japanese kana were encoded as well. Even the digits, punctuation, and letters that already exist in ASCII were re-encoded as two-byte codes. These are the so-called "full-width" characters, while the originals below 128 are called "half-width" characters.

The Chinese found this scheme quite good and named it "GB2312". GB2312 is a Chinese extension of ASCII.
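The byte ranges above can be checked mechanically. The following is a minimal sketch, not part of the original article: the helper names `is_halfwidth` and `is_gb2312_pair` are hypothetical, and the check only tests the byte ranges, not whether a given code is actually assigned to a character.

```c
#include <assert.h>

/* Hypothetical helper: a byte below 0x80 is a single-byte
 * ("half-width") ASCII character. */
static int is_halfwidth(unsigned char b)
{
    return b < 0x80;
}

/* Hypothetical helper: returns 1 if (hi, lo) lies in the two-byte
 * GB2312 range described in the text: high byte 0xA1..0xF7 and
 * low byte 0xA1..0xFE. It does not check that the code is assigned. */
static int is_gb2312_pair(unsigned char hi, unsigned char lo)
{
    return hi >= 0xA1 && hi <= 0xF7 && lo >= 0xA1 && lo <= 0xFE;
}
```

For example, the pair 0xB0 0xA1 passes the range check, while 0x81 0x40 does not, because both of its bytes fall outside the GB2312 ranges.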


GBK 

But there are far too many Chinese characters, and it soon turned out that many personal names could not be typed with GB2312, the names of some state leaders among them. So the code points that GB2312 had left unused were simply put to use.

That was still not enough, so the requirement that the low byte be greater than 127 was dropped. As long as the first byte is greater than 127, it always marks the start of a Chinese character, regardless of whether the byte that follows lies in the extended range. The resulting encoding scheme is known as the "GBK" standard. GBK includes everything in GB2312 and adds nearly 20,000 new characters (including traditional characters) and symbols.
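The "first byte greater than 127 starts a two-byte character" rule is what lets a program split a GBK byte stream into characters. Below is a minimal sketch of that rule only. The function names `gbk_char_len` and `gbk_count_chars` are hypothetical, and the sketch applies the simplified rule from the text; the actual GBK standard further restricts the lead byte to 0x81..0xFE and the trail byte to 0x40..0xFE excluding 0x7F.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the character starting at s,
 * using the simplified rule from the text. Returns 0 if the input is
 * empty or a two-byte character is cut off. */
static size_t gbk_char_len(const unsigned char *s, size_t remaining)
{
    assert(s != NULL);
    if (remaining == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                      /* single-byte ASCII */
    return remaining >= 2 ? 2 : 0;     /* lead byte plus any trail byte */
}

/* Hypothetical helper: counts the characters in a GBK byte string,
 * or returns -1 if the string ends in the middle of a character. */
static long gbk_count_chars(const unsigned char *s, size_t len)
{
    long count = 0;
    size_t i = 0;

    while (i < len) {
        size_t n = gbk_char_len(s + i, len - i);
        if (n == 0)
            return -1;
        i += n;
        count++;
    }
    return count;
}
```

Note that the bytes 0x81 0x40 form one character even though 0x40 on its own is the ASCII character '@'. This is why a GBK stream has to be scanned from the start rather than from an arbitrary byte offset.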

Note: for the basics of Unicode, see the previous blog post.

Converting between Unicode and GBK

3.1 GBK -> Unicode 

Unicode and GBK are two entirely different character encoding schemes with no direct relationship between them. The most direct and efficient way to convert between them is a lookup table.

GBK and Unicode mapping tables can be downloaded from the Internet: 


Once the mapping table has been downloaded, it can be represented as a two-dimensional array in which tab_GBK_to_UCS2[i][0] holds the GBK code and tab_GBK_to_UCS2[i][1] holds the corresponding Unicode value.
[Cpp]
// #c---
static const unsigned short tab_GBK_to_UCS2[][2] =
{
    /* GBK     Unicode      character */

    {0x8140, 0x4E02}, // 丂
    {0x8141, 0x4E04}, // 丄
    {0x8142, 0x4E05}, // 丅
    {0x8143, 0x4E06}, // 丆
    {0x8144, 0x4E0F}, // 丏
    ... ...
    {0x817F, 0x0001}, // XXXXX
    ... ...
};
// #c---end



There is one problem, though: GBK codes are not contiguous, and some code values have no meaning, such as 0x817F. To make it possible to index the array directly, these values are stored in the array as well, with the Unicode column set to a value that cannot conflict with a real mapping, here 0x0001. With that in place, the Unicode value for any GBK code can be read straight out of the array by its index.

At first I also considered using a map for the GBK to Unicode conversion. The two things to weigh are efficiency and space. For efficiency, the array needs no further comment. A tree map gives O(log2 n) lookups, and a hash map with a well-chosen hash function can reach constant time. For space, an array is ideal when the data is contiguous. Here the data is not contiguous, so some space has to be wasted to keep the index range contiguous. By my calculation the space utilization is about 69%. With a map, every node needs some extra space of its own, so by my calculation its space utilization is also only about 67%.
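For comparison, here is a minimal sketch of the logarithmic-time alternative mentioned above. It is not the author's approach: instead of a tree map it uses a sorted table that contains only valid rows and searches it with the standard `bsearch`. The names `sparse_tab`, `cmp_gbk`, and `sparse_lookup` are hypothetical, and the table is only a five-row excerpt of the real mapping.

```c
#include <assert.h>
#include <stdlib.h>

/* Excerpt only: the first five GBK -> Unicode rows, sorted by GBK code.
 * A real table would hold every valid row and no placeholder rows. */
static const unsigned short sparse_tab[][2] = {
    {0x8140, 0x4E02},
    {0x8141, 0x4E04},
    {0x8142, 0x4E05},
    {0x8143, 0x4E06},
    {0x8144, 0x4E0F},
};
#define SPARSE_COUNT (sizeof sparse_tab / sizeof sparse_tab[0])

/* Compares a GBK key against the GBK column of one table row. */
static int cmp_gbk(const void *key, const void *row)
{
    unsigned short k = *(const unsigned short *)key;
    unsigned short r = ((const unsigned short *)row)[0];
    return (k > r) - (k < r);
}

/* Hypothetical helper: returns the Unicode value for a GBK code,
 * or 0 if the code is not present in the table. */
static unsigned long sparse_lookup(unsigned short gbk)
{
    const unsigned short *row = (const unsigned short *)
        bsearch(&gbk, sparse_tab, SPARSE_COUNT, sizeof sparse_tab[0], cmp_gbk);
    return row ? row[1] : 0;
}
```

The sparse table needs no 0x0001 placeholder rows, at the cost of a binary search per lookup instead of a single array index.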


The following function converts the GBK encoding of one character to its Unicode (UCS-2 and UCS-4) encoding.

[Cpp]
// #c---
#include <assert.h>
#include <stddef.h>

/*****************************************************************************
 * Convert the GBK encoding of one character to its Unicode (UCS-2 and
 * UCS-4) encoding.
 *
 * Parameters:
 *    gbk         GBK encoding value of the character
 *                (first byte in the high 8 bits)
 *    ucs         pointer to the output buffer; the value stored there is
 *                the Unicode encoding value, of type unsigned long.
 *
 * Returns:
 *    1. On success, the number of bytes the character occupies in GBK:
 *         1 for an ASCII character, 2 for a non-ASCII (Chinese) character.
 *    2. On failure, 0.
 *
 * Notes:
 *     1. Both GBK and Unicode values are subject to byte order.
 *        Byte order is either big-endian or little-endian;
 *        Intel processors are little-endian (low address holds the low byte).
 *     2. NUMOF_TAB_GBK_TO_UCS2 (the number of rows in tab_GBK_to_UCS2)
 *        is assumed to be defined alongside the table.
 ****************************************************************************/
int enc_GBK_to_unicode_one(unsigned short gbk,
        unsigned long *ucs)
{
    assert(ucs != NULL);

    /* Take the high byte with a shift so the result does not depend on
     * the byte order of the host. */
    unsigned char hibyte = (unsigned char)(gbk >> 8);

    if ( hibyte < 0x80 )
    {
        /* ASCII: the character is the high byte itself. */
        *ucs = hibyte;
        return 1;
    }

    if ( gbk < tab_GBK_to_UCS2[0][0] ||
            gbk > tab_GBK_to_UCS2[NUMOF_TAB_GBK_TO_UCS2 - 1][0] )
    {
        return 0;
    }

    unsigned long value = tab_GBK_to_UCS2[gbk - tab_GBK_to_UCS2[0][0]][1];

    /* 0x0001 marks a GBK code that has no meaning. */
    if ( value == 0x0001 )
    {
        return 0;
    }

    *ucs = value;
    return 2;
}

// #c---end
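The function above handles a single character, so a caller has to walk the byte string and advance by the returned length. The sketch below shows one way to do that, under stated assumptions: `demo_gbk_to_unicode_one` is a hypothetical stand-in with the same contract as `enc_GBK_to_unicode_one`, backed by a five-row excerpt of the table so that the block is self-contained, and the caller packs the first byte of each character into the high 8 bits of the argument.

```c
#include <assert.h>
#include <stddef.h>

/* Excerpt only: the first five rows of the real table (no gaps). */
static const unsigned short demo_tab[][2] = {
    {0x8140, 0x4E02}, {0x8141, 0x4E04}, {0x8142, 0x4E05},
    {0x8143, 0x4E06}, {0x8144, 0x4E0F},
};
#define DEMO_COUNT (sizeof demo_tab / sizeof demo_tab[0])

/* Hypothetical stand-in for enc_GBK_to_unicode_one (same contract). */
static int demo_gbk_to_unicode_one(unsigned short gbk, unsigned long *ucs)
{
    assert(ucs != NULL);
    unsigned char hi = (unsigned char)(gbk >> 8);

    if (hi < 0x80) {
        *ucs = hi;
        return 1;
    }
    if (gbk < demo_tab[0][0] || gbk > demo_tab[DEMO_COUNT - 1][0])
        return 0;
    *ucs = demo_tab[gbk - demo_tab[0][0]][1];
    return 2;
}

/* Hypothetical helper: decodes a GBK byte string into out[].
 * Returns the number of code points written, or -1 on error. */
static long demo_decode(const unsigned char *s, size_t len,
                        unsigned long *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    while (i < len) {
        /* First byte goes into the high 8 bits, next byte (if any)
         * into the low 8 bits. */
        unsigned short packed = (unsigned short)(s[i] << 8);
        if (i + 1 < len)
            packed = (unsigned short)(packed | s[i + 1]);

        unsigned long ucs;
        int used = demo_gbk_to_unicode_one(packed, &ucs);
        if (used == 0 || i + (size_t)used > len || n >= out_cap)
            return -1;
        out[n++] = ucs;
        i += (size_t)used;
    }
    return (long)n;
}
```

The return value of the one-character function doubles as the step size, which is why it reports 1 for ASCII and 2 for a two-byte character.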


3.2 Unicode -> GBK 

The array-table structure above could also be used to convert Unicode to GBK, but the Unicode values that correspond to GBK codes are spread over too wide a range. The array would waste a great deal of space, with a space utilization of only about 30%. That leaves a map as the only practical option.

A hash map is a good choice.

[Cpp]
// #c---
/* Table_new, Table_put, Table_get, the TABLE_* constants, the cmp/hash
 * callbacks and tab_UCS2_to_GBK come from the author's table library and
 * surrounding code, which are not shown here. */

/*==========================================================================*
 * @Description:
 *      Initialize the mapping table tab_UCS2_to_GBK, which maps
 *      Unicode (key) to GBK (value).
 *
 * @Returns:
 *      On success, returns 1;
 *      on failure, returns 0.
 *
 *==========================================================================*/
static int enc_stc_unicode_to_GBK_init(void)
{
    assert(tab_UCS2_to_GBK == NULL);

    int i;
    void *ret;

    tab_UCS2_to_GBK = Table_new(21791, enc_stc_unicode_to_GBK_cmp,
            enc_stc_unicode_to_GBK_hash);
    if ( tab_UCS2_to_GBK == TABLE_ERROR )
        return 0;

    for ( i = 0; i < NUMOF_TAB_GBK_TO_UCS2; i++ )
    {
        /* Skip placeholder rows for GBK codes that have no meaning. */
        if ( tab_GBK_to_UCS2[i][1] == 0x0001 )
            continue;

        unsigned long k = (unsigned long)tab_GBK_to_UCS2[i][1];
        unsigned long v = (unsigned long)tab_GBK_to_UCS2[i][0];
        ret = Table_put(tab_UCS2_to_GBK, (void*)k, (void*)v);
        if ( ret != TABLE_OK )
            return 0;
    }

    return 1;
}

/*****************************************************************************
 * Convert the Unicode (UCS-2 and UCS-4) encoding of one character to its
 * GBK encoding.
 *
 * Parameters:
 *    ucs         Unicode encoding value of the character
 *    gbk         pointer to the output buffer that receives the GBK
 *                encoding value
 *
 * Returns:
 *    1. On success, the number of bytes the character occupies in GBK:
 *         1 for an ASCII character, 2 for a non-ASCII (Chinese) character.
 *    2. On failure, 0.
 *
 * Notes:
 *     1. Both GBK and Unicode values are subject to byte order.
 *        Byte order is either big-endian or little-endian;
 *        Intel processors are little-endian (low address holds the low byte).
 ****************************************************************************/
int enc_unicode_to_GBK_one(unsigned long ucs, unsigned short *gbk)
{
    assert(gbk != NULL);

    if ( ucs < 0x80 )
    {
        *gbk = (unsigned short)ucs;
        return 1;
    }

    /* Build the reverse table on first use. */
    if ( tab_UCS2_to_GBK == NULL )
    {
        if ( enc_stc_unicode_to_GBK_init() == 0 )
            return 0;
    }

    void *pvalue;

    pvalue = Table_get(tab_UCS2_to_GBK, (void*)ucs);
    if ( pvalue == TABLE_NO_KEY )
        return 0;

    /* The stored value is a GBK code, which fits in 16 bits. */
    *gbk = (unsigned short)(unsigned long)pvalue;

    return 2;
}

// #c---end
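The listing above depends on a table library that is not shown. To make the idea concrete, here is a minimal self-contained sketch of a reverse (Unicode to GBK) hash map. It is an illustration under assumptions rather than the author's implementation: it uses a small fixed-size open-addressing table with linear probing, all names (`fwd_tab`, `rev_hash`, `rev_init`, `demo_unicode_to_gbk_one`) are hypothetical, and the forward table is only a five-row excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Excerpt only: the first five rows of the GBK -> Unicode table. */
static const unsigned short fwd_tab[][2] = {
    {0x8140, 0x4E02}, {0x8141, 0x4E04}, {0x8142, 0x4E05},
    {0x8143, 0x4E06}, {0x8144, 0x4E0F},
};
#define FWD_COUNT (sizeof fwd_tab / sizeof fwd_tab[0])
#define SLOTS 16u   /* power of two, larger than FWD_COUNT */

static unsigned short rev_keys[SLOTS];  /* Unicode value; 0 = empty slot */
static unsigned short rev_vals[SLOTS];  /* GBK code */
static int rev_ready = 0;

/* Multiplicative hash reduced to a slot index in 0..SLOTS-1. */
static unsigned rev_hash(unsigned short ucs)
{
    return (unsigned)((((unsigned long)ucs * 2654435761UL) >> 28)
                      & (SLOTS - 1u));
}

/* Fills the reverse table from the forward table. */
static void rev_init(void)
{
    for (size_t i = 0; i < FWD_COUNT; i++) {
        unsigned short ucs = fwd_tab[i][1];
        if (ucs == 0x0001)
            continue;                       /* skip placeholder rows */
        unsigned slot = rev_hash(ucs);
        while (rev_keys[slot] != 0)         /* linear probing */
            slot = (slot + 1u) & (SLOTS - 1u);
        rev_keys[slot] = ucs;
        rev_vals[slot] = fwd_tab[i][0];
    }
    rev_ready = 1;
}

/* Same contract as enc_unicode_to_GBK_one: returns 1 for ASCII,
 * 2 for a two-byte GBK code, 0 if there is no mapping. */
static int demo_unicode_to_gbk_one(unsigned long ucs, unsigned short *gbk)
{
    assert(gbk != NULL);

    if (ucs < 0x80) {
        *gbk = (unsigned short)ucs;
        return 1;
    }
    if (ucs > 0xFFFF)
        return 0;                           /* outside the 16-bit keys */
    if (!rev_ready)
        rev_init();

    unsigned slot = rev_hash((unsigned short)ucs);
    while (rev_keys[slot] != 0) {
        if (rev_keys[slot] == ucs) {
            *gbk = rev_vals[slot];
            return 2;
        }
        slot = (slot + 1u) & (SLOTS - 1u);
    }
    return 0;
}
```

Using 0 as the empty-slot marker is safe here because values below 0x80 are handled before the table is consulted. A full-size table would need a slot count comfortably above the number of valid rows so that the probe loops always reach an empty slot.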