Building a Unicode Code Table for the GB2312 Chinese Character Set - UGxxoVr - ChinaUnix Blog

2008-10-13 16:14:12

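Embedded systems cannot avoid handling Chinese characters. The usual approach, taking SMS reception on a mobile phone as an example, goes like this: you receive a text message which, once decoded, is represented in UTF-16. For each Chinese character we take its Unicode code, find the character's position in the GB2312 font library, and then draw the corresponding bitmap data on the screen.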

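So there has to be some way to tie a Unicode code to the glyph data of the character. The most common way is to build a Unicode code table: search that array for the matching Unicode code, then use the index of the match to fetch the record at the same index in a second array that holds the glyph data, and display it.
+----------------------+  lookup  +---------------------+  same index  +------------------+
| character's Unicode  |  ====>   | Unicode table array |  =========>  | glyph data array |  ==> display
+----------------------+          +---------------------+              +------------------+
This article briefly shows how to generate the Unicode code table. Other Chinese-text processing techniques are outside its scope. :)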

The Unicode code table can be built with the following two functions (*Note 1):

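/* Both functions use CP_ACP, the system ANSI code page, so they assume a
   Simplified Chinese Windows where that code page is 936. */
void UnicodeToGB2312(unsigned char* pOut,unsigned short uData)
{
    /* one UTF-16 code unit in, up to two GB2312 bytes out; dwFlags must be 0, not NULL */
    WideCharToMultiByte(CP_ACP,0,(LPCWSTR)&uData,1,(LPSTR)pOut,sizeof(unsigned short),NULL,NULL);
    return;
}     
 
void Gb2312ToUnicode(unsigned short* pOut,unsigned char *gbBuffer)
{
    /* two GB2312 bytes in, one UTF-16 code unit out */
    MultiByteToWideChar(CP_ACP,MB_PRECOMPOSED,(LPCSTR)gbBuffer,2,(LPWSTR)pOut,1);
    return;
}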

A simple example follows (code dashed off quickly, only to demonstrate how the array is built, so please don't nitpick! ^_^):

/*-----------------------------------------------*\
|  GB2312 unicode table constructor               |
|  author: Spark Song                             |
|  file  : build_uni_table.c                      |
|  date  : 2005-11-18                             |
\*-----------------------------------------------*/

#include <stdio.h>    /* printf */
#include <windows.h>  /* WideCharToMultiByte, MultiByteToWideChar */


void UnicodeToGB2312(unsigned char* pOut,unsigned short uData);
void Gb2312ToUnicode(unsigned short* pOut,unsigned char *gbBuffer);
void construct_unicode_table();

int main(int argc, char *argv[])
{
	construct_unicode_table();
	return 0;
}

void construct_unicode_table()
{
    #define GB2312_MATRIX   (94)
    #define DELTA           (0xA0)
    #define FONT_ROW_BEGIN (16  + DELTA)
    #define FONT_ROW_END   (87 + DELTA)
    #define FONT_COL_BEGIN (1  + DELTA)
    #define FONT_COL_END   (GB2312_MATRIX + DELTA)
    #define FONT_TOTAL     (72 * GB2312_MATRIX)

    int i, j;
    unsigned char   chr[2];
    unsigned short  uni;
    unsigned short  data[FONT_TOTAL] = {0};
    int index = 0;
    unsigned short buf;

    // Build the Unicode table: walk rows 16..87, columns 1..94 of GB2312
    for (i=FONT_ROW_BEGIN; i<=FONT_ROW_END; i++)
        for(j=FONT_COL_BEGIN; j<=FONT_COL_END; j++)
        {
            chr[0] = i; 
            chr[1] = j;
            Gb2312ToUnicode(&uni, chr);
            data[index] = uni; index++;
        }


    // Sort the table so that lookups can use binary search later
    for (i=0;i<index-1; i++)
        for(j=i+1; j<index; j++)
            if (data[i]>data[j])
            {
                buf = data[i]; 
                data[i] = data[j];
                data[j] = buf;
            }            
    
    // Write the table to stdout as a C array
    printf("const unsigned short uni_table[]={\n");

    for (i=0; i<index; i++)
    {
        uni = data[i];
        UnicodeToGB2312(chr, uni);

        printf("    0x%.4X%s /* GB2312 Code: 0x%.2X%.2X ==> Row:%.2d Col:%.2d */\n", 
                uni, 
                i==index-1?" ":",",
                chr[0],
                chr[1],
                chr[0] - DELTA,
                chr[1] - DELTA
                );
    }

    printf("};\n");
    return ;
}


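/* Both functions use CP_ACP, the system ANSI code page, so they assume a
   Simplified Chinese Windows where that code page is 936. */
void UnicodeToGB2312(unsigned char* pOut,unsigned short uData)
{
    /* one UTF-16 code unit in, up to two GB2312 bytes out; dwFlags must be 0, not NULL */
    WideCharToMultiByte(CP_ACP,0,(LPCWSTR)&uData,1,(LPSTR)pOut,sizeof(unsigned short),NULL,NULL);
    return;
}     
 
void Gb2312ToUnicode(unsigned short* pOut,unsigned char *gbBuffer)
{
    /* two GB2312 bytes in, one UTF-16 code unit out */
    MultiByteToWideChar(CP_ACP,MB_PRECOMPOSED,(LPCSTR)gbBuffer,2,(LPWSTR)pOut,1);
    return;
}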
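
The macros in construct_unicode_table() encode a fixed relation: a GB2312 byte is the row or column number plus 0xA0, and the Chinese characters occupy rows 16 to 87 with 94 columns each. The following is a minimal, portable sketch of that arithmetic, not part of the original program. The names gb2312_is_hanzi and gb2312_linear_index are made up for illustration, and the "linear index" is simply the position at which the generator loop visits a cell before the table is sorted.

```c
#include <assert.h>
#include <stddef.h>

#define GB2312_COLS      94    /* cells per row */
#define GB2312_DELTA     0xA0  /* byte value = row/column number + 0xA0 */
#define HANZI_ROW_FIRST  16    /* first row holding Chinese characters */
#define HANZI_ROW_LAST   87    /* last row holding Chinese characters */

/* Hypothetical helper: returns 1 if (hi, lo) lies in the row/column area
   that the generator loop walks, 0 otherwise. */
static int gb2312_is_hanzi(unsigned char hi, unsigned char lo)
{
    int row = hi - GB2312_DELTA;
    int col = lo - GB2312_DELTA;
    return row >= HANZI_ROW_FIRST && row <= HANZI_ROW_LAST
        && col >= 1 && col <= GB2312_COLS;
}

/* Hypothetical helper: position of (hi, lo) in generation order,
   i.e. 0 for row 16 column 1, 1 for row 16 column 2, and so on. */
static size_t gb2312_linear_index(unsigned char hi, unsigned char lo)
{
    assert(gb2312_is_hanzi(hi, lo));
    return (size_t)(hi - GB2312_DELTA - HANZI_ROW_FIRST) * GB2312_COLS
         + (size_t)(lo - GB2312_DELTA - 1);
}
```

For example, the bytes 0xD2 0xBB give row 50, column 27, which is the 3223rd cell visited (linear index 3222).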

After compiling with VC6, run the program from a DOS prompt:
build_uni_table.exe > report.txt
This produces a text file like the following:

const unsigned short  uni_table[]={
    0x4E00, /* GB2312 Code: 0xD2BB ==> Row:50 Col:27 */
    0x4E01, /* GB2312 Code: 0xB6A1 ==> Row:22 Col:01 */
    0x4E03, /* GB2312 Code: 0xC6DF ==> Row:38 Col:63 */
    0x4E07, /* GB2312 Code: 0xCDF2 ==> Row:45 Col:82 */
... ...
    0x9F9F, /* GB2312 Code: 0xB9EA ==> Row:25 Col:74 */
    0x9FA0, /* GB2312 Code: 0xD9DF ==> Row:57 Col:63 */
    0xE810, /* GB2312 Code: 0xD7FA ==> Row:55 Col:90 */
    0xE811, /* GB2312 Code: 0xD7FB ==> Row:55 Col:91 */
    0xE812, /* GB2312 Code: 0xD7FC ==> Row:55 Col:92 */
    0xE813, /* GB2312 Code: 0xD7FD ==> Row:55 Col:93 */
    0xE814  /* GB2312 Code: 0xD7FE ==> Row:55 Col:94 */
};

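Then just copy the generated array into your project code and use it. The last five entries (0xE810 to 0xE814) are private-use code points: they come from the five unassigned cells at the end of row 55, which the loop also visits. In day-to-day development there are plenty of chances to write code that generates code, and it would be a waste for a coder not to use coding to help with their own work~ :)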
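
Because the generator sorts the table, a lookup at run time can use binary search. The sketch below is one possible way to do that and is not from the original article: the function name uni_table_find is hypothetical, and uni_table_excerpt holds only the six Han entries visible in the sample output above, so its indices do not match those of the full table. Since the returned index is what selects the glyph, the glyph data array would have to be stored in the same sorted order as the table.

```c
#include <assert.h>
#include <stddef.h>

/* Excerpt only: the six Han entries shown in the sample output.
   The real generated table is far longer. */
static const unsigned short uni_table_excerpt[] = {
    0x4E00, 0x4E01, 0x4E03, 0x4E07, 0x9F9F, 0x9FA0
};

/* Hypothetical helper: binary search over a table sorted in ascending
   order. Returns the index of `code`, or -1 if it is not in the table. */
static int uni_table_find(const unsigned short *table, size_t count,
                          unsigned short code)
{
    size_t lo = 0;
    size_t hi = count;            /* search the half-open range [lo, hi) */

    assert(table != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid] == code)
            return (int)mid;
        if (table[mid] < code)
            lo = mid + 1;         /* target is in the upper half */
        else
            hi = mid;             /* target is in the lower half */
    }
    return -1;
}
```

With the excerpt, uni_table_find(uni_table_excerpt, 6, 0x4E07) returns 3, and a code that is absent, such as 0x4E02, returns -1. The caller can then fall back to a placeholder glyph.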

--------------------------------
Note 1:
For background on converting between internal encodings, see these two documents on vckbase document online:
1) 《》
2) 《》
-------------
Written by 乾坤一笑 on November 29, 2005. When reposting, please credit the source and link to the original article.