C++ Encoding Conversion - lc0060305 - ChinaUnix Blog

Blog of 李庚睿 (lgr) -- 蔚蓝天空 garry.blog.chinaunix.net

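I. Encoding conversion with the iconv function family

On Linux, encoding conversion can be done either programmatically with the iconv function family or with the iconv command. The command works on files: it converts a given file from one encoding to another.

The iconv functions are declared in the header iconv.h, which must be included before use.
#include <iconv.h>
The family consists of three functions, with the following prototypes:
(1) iconv_t iconv_open(const char *tocode, const char *fromcode);
This function declares which two encodings to convert between: tocode is the target encoding and fromcode is the source encoding. It returns a conversion descriptor for use by the other two functions, or (iconv_t)-1 on failure.
(2) size_t iconv(iconv_t cd,char **inbuf,size_t *inbytesleft,char **outbuf,size_t *outbytesleft);
This function reads from inbuf, converts, and writes the result to outbuf. inbytesleft tracks the number of input bytes not yet converted, and outbytesleft tracks the space remaining in the output buffer, in bytes.
(3) int iconv_close(iconv_t cd);
This function closes the conversion descriptor and releases its resources.
Example 1: a conversion program in C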

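/* f.c : encoding conversion example in C */
#include <stdio.h>
#include <string.h>
#include <iconv.h>

#define OUTLEN 255

int u2g(char *inbuf, size_t inlen, char *outbuf, size_t outlen);
int g2u(char *inbuf, size_t inlen, char *outbuf, size_t outlen);

int main(void)
{
    /* "正在安装" written as explicit bytes, so the test data does not
       depend on the encoding this source file is saved in */
    char in_utf8[] = "\xE6\xAD\xA3\xE5\x9C\xA8\xE5\xAE\x89\xE8\xA3\x85";
    char in_gb2312[] = "\xD5\xFD\xD4\xDA\xB0\xB2\xD7\xB0";
    char out[OUTLEN];
    int rc;

    /* UTF-8 to GB2312 */
    rc = u2g(in_utf8, strlen(in_utf8), out, OUTLEN);
    printf("utf-8-->gb2312 rc=%d out=%s\n", rc, out);
    /* GB2312 to UTF-8 */
    rc = g2u(in_gb2312, strlen(in_gb2312), out, OUTLEN);
    printf("gb2312-->utf-8 rc=%d out=%s\n", rc, out);
    return 0;
}

/* Convert inbuf from one encoding to another; returns 0 on success, -1 on failure */
int code_convert(const char *from_charset, const char *to_charset,
                 char *inbuf, size_t inlen, char *outbuf, size_t outlen)
{
    iconv_t cd;
    size_t rc;

    if (outlen == 0) return -1;
    cd = iconv_open(to_charset, from_charset);
    if (cd == (iconv_t)-1) return -1;  /* failure is (iconv_t)-1, not 0 */
    memset(outbuf, 0, outlen);
    outlen--;                          /* keep one byte for the terminating NUL */
    rc = iconv(cd, &inbuf, &inlen, &outbuf, &outlen);
    iconv_close(cd);                   /* close the descriptor on every path */
    return rc == (size_t)-1 ? -1 : 0;
}

/* UTF-8 to GB2312 */
int u2g(char *inbuf, size_t inlen, char *outbuf, size_t outlen)
{
    return code_convert("utf-8", "gb2312", inbuf, inlen, outbuf, outlen);
}

/* GB2312 to UTF-8 */
int g2u(char *inbuf, size_t inlen, char *outbuf, size_t outlen)
{
    return code_convert("gb2312", "utf-8", inbuf, inlen, outbuf, outlen);
}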

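Example 2: a conversion program in C++

/* f.cpp : encoding conversion example in C++ */
#include <iconv.h>
#include <cstring>
#include <iostream>

#define OUTLEN 255

using namespace std;

// Encoding conversion class
class CodeConverter {
private:
    iconv_t cd;
    // The descriptor must be closed exactly once, so copying is disabled
    CodeConverter(const CodeConverter &);
    CodeConverter &operator=(const CodeConverter &);
public:
    // Constructor
    CodeConverter(const char *from_charset, const char *to_charset) {
        cd = iconv_open(to_charset, from_charset);
    }

    // Destructor
    ~CodeConverter() {
        if (cd != (iconv_t)-1) iconv_close(cd);
    }

    // Convert inbuf into outbuf; returns 0 on success, -1 on failure
    int convert(char *inbuf, size_t inlen, char *outbuf, size_t outlen) {
        if (cd == (iconv_t)-1 || outlen == 0) return -1;
        memset(outbuf, 0, outlen);
        outlen--;  // keep one byte for the terminating NUL
        // The lengths are size_t: casting an int* to size_t* breaks
        // wherever the two types differ in size
        size_t rc = iconv(cd, &inbuf, &inlen, &outbuf, &outlen);
        return rc == (size_t)-1 ? -1 : 0;
    }
};

int main(int argc, char **argv)
{
    // "正在安装" written as explicit bytes, so the test data does not
    // depend on the encoding this source file is saved in
    char in_utf8[] = "\xE6\xAD\xA3\xE5\x9C\xA8\xE5\xAE\x89\xE8\xA3\x85";
    char in_gb2312[] = "\xD5\xFD\xD4\xDA\xB0\xB2\xD7\xB0";
    char out[OUTLEN];

    // utf-8-->gb2312
    CodeConverter cc("utf-8", "gb2312");
    cc.convert(in_utf8, strlen(in_utf8), out, OUTLEN);
    cout << "utf-8-->gb2312 in=" << in_utf8 << ",out=" << out << endl;

    // gb2312-->utf-8
    CodeConverter cc2("gb2312", "utf-8");
    cc2.convert(in_gb2312, strlen(in_gb2312), out, OUTLEN);
    cout << "gb2312-->utf-8 in=" << in_gb2312 << ",out=" << out << endl;
    return 0;
}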

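II. Encoding conversion with the iconv command

The iconv command converts the encoding of the given files. It writes to standard output by default; an output file can also be specified.

Usage: iconv [OPTION...] [FILE...]

The following options are available:

Input/output format specification:
-f, --from-code=NAME       encoding of the original text
-t, --to-code=NAME         encoding for the output

Information:
-l, --list                 list all known character sets

Output control:
-c                         omit invalid characters from the output
-o, --output=FILE          output file
-s, --silent               suppress warnings
--verbose                  print progress information

-?, --help                 give this help list
--usage                    give a short usage message
-V, --version              print the program version

Example:
iconv -f utf-8 -t gb2312 aaa.txt >bbb.txt
This command reads aaa.txt, converts it from UTF-8 to GB2312, and redirects the output to bbb.txt.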

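Summary: Linux provides powerful and convenient tools for encoding conversion.

glibc ships a set of conversion functions, iconv. They are easy to use and recognize a large number of encodings, so consider them whenever a program needs to convert between encodings.

Usage of the iconv command:

$ iconv --list # list the recognized encoding names
$ iconv -f GB2312 -t UTF-8 a.html > b.html # convert a.html from GB2312 to UTF-8 and store the result in b.html
$ iconv -f GB2312 -t BIG5 a.html > b.html # convert a.html from GB2312 to BIG5 and store the result in b.html

Programming with iconv involves the following glibc calls:

#include <iconv.h>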

iconv_t iconv_open(const char *tocode, const char *fromcode);
int iconv_close(iconv_t cd);

size_t iconv(iconv_t cd,
char **inbuf, size_t *inbytesleft,
char **outbuf, size_t *outbytesleft);

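To convert with iconv, first obtain a conversion descriptor with iconv_open, then call iconv to convert, and finally call iconv_close to close the descriptor.

The parameters of iconv are:

cd is the conversion descriptor returned by iconv_open;
inbuf is the address of a pointer to the buffer to be converted;
inbytesleft points to the number of bytes in that buffer still to be converted;
outbuf is the address of a pointer to the buffer that receives the result;
outbytesleft points to the number of bytes available in the output buffer.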
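
To make the in/out bookkeeping concrete, here is a minimal sketch of a single iconv call that reports how far the pointers moved. The names StepResult and convert_once are hypothetical helpers introduced for illustration, and the sketch assumes the glibc prototype shown above, where inbuf is a char **.

```cpp
#include <iconv.h>
#include <cstddef>

// Hypothetical helper type: outcome of one iconv() call.
struct StepResult {
    std::size_t consumed;  // input bytes iconv() advanced past
    std::size_t produced;  // output bytes iconv() wrote
    bool ok;               // false if iconv_open() or iconv() failed
};

// Runs a single iconv() call and reports how far the pointers moved.
StepResult convert_once(const char *to, const char *from,
                        const char *src, std::size_t src_len,
                        char *dst, std::size_t dst_len)
{
    StepResult r = {0, 0, false};
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) return r;

    char *in = const_cast<char *>(src);  // iconv() only reads the input bytes
    char *out = dst;
    std::size_t in_left = src_len;
    std::size_t out_left = dst_len;

    std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);

    r.consumed = src_len - in_left;   // same value as in - src
    r.produced = dst_len - out_left;  // same value as out - dst
    r.ok = (rc != (std::size_t)-1);
    iconv_close(cd);
    return r;
}
```

For instance, converting the 12 UTF-8 bytes of "正在安装" to GB2312 should report 12 bytes consumed and 8 produced, provided a GB2312 converter is installed on the system.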

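On success, iconv returns the number of characters converted in a non-reversible way during the call; reversible conversions are not counted. On failure it returns (size_t)(-1) and sets errno.
iconv scans inbuf step by step. For each character converted, it advances *inbuf and decreases *inbytesleft, writes the result at *outbuf, advances *outbuf, and decreases *outbytesleft by the number of bytes written. It stops scanning and returns in the following cases:

1. An invalid multibyte sequence is encountered: errno is EILSEQ and *inbuf points to the start of the invalid sequence;
2. The input ends with an incomplete multibyte sequence: errno is EINVAL and *inbuf points to the start of the incomplete sequence;
3. The output buffer has no room for the next converted character: errno is E2BIG;
4. The input has been converted completely.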
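
The four stop conditions can be handled in one loop. The following is a sketch, not a definitive implementation: convert_all is a hypothetical helper, the 16-byte chunk is deliberately tiny so that E2BIG actually occurs, and the final flush of shift state needed by stateful encodings is left out.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper: converts src and appends the result to out.
// Returns 0 on success, or the errno value that stopped the conversion.
int convert_all(const char *to, const char *from,
                const std::string &src, std::string &out)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) return errno;

    char chunk[16];  // small on purpose, to exercise the E2BIG path
    char *in = const_cast<char *>(src.data());
    std::size_t in_left = src.size();
    int err = 0;

    while (in_left > 0) {
        char *dst = chunk;
        std::size_t dst_left = sizeof chunk;
        std::size_t rc = iconv(cd, &in, &in_left, &dst, &dst_left);
        int saved = errno;                 // capture before any other call
        out.append(chunk, dst - chunk);    // keep whatever was produced
        if (rc != (std::size_t)-1) break;  // case 4: input fully converted
        if (saved == E2BIG && dst != chunk)
            continue;                      // case 3: chunk full, go again
        err = saved;                       // case 1 (EILSEQ), case 2 (EINVAL),
        break;                             // or a chunk too small to progress
    }
    iconv_close(cd);
    return err;
}
```

A caller can compare the return value with EILSEQ or EINVAL to decide whether to report the error or to skip the offending bytes and call again.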

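iconv can also be called in two other ways:

1. inbuf or *inbuf is NULL, while outbuf and *outbuf are not NULL: iconv resets the conversion state to the initial state and stores the byte sequence needed to get there (the shift sequence) at *outbuf. If the output buffer is too small, errno is set to E2BIG and (size_t)(-1) is returned;
2. inbuf or *inbuf is NULL, and outbuf or *outbuf is also NULL: iconv resets the conversion state to the initial state.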
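
These two call forms might be wrapped as follows. This is a sketch under assumptions: finish_conversion and reset_conversion are hypothetical names, and the 32-byte tail buffer is assumed to be large enough for any shift sequence.

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>

// Case 1: ask iconv() to write any pending shift sequence and return the
// descriptor to its initial state. The bytes produced are appended to out.
// Returns false if the call fails (for example with E2BIG).
bool finish_conversion(iconv_t cd, std::string &out)
{
    char tail[32];
    char *dst = tail;
    std::size_t dst_left = sizeof tail;
    if (iconv(cd, NULL, NULL, &dst, &dst_left) == (std::size_t)-1)
        return false;
    out.append(tail, dst - tail);
    return true;
}

// Case 2: drop any pending state without producing output, so the
// descriptor can be reused for an unrelated input.
void reset_conversion(iconv_t cd)
{
    iconv(cd, NULL, NULL, NULL, NULL);
}
```

With a stateless target encoding the first call appends nothing; it matters for stateful encodings such as ISO-2022-JP, where the output should end in the initial shift state.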

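The iconv command is convenient, but it stops as soon as it hits a problem during conversion. Sometimes we would rather skip the byte sequences that cannot be converted and carry on. The following program does that.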

/**
* siconv.cpp - A simple way to demonstrate the usage of iconv calling
*
* Report bugs to
* July 15th, 2006
*/
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <iconv.h>
#include <string>

#ifdef DEBUG
#define TRACE(fmt, args...) fprintf(stderr, "%s:%s:%d:" fmt, \
__FILE__, __FUNCTION__, __LINE__, ##args)
#else
#define TRACE(fmt, args...)
#endif

#define CONVBUF_SIZE 32767

/* errno comes from <errno.h>; it may be a macro and must not be redeclared */

void print_err(const char *fmt, ...)
{
va_list ap;

va_start(ap, fmt);
vfprintf(stderr, fmt, ap);
va_end(ap);
}

int print_out(const char* buf, size_t num)
{
if (num != fwrite(buf, 1, num, stdout)) {
return -1;
}

return 0;
}

void print_usage() {

print_err("Usage: siconv -f encoding -t encoding [-c] "
"input-file\n");
}

int iconv_string(const char *from, const char *to,
     const char *src, size_t len,
     ::std::string& result,
     int c = 0, size_t buf_size = 512)
{
iconv_t cd;

char *pinbuf = const_cast< char* >(src);
size_t inbytesleft = len;
char *poutbuf = NULL;
size_t outbytesleft = buf_size;

char *dst = NULL;
size_t retbytes = 0;
int done = 0;
int errno_save = 0;

if ((iconv_t)-1 == (cd = iconv_open(to, from))) {
   return -1;
}

dst = new char[buf_size];

while(inbytesleft > 0 && !done) {
   poutbuf = dst;
   outbytesleft = buf_size;

   TRACE("TARGET - in:%p pin:%p left:%zu\n", src, pinbuf, inbytesleft);
   retbytes = iconv(cd, &pinbuf, &inbytesleft, &poutbuf, &outbytesleft);
   errno_save = errno;

   if (dst != poutbuf) {// we have something to write
    TRACE("OK - in:%p pin:%p left:%zu done:%td buf:%td\n",
     src, pinbuf, inbytesleft, pinbuf-src, poutbuf-dst);
    result.append(dst, poutbuf-dst);
   }

   if (retbytes != (size_t)-1) {
    poutbuf = dst;
    outbytesleft = buf_size;
    (void)iconv(cd, NULL, NULL, &poutbuf, &outbytesleft);

    if (dst != poutbuf) {// we have something to write
      TRACE("OK - in:%p pin:%p left:%zu done:%td buf:%td\n",
      src, pinbuf, inbytesleft, pinbuf-src, poutbuf-dst);
     result.append(dst, poutbuf-dst);
    }

    errno_save = 0;
    break;
   }


   TRACE("FAIL - in:%p pin:%p left:%zu done:%td buf:%td\n",
    src, pinbuf, inbytesleft, pinbuf-src, poutbuf-dst);

   switch(errno_save) {
   case E2BIG:
    TRACE("E E2BIG\n");
    break;
   case EILSEQ:
    TRACE("E EILSEQ\n");
    if (c) {
     errno_save = 0;
     inbytesleft = len-(pinbuf-src); // forward one illegal byte
     inbytesleft--;
     pinbuf++;
     break;
    }

    done = 1;
    break;
   case EINVAL:
    TRACE("E EINVAL\n");
    done = 1;
    break;
   default:
    TRACE("E Unknown:[%d]%s\n", errno_save, strerror(errno_save));
    done = 1;
    break;
   }

}

delete[] dst;
iconv_close(cd);

errno = errno_save;
return (errno_save) ? -1 : 0;
}

int conv_file_fd(const char* from, const char *to, int fd,
     ::std::string& result, int c)
{
struct stat st;
void *start;

if (0 != fstat(fd, &st)) {
   return -1;
}

if (0 == st.st_size) {// mmap rejects a zero length; an empty file converts to nothing
   return 0;
}

start = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

if (MAP_FAILED == start) {
   return -1;
}

if (iconv_string(from, to, (char*)start,
   st.st_size, result, c, CONVBUF_SIZE) < 0) {
   int errno_save = errno;
   munmap(start, st.st_size);
   TRACE("\n");
   errno = errno_save;
   return -1;
}

munmap(start, st.st_size);
return 0;
}

int conv_file(const char* from, const char* to,
     const char* filename, int c)
{
::std::string result;
FILE *fp;

if (NULL == (fp=fopen(filename, "rb"))) {
   print_err("open file %s:[%d]%s\n", filename,
    errno, strerror(errno));
   return -1;
}

if (conv_file_fd(from, to, fileno(fp), result, c) < 0) {
   print_err("conv file fd:[%d]%s\n", errno, strerror(errno));
   fclose(fp);
   return -1;
}

print_out(result.data(), result.size());
fclose(fp);
return 0;
}

int main(int argc, char *argv[])
{
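#ifdef TESTCASE
// The literals below are GBK input only if this source file is saved in GBK;
// substr() then cuts a two-byte character in half to create an illegal sequence.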
::std::string strA = "欢迎(welcome ^_^)来到(to)首都北京。";
::std::string strB = "大喊一声：We are chinese <=> 我们都是中国人。";

::std::string strC = strA.substr(0, 20) + strB.substr(0, 41);
::std::string result;
if (iconv_string("GBK", "UTF-8", strC.data(), strC.size(), result, 1) < 0)
{
   TRACE("ERROR [%d]%s\n", errno, strerror(errno));
}

TRACE("CONVERSION RESULT:");
result.append("\n");
print_out(result.data(), result.size());

return 0;
#else
::std::string from, to;
::std::string input_file;
int o;
int c = 0;

// keep the getopt result in o, so the -c flag stored in c is not overwritten
while (-1 != (o = getopt(argc, argv, "f:t:c")))
{
   switch(o) {
   case 'f':
    from = optarg;
    break;
   case 't':
    to = optarg;
    break;
   case 'c':
    c = 1;
    break;
   default:
    return -1;
   }
}

if (from.empty() || to.empty() || optind != (argc-1))
{
   print_usage();
   return -1;
}

input_file = argv[optind++];

return conv_file(from.c_str(), to.c_str(),
   input_file.c_str(), c);
#endif
}

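Memory-mapping the input file handles the case where the file is too large for an in-memory buffer. As with the iconv command, the -c option makes the program skip the problems that may come up during conversion.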

$ g++ -o siconv siconv.cpp

Adding -DDEBUG to the command line compiles in the debug statements. Adding -DTESTCASE compiles only the test of the iconv_string function.
