Processing the CSV Format with Python

2008-10-13 16:14:26

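Here, the CSV format is simply a string format delimited by ','. Before getting to the main topic, let me tell a little story first.

The company handed out last year's XX award, and a Notes email arrived with several hundred recipients. I wanted to count exactly how many there were, and to see whether someone named smileonce had a share in this sea of people. What to do? Stare at the screen and count them one by one?

I'm not that dumb, and of course I won't waste that much time. Otherwise, wouldn't the company be wasting its money on me? :)
A coder has a coder's way, of course: spend 1 minute and let a program handle it. (1 minute, believe it; that is actually a generous estimate.)

The Notes recipient list looks like this: zhang3@xxx.net,li4@xxx.net,wang5@xxx.net,...,wanyan108@xxx.net,xiahou210@xxx.net
It is just one long string like that, so I copy it to the clipboard first.

Then I bring out my secret weapon, PythonWin: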
>>> str = "zhang3@xxx.net,li4@xxx.net,wang5@xxx.net,...,wanyan108@xxx.net,xiahou210@xxx.net"
>>> lst = str.split(',')
>>> print("The count of person is : %i" % len(lst))
The count of person is : 766

>>> # assign my name to x
>>> x = "smileonce@xxx.net"
>>> if x in lst:
...     print("%s is in the list, index is: %i" % (x, lst.index(x)))
... else:
...     print("Sorry, %s is NOT in the list !" % x)
...
Sorry, smileonce@xxx.net is NOT in the list !

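What a pity, no windfall for me this time either! :p
(About half of this example is made up; any resemblance is purely coincidental!)

OK, having come this far, I probably don't need to explain how to handle CSV; it is the same kind of child's play. What I want to say is that no language is inherently superior or inferior; what matters is whether you can wield it with ease!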
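One caveat worth a small sketch: a plain split(',') is enough for the recipient string above, but CSV data that contains quoted fields with embedded commas needs a real parser. The example below uses the standard library's csv module. The helper name parse_recipients and the sample line are made up for illustration and are not from the original post.

```python
import csv

def parse_recipients(line):
    """Split one CSV line into fields, honoring double-quoted fields."""
    # csv.reader accepts any iterable of lines, so a one-element list is enough.
    return next(csv.reader([line]))

# Hypothetical sample: the second field contains a comma inside quotes.
line = 'zhang3@xxx.net,"Li, Si <li4@xxx.net>",wang5@xxx.net'

naive = line.split(',')          # splits inside the quoted field too
fields = parse_recipients(line)  # keeps the quoted field in one piece

print("split(',') sees %i fields" % len(naive))
print("csv.reader sees %i fields" % len(fields))
print(fields[1])
```

For input that is known to contain no quotes, as in the story above, split(',') remains the shortest route.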

-------------
Written by 乾坤一笑 on May 16, 2005. If you repost this, please credit the source and link to the original.

--------------------next---------------------
#include <algorithm>
#include <iostream>
using namespace std;

int main( void )
{
    const char str[] = "zhang3@xxx.net,smileonce@xxx.net,li4@xxx.net,wang5@xxx.net,wanyan108@xxx.net,xiahou210@xxx.net";
    // one past the last character, excluding the trailing '\0'
    const char* const end = str + sizeof(str)/sizeof(str[0]) - 1;
    // lst = str.split(',')
    // print "The count of person is : %i" % len(lst)
    cout << "The count of person is : " << count( str, end, ',' )+1 << endl;

    // x = "smileonce@xxx.net"
    const char substr[] = "smileonce@xxx.net";
    const char* const subend = substr + sizeof(substr)/sizeof(substr[0]) - 1;
    // if x in lst:
    // note: search() finds a substring match, not necessarily a whole field
    const char* p = search( str, end, substr, subend );
    if( p != end ) // search() returns end when nothing matches
        // print "%s is in the list, index is: %i" % ( x, lst.index(x))
        cout << substr << " is in the list, index is: " << count( str, p, ',' ) << endl;
    else
        // print "Sorry, %s is NOT in the list !" % x
        cout << "Sorry, " << substr << " is NOT in the list !" << endl;

    return 0;
}
--------------------next---------------------
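Have you considered that split, whether in Python or in jsp, has a problem? Even if the value returned by split stored only the index of each substring within the source string, the memory it consumes would be far from negligible when the source string is huge, and that memory is redundant.
With the C function strtok, you need only a single pass and you do not consume a large amount of memory.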

#include <cstring>
#include <iostream>
using namespace std;

int main( void )
{
    // strtok() overwrites the separators in place, so str must be a writable array
    char str[] = "zhang3@xxx.net,smileonce@xxx.net,li4@xxx.net,wang5@xxx.net,wanyan108@xxx.net,xiahou210@xxx.net";
    char substr[] = "smileonce@xxx.net";

    const size_t npos = static_cast<size_t>(-1); // sentinel meaning "not found"
    size_t sum = 0;
    size_t index = npos;
    for( char* token=strtok(str,","); token!=0; ++sum, token=strtok(0,",") )
        if( strcmp(token,substr) == 0 ) index = sum; // 0-based, like Python's lst.index(x)

    cout << "The count of person is : " << sum << endl;
    if( index != npos )
        cout << substr << " is in the list, index is: " << index << endl;
    else
        cout << "Sorry, " << substr << " is NOT in the list !" << endl;

    return 0;
}
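The single-pass idea in the strtok version can also be sketched in Python with a generator, so that the full list returned by split is never built. This is only an illustration under stated assumptions: iter_fields and count_and_find are hypothetical helper names, not standard library functions. Unlike strtok, this sketch does not modify the input and it keeps empty fields, so its results match str.split.

```python
def iter_fields(text, sep=","):
    """Yield the fields of text one at a time without building a list."""
    start = 0
    while True:
        end = text.find(sep, start)
        if end == -1:
            # No more separators: the rest of the string is the last field.
            yield text[start:]
            return
        yield text[start:end]
        start = end + len(sep)

def count_and_find(text, target, sep=","):
    """Return (number of fields, 0-based index of target or -1) in one pass."""
    total = 0
    index = -1
    for i, field in enumerate(iter_fields(text, sep)):
        if index == -1 and field == target:
            index = i
        total = i + 1
    return total, index

recipients = ("zhang3@xxx.net,smileonce@xxx.net,li4@xxx.net,"
              "wang5@xxx.net,wanyan108@xxx.net,xiahou210@xxx.net")
total, index = count_and_find(recipients, "smileonce@xxx.net")
print("The count of person is : %i" % total)
print("index is: %i" % index)
```

Each field is still copied into a new string when it is yielded, but only one field is alive at a time, which is the point of the memory argument above.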

--------------------next---------------------

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main()
{
    string str = "zhang3@xxx.net,smileonce@xxx.net,li4@xxx.net,wang5@xxx.net,wanyan108@xxx.net,xiahou210@xxx.net";
    string substr = "smileonce@xxx.net";

    // split on ',' only
    char_separator<char> sep(",");

    typedef tokenizer<char_separator<char> > char_token;
    char_token tokens(str, sep);

    int index = -1; // 0-based position of substr, or -1 if not found
    int sum = 0;    // number of tokens seen so far
    for (char_token::iterator tok_iter = tokens.begin(); tok_iter != tokens.end(); ++tok_iter) {
        std::cout << sum << ": " << *tok_iter << "\n";
        if( *tok_iter == substr)
            index = sum;
        sum++;
    }

    cout << "The count of person is : " << sum << endl;
    if( index != -1 )
        cout << substr << " is in the list, index is: " << index << endl;
    else
        cout << "Sorry, " << substr << " is NOT in the list !" << endl;
    return 0;
}

//cl /GX /MD /I"D:\boost\boost_1_32_0" token.cpp

--------------------next---------------------
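Thanks to everyone for posting comparison programs; I learned a lot!
To answer 星星's question: it is not that I can't write this in C, I just don't want to waste time on such a small matter. (My rough estimate is that I could not get it done in C within 1 minute, because it would need debugging; in this kind of situation Python solves the problem in a few dozen seconds, and I can be fairly sure there are no hidden bugs.)
Also, Python's runtime efficiency is indeed low, lower even than Java's. So it tends to be used where runtime efficiency is not required but development speed is urgently needed.
For example, today I was adding a complicated array to my C code to define the address layout of a 128Mb ROM, with 16×64K as one block, producing an address sequence of 128 elements. It looks roughly like this: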
int addr = {
  { 0x0000000, 0x0010000, 0x0020000, ...},
  ....
}
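Writing this by hand would take who knows how long, copy/paste makes it easy to introduce mistakes, and writing C code to generate it is overkill that wastes time on debugging. So I used Python and wrote three nested loops to generate the code for this array. It took hardly any time, again about 1 minute. :)
Of course, Python's power is not limited to this. But since I am learning as I go, I expect to get more fluent with use and gradually bring out more of Python's strengths.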
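The original generator script was not posted, so the following is only a guess at what such a throwaway script might look like. The layout is an assumption: 8 rows of 16 entries (128 addresses in total) spaced 0x10000 (64K) apart, which matches the first three values shown in the fragment above. The function name make_addr_table and the array dimensions in the emitted declaration are hypothetical, and this sketch uses two loops rather than the three the comment mentions.

```python
def make_addr_table(rows=8, cols=16, step=0x10000):
    """Return C source text for a rows x cols table of addresses spaced by step."""
    # The dimensions in the declaration are an assumption, not from the post.
    lines = ["int addr[%d][%d] = {" % (rows, cols)]
    for r in range(rows):
        # Addresses run consecutively across rows: entry number r*cols + c.
        cells = ["0x%07X" % ((r * cols + c) * step) for c in range(cols)]
        lines.append("  { " + ", ".join(cells) + " },")
    lines.append("};")
    return "\n".join(lines)

# Print the generated C code so it can be pasted into the source file.
print(make_addr_table())
```

Changing rows, cols, or step regenerates the whole table, which is exactly the kind of edit that is error-prone with copy/paste.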
--------------------next---------------------
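As long as your compiler supports it, boost is quite pleasant to use if you just read the documentation and don't look at the implementation; it feels a bit like scripting. Unfortunately the vc6 I use is too weak and does not support much of it.

I think most C++ programmers have the habit of always wanting to see how things are implemented, and they end up dizzy from reading boost without really understanding it. Honestly, without a solid grounding in templates some boost libraries read like hieroglyphics, and they demand a lot from the compiler, which may also have limited boost's adoption.

Still, boost is no ordinary library. I trust what these elites of the C++ community have written, and sometimes I just pretend I am an idiot and use it as is. Do you know how Python implements split under the hood? Probably not. So you don't need to know how boost/tokenizer is implemented either. Sometimes it pays to be a little muddle-headed!

split probably won't be standardized; after all, once you have an iterator range, you can call std::copy with back_inserter to put it into a container.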
--------------------next---------------------
