Counters in a Multithreaded Environment - Bean


Category: LINUX

2011-11-12 11:07:40

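I recently needed to implement a counter in a multithreaded environment to count how often certain events occur. Below are some notes on what I learned. I would not dare call this original work; it is mostly study notes, and I will credit any sources I draw on.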

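As we know, count++ is not an atomic operation. A single increment really consists of three steps:

1 Load the value from the cache into a register

2 Add 1 in the register

3 Store the result back to the cache.

Because of timing, when several threads operate on the same global variable, these steps can interleave and increments get lost. This is what makes concurrent programming hard, and with multicore processors now the norm, the problem only becomes more visible.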

The simplest fix is to protect the counter with a lock, which was also my first solution. See the code below:

pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock(&count_lock);
global_int++;
pthread_mutex_unlock(&count_lock);

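Later, while searching online, I came across the __sync_fetch_and_add family of builtins and found the article that explains this family best; readers comfortable with English can go straight to the original.

The __sync_fetch_and_add family has twelve functions in total, providing atomic add, subtract, AND, OR, XOR, and NAND operations. As the name suggests, __sync_fetch_and_add fetches first and then adds, so it returns the value from before the addition. Take count = 4 as an example: calling __sync_fetch_and_add(&count, 1) returns 4, and count then becomes 5.

Where there is __sync_fetch_and_add, there is naturally also __sync_add_and_fetch, whose meaning is just as clear: add first, then return the new value. The two relate to each other exactly as i++ relates to ++i. Anyone who has paid their dues to Tan Haoqiang's C textbook will understand at once.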

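With this handy function we have a new solution: to have multiple threads increment a global variable, we no longer need a thread lock at all. The line below does the same thing as the line protected by pthread_mutex above, and it is also thread-safe.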

__sync_fetch_and_add( &global_int, 1 );

Here is the whole family of functions; the names make it clear what each one does.

type __sync_fetch_and_add (type *ptr, type value);

type __sync_fetch_and_sub (type *ptr, type value);

type __sync_fetch_and_or (type *ptr, type value);

type __sync_fetch_and_and (type *ptr, type value);

type __sync_fetch_and_xor (type *ptr, type value);

type __sync_fetch_and_nand (type *ptr, type value);

type __sync_add_and_fetch (type *ptr, type value);

type __sync_sub_and_fetch (type *ptr, type value);

type __sync_or_and_fetch (type *ptr, type value);

type __sync_and_and_fetch (type *ptr, type value);

type __sync_xor_and_fetch (type *ptr, type value);

type __sync_nand_and_fetch (type *ptr, type value);

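It is worth noting that type cannot be just anything. Let's look at the instruction that __sync_fetch_and_add disassembles to:

804889d: f0 83 05 50 a0 04 08 lock addl $0x1,0x804a050

We can see a lock in front of addl, and the machine code of this instruction begins with f0. This f0 is called an instruction prefix. Richard Blum divides instruction prefixes into the four groups listed below; interested readers can take a look. Honestly, I did not fully understand them myself, since the Intel instruction set manual is too thick and I have not had time to read it. In short, he explains that the lock prefix means exclusive access to a region of memory.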

❑ Lock and repeat prefixes

❑ Segment override and branch hint prefixes

❑ Operand size override prefix

❑ Address size override prefix

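As the article mentioned earlier explains, lock locks the FSB, the front side bus, which connects the processor to RAM; locking it stops other processors or cores from fetching data from RAM. (On newer processors, when the operand is cacheable and sits within a single cache line, the lock is generally enforced through the cache coherence protocol rather than by locking the whole bus.) Either way the operation is relatively expensive, so it is only practical for small pieces of memory. Think of memcpy: if we had to lock memory while operating on a large block, the cost would be far too high. This is why the type used with the __sync_fetch_and_add family is restricted to small scalars: int, long, long long, and their unsigned counterparts (more generally, GCC accepts integral scalar or pointer types that are 1, 2, 4, or 8 bytes long).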

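The program below is adapted from Alexander Sundler's original; the credit belongs to him. I merely studied his code and changed it slightly to compare the time taken by the two approaches. I am a beginner and would not dare plagiarize a master's work. My respects to him.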

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <linux/unistd.h>
#include <sys/syscall.h>
#include <sys/time.h>   /* gettimeofday(), struct timeval */
#include <errno.h>
#include <linux/types.h>
#include <time.h>
#define INC_TO 1000000 // one million...
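
/* Read the CPU time-stamp counter (x86 only). */
__u64 rdtsc(void)
{
    __u32 lo, hi;
    __asm__ __volatile__
    (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );
    return (__u64)hi << 32 | lo;
}

int global_int = 0;
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Kernel thread ID of the caller, obtained through the raw system call. */
pid_t gettid( void )
{
    return syscall( __NR_gettid );
}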

/* Worker that increments the counter with the atomic builtin. */
void *thread_routine( void *arg )
{
    int i;
    int proc_num = (int)(long)arg;
    __u64 begin, end;
    struct timeval tv_begin, tv_end;
    __u64 timeinterval;
    cpu_set_t set;

    /* Pin this thread to the CPU whose index was passed in. */
    CPU_ZERO( &set );
    CPU_SET( proc_num, &set );
    if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
    {
        perror( "sched_setaffinity" );
        return NULL;
    }

    begin = rdtsc();
    gettimeofday(&tv_begin, NULL);
    for (i = 0; i < INC_TO; i++)
    {
        // global_int++;
        __sync_fetch_and_add( &global_int, 1 );
    }
    gettimeofday(&tv_end, NULL);
    end = rdtsc();

    timeinterval = (tv_end.tv_sec - tv_begin.tv_sec) * 1000000 + (tv_end.tv_usec - tv_begin.tv_usec);
    fprintf(stderr, "proc_num :%d,__sync_fetch_and_add cost %llu CPU cycle,cost %llu us\n", proc_num, end - begin, timeinterval);
    return NULL;
}

/* Worker that increments the counter under the pthread mutex. */
void *thread_routine2( void *arg )
{
    int i;
    int proc_num = (int)(long)arg;
    __u64 begin, end;
    struct timeval tv_begin, tv_end;
    __u64 timeinterval;
    cpu_set_t set;

    /* Pin this thread to the CPU whose index was passed in. */
    CPU_ZERO( &set );
    CPU_SET( proc_num, &set );
    if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
    {
        perror( "sched_setaffinity" );
        return NULL;
    }

    begin = rdtsc();
    gettimeofday(&tv_begin, NULL);
    for (i = 0; i < INC_TO; i++)
    {
        pthread_mutex_lock(&count_lock);
        global_int++;
        pthread_mutex_unlock(&count_lock);
    }
    gettimeofday(&tv_end, NULL);
    end = rdtsc();

    timeinterval = (tv_end.tv_sec - tv_begin.tv_sec) * 1000000 + (tv_end.tv_usec - tv_begin.tv_usec);
    fprintf(stderr, "proc_num :%d,pthread lock cost %llu CPU cycle,cost %llu us\n", proc_num, end - begin, timeinterval);
    return NULL;
}

int main()
{
    int procs = 0;
    int i;
    pthread_t *thrs;

    // Getting number of CPUs
    procs = (int)sysconf( _SC_NPROCESSORS_ONLN );
    if (procs < 0)
    {
        perror( "sysconf" );
        return -1;
    }

    thrs = malloc( sizeof( pthread_t ) * procs );
    if (thrs == NULL)
    {
        perror( "malloc" );
        return -1;
    }

    printf( "Starting %d threads...\n", procs );
    // One thread per CPU; pass thread_routine2 instead to time the mutex version.
    for (i = 0; i < procs; i++)
    {
        if (pthread_create( &thrs[i], NULL, thread_routine,
                            (void *)(long)i ))
        {
            perror( "pthread_create" );
            procs = i;
            break;
        }
    }

    for (i = 0; i < procs; i++)
        pthread_join( thrs[i], NULL );

    free( thrs );
    printf( "After doing all the math, global_int value is: %d\n", global_int );
    printf( "Expected value is: %d\n", INC_TO * procs );
    return 0;
}

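Here are my test results. The first run uses the plain global_int++ with no protection, the second uses __sync_fetch_and_add, and the third uses the pthread mutex: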

Starting 4 threads...
proc_num :2,no locker cost 27049544 CPU cycle,cost 12712 us
proc_num :0,no locker cost 27506750 CPU cycle,cost 12120 us
proc_num :1,no locker cost 28499000 CPU cycle,cost 13365 us
proc_num :3,no locker cost 27193093 CPU cycle,cost 12780 us
After doing all the math, global_int value is: 1169911
Expected value is: 4000000

Starting 4 threads...
proc_num :2,__sync_fetch_and_add cost 156602056 CPU cycle,cost 73603 us
proc_num :1,__sync_fetch_and_add cost 158414764 CPU cycle,cost 74456 us
proc_num :3,__sync_fetch_and_add cost 159065888 CPU cycle,cost 74763 us
proc_num :0,__sync_fetch_and_add cost 162621399 CPU cycle,cost 76426 us
After doing all the math, global_int value is: 4000000
Expected value is: 4000000

Starting 4 threads...
proc_num :1,pthread lock cost 992586450 CPU cycle,cost 466518 us
proc_num :3,pthread lock cost 1008482114 CPU cycle,cost 473998 us
proc_num :0,pthread lock cost 1018798886 CPU cycle,cost 478840 us
proc_num :2,pthread lock cost 1019083986 CPU cycle,cost 478980 us
After doing all the math, global_int value is: 4000000
Expected value is: 4000000

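1 Without a lock, the result is wrong.

The test output shows that the correct result is 4,000,000, but the actual value was 1169911.

2 Both the thread lock and the atomic increment return the correct result.

3 In terms of performance, __sync_fetch_and_add crushes the thread lock.

Judging from the test results, __sync_fetch_and_add is 6 to 7 times faster than the thread lock.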

References:

1 Mainly Alexander Sundler's blog post

2 Professional Assembly Language, by Richard Blum.


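Heartwork 2011-12-04 13:59:48

Bean_lee: You are already very good, my friend. Your comments gave me a lot of guidance and inspiration, and pushed me to keep digging deeper. I owe you my thanks. .....

No need to thank me. I have learned a great deal from your blog too; let's keep encouraging each other.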

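Bean_lee 2011-12-04 13:12:59

Heartwork: I found a spinlock implementation and read through it. It also relies on locking the bus to guarantee atomicity, and on top of that uses a variable to hold its state. So a spinlock cannot possibly be faster than __sync_fetch_and_add.....

You are already very good, my friend. Your comments gave me a lot of guidance and inspiration, and pushed me to keep digging deeper. I owe you my thanks.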

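Heartwork 2011-12-04 12:56:53

Bean_lee: In my tests, __sync_fetch_and_add was the fastest; I ran them today.

Spin lock
root@libin:~/program/C/thread/atom_counter# ./test
Starting 4 threads......

I found a spinlock implementation and read through it. It also relies on locking the bus to guarantee atomicity, and on top of that uses a variable to hold its state. So a spinlock cannot possibly be faster than __sync_fetch_and_add……

Trusting assumptions instead of measuring really does get you in trouble!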

Bean_lee 2011-12-03 17:01:19

Heartwork: Locking the bus is far too expensive. For an operation this fine-grained you can use a spin lock; the pthread library has an implementation. .....

In my tests, __sync_fetch_and_add was the fastest; I ran them today.

Spin lock
root@libin:~/program/C/thread/atom_counter# ./test
Starting 4 threads...
proc_num :1,no locker  cost 10373366371 CPU cycle,cost 4875382 us
proc_num :3,no locker  cost 10406853947 CPU cycle,cost 4891129 us
proc_num :0,no locker  cost 10407395129 CPU cyc

Heartwork 2011-11-28 23:11:33

Locking the bus is far too expensive. For an operation this fine-grained you can use a spin lock; the pthread library has an implementation.
