cacheline_aligned-robert_ch-ChinaUnix博客

Robert

首页　| 　博文目录　| 　关于我

robert_ch

博客访问： 508330
博文数量： 144
博客积分： 0
博客等级：民兵
技术积分： 508
用户组：普通用户
注册时间： 2014-09-10 13:18

个人简介

Keep looking Donot settle

文章分类

全部博文（144）

算法-协议（3）
Python（3）
OpenStack（8）
WEBServer（8）
classic（19）
SNMP（4）
wlan_driver（1）
Tools1（7）
Memory（1）
ethernet_driver（13）
other（16）
syslog（15）
kernel（20）
IPTABLES（22）
网络套接字（4）

raw（0）
未分配的博文（0）

文章存档

2019年（1）

2016年（31）

2015年（51）

2014年（61）

我的朋友

相关博文

cacheline_aligned

分类： LINUX

2014-10-28 14:28:20

原文地址：cacheline_aligned 作者：sigsegv11

刚好找到一台可用的8 core cpu，测试多线程版本的使用的线程均为8个。

考虑一个简单的并发计数器程序（仅仅用于演示，很简单的问题）

single threaded:

#include <stdio.h> static unsigned long counter; #define LOOP_TOTAL (1UL << 31) int main(int argc, char* argv[]) { unsigned long i; for (i = 0; i < LOOP_TOTAL; i++) { ++counter; } printf("counter = %lu\n", counter); return 0; }

$ time ./counter-st
counter = 2147483648

real    0m2.245s
user    0m2.244s
sys     0m0.000s

如果采用多线程？怎么来做？

try 1: 使用互斥变量

pthread_mutex_lock(&counter_lock); ++counter; pthread_mutex_unlock(&counter_lock);

pthread_mutex_lock() / pthread_mutex_unlock() 开销比较大，尤其当core很多的时候，绝大多数时间都花在了临界区。互斥访问(mutual exclusion)是非常昂贵的，详情可以参见Lamport's bakery algorithm(derived from Peterson Lock)，其复杂度正比于线程的数量。

$ time ./counter-mt1
counter = 2147483648

real    4m59.876s
user    3m8.431s
sys     34m58.903s

try 2: 使用原子操作

atomic_inc(&counter);

没有了软件实现的临界区保护，保护机制有硬件实现，很大程度上减少了互斥访问开销，但硬件互斥访问也不是“免费”的。

$ time ./counter-mt2
counter = 2147483648

real    1m2.040s
user    7m19.202s
sys     0m0.030s

try 3: 每个线程使用独立的counter，total_counter =
foreach local_counter in thread_local_counters:
    total_counter <- total_counter + local_counter;

这时候完全避免了需要使用互斥/锁的使用，因此效率进一步得到提高：

$ time ./counter-mt3
counter = 2147483648

real    0m0.300s
user    0m2.241s
sys     0m0.001s

try 4: 每个线程使用独立的counter，但CPU的cache line通常为32/64 bytes, 因此cache访问数据都是基于cacheline，而不是字节或是WORD。多个处理器共享一个cache line毫无意义(false sharing)，而且会降低性能，因此我们应该考虑将counter对其到一个cache line。

#define ____cacheline_aligned __attribute__((__aligned__(64))) #define ____cacheline_aligned_in_smp __attribute__((__aligned__(64)))

struct per_cpu_data { unsigned long counter; } ____cacheline_aligned_in_smp;

由于消除了false cache line sharing，效率得以进一步提高（测试比较弱智，显示不出来了）。
（如果观察Linux内核的代码，Linux都用了这种机制用于消除cache false sharing）

$ time ./counter-mt4
counter = 2147483648

real    0m0.287s
user    0m2.244s
sys     0m0.000s

阅读(1543) | 评论(0) | 转发(0) |

上一篇：关于SIGCHLD信号的处理

下一篇：数据包的发送流程——基于RTL8139网卡(2)netif_tx_stop_queue之__QUEUE_STATE_XOFF

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6