分类: C/C++
2010-06-21 17:27:59
这个其实就是我们最早开始看TBB时的一个例子。
我们看到这里面多了好几个陌生的东西:
OK,我们一个个来看,先说blocked_range
#ifndef __TBB_blocked_range_H
#define __TBB_blocked_range_H
#include "tbb_stddef.h"
namespace tbb {
/** \page range_req Requirements on range concept
Class \c R implementing the concept of range must define:
- \code R::R( const R& ); \endcode Copy constructor
- \code R::~R(); \endcode Destructor
- \code bool R::is_divisible() const; \endcode True if range can be partitioned into two subranges
- \code bool R::empty() const; \endcode True if range is empty
- \code R::R( R& r, split ); \endcode Split range \c r into two subranges.
**/
//! A range over which to iterate.
/** @ingroup algorithms */
template
class blocked_range {
public:
//! Type of a value
/** Called a const_iterator for sake of algorithms that need to treat a blocked_range
as an STL container. */
typedef Value const_iterator;
//! Type for size of a range
typedef std::size_t size_type;
//! Construct range with default-constructed values for begin and end.
/** Requires that Value have a default constructor. */
blocked_range() : my_begin(), my_end() {}
//! Construct range over half-open interval [begin,end), with the given grainsize.
blocked_range( Value begin_, Value end_, size_type grainsize_=1 ) :
my_end(end_), my_begin(begin_), my_grainsize(grainsize_)
{
__TBB_ASSERT( my_grainsize>0, "grainsize must be positive" );
}
//! Beginning of range.
const_iterator begin() const {return my_begin;}
//! One past last value in range.
const_iterator end() const {return my_end;}
//! Size of the range
/** Unspecified if end()
__TBB_ASSERT( !(end()
}
//! The grain size for this range.
size_type grainsize() const {return my_grainsize;}
//------------------------------------------------------------------------
// Methods that implement Range concept
//------------------------------------------------------------------------
//! True if range is empty. //! True if range is divisible. //! Split range. private: //! Auxilary function used by forking constructor. template template } // namespace tbb #endif /* __TBB_blocked_range_H */ 从代码里可以看到blocked_range有3个constructor,一个不接收参数,一个处理split(split的概念后面再讲),而我们示例里用到的是: blocked_range( Value begin_, Value end_, size_type grainsize_=1 ) : 第三个参数,grainsize,表示的是一个“合适的大小”块,这个块会在一个循环中进行处理,如果数组比这个grainsize还大,parallel_for会把它分割为独立的block,然后分别进行调度(有可能由多个线程进行处理)。 这样我们知道,grainsize其实决定了TBB什么时候对数据进行划分,如果我们把grainsize指定得太小,那就可能会导致产生过多得block,从而使得不同block间的overhead增加(比如多个线程间切换的代价),有可能会使性能下降。相反,如果grainsize设得太大,以致于这个数组几乎没有被划分,那又会导致不能发挥parallel_for期望达到的并行效果,也没有达到理想得性能。 所以我们在决定grainsize时需要小心,最好是能够经过调整测试后得到的值,当然你也可以如本例中一样不指定,让TBB帮你来决定合适的值(一般不是最优的)。 一个调整grainsize的经验性步骤: 1首先把grainsize设得比预想的要大一些,通常设为10000 partitioner TBB里提供了两个partitioner,一个是我们用到的simple_partitioner,另一个是auto_partitioner。 simple_partitioner是parallel_for(以及后面会讲到的parallel_reduce,parallel_scan)的缺省partitioner。simple_partitioner有如下特性: 1 概念简单 另一个partitioner:auto_partitioner,它依赖于一定的规则自动决定划分,在线程间负载均衡和线程切换代价间寻找一个平衡,当然普适的一般就不是对于所有都是最好的~~~ 如果我们想用auto_partitioner,那只要把示例里的simple_partitioner替换一下即可,要注意的是,既然auto_partitioner是自动决定分割的,那指定的grainsize就没有太大意义了。 一般情况下,建议使用auto_partitioner,除非你有足够的时间和精力去优化出一个比较好的grainsize~~~ 最后,终于看到我们最关键的主题了:parallel_for,这是一个算法,类似于STL里的sort、for_each等。 直接步入主题,我们来看parallel_for的源码吧,在tbb/parallel_for.h中: #include "task.h" namespace tbb { //! @cond INTERNAL //! Task type used in parallel_for //! Constructor for root task. template /** \page parallel_for_body_req Requirements on parallel_for body /** \name parallel_for //! Parallel iteration over range with simple partitioner, or default partitioner if no partitioner is specified. //! Parallel iteration over range with auto_partitioner. //! Parallel iteration over range with affinity_partitioner. #if __TBB_EXCEPTIONS //! Parallel iteration over range with auto_partitioner and user-supplied context. //! Parallel iteration over range with affinity_partitioner and user-supplied context. } // namespace tbb #endif /* __TBB_parallel_for_H */ 我们看到,parallel_for有两个版本,一个接收两个参数,一个接收三个参数: range:指定划分block的范围
bool empty() const {return !(my_begin
/** Unspecified if end()
/** The new Range *this has the second half, the old range r has the first half.
Unspecified if end()
my_end(r.my_end),
my_begin(do_split(r)),
my_grainsize(r.my_grainsize)
{}
/** NOTE: my_end MUST be declared before my_begin, otherwise the forking constructor will break. */
Value my_end;
Value my_begin;
size_type my_grainsize;
/** Using this function lets us not require that Value support assignment or default construction. */
static Value do_split( blocked_range& r ) {
__TBB_ASSERT( r.is_divisible(), "cannot split blocked_range that is not divisible" );
Value middle = r.my_begin + (r.my_end-r.my_begin)/2u;
r.my_end = middle;
return middle;
}
friend class blocked_range2d;
friend class blocked_range3d;
};
my_end(end_), my_begin(begin_), my_grainsize(grainsize_)
{
__TBB_ASSERT( my_grainsize>0, "grainsize must be positive" );
}
第一个参数表示起始,第二个参数表示结束,它们的类型为const_iterator,表示的区间为[begin,end)这样一个半开区间。
2在单处理机机器上运行,得到性能数据
3把grainsize减半,看性能降低多少,如果降低在5%-10%之间,那这个grainsize就已经是一个不错的设定了
看完blocked_range,再来看跟它很关联的另一个概念partitioner,顾名思义,partitioner就是指示怎么进行划分block的东东。在示例的parallel_for调用中,第3个参数就指定了一个partitioner,这里我们使用的是simple_partitioner。
2 确保分割不会超过grainsize大小,这样你可以假定operator()的最大范围不会超过grainsize
3它可以针对特定机器调节
simple_partitioner的缺点在于它需要你确定出一个合适的grainsize,而合适的grainsize并不是那么容易得到的。parallel_for
#ifndef __TBB_parallel_for_H
#define __TBB_parallel_for_H
#include "partitioner.h"
#include
namespace internal {
/** @ingroup algorithms */
template
class start_for: public task {
Range my_range;
const Body my_body;
typename Partitioner::partition_type my_partition;
/*override*/ task* execute();
start_for( const Range& range, const Body& body, Partitioner& partitioner ) :
my_range(range),
my_body(body),
my_partition(partitioner)
{
}
//! Splitting constructor used to generate children.
/** this becomes left child. Newly constructed object is right child. */
start_for( start_for& parent, split ) :
my_range(parent.my_range,split()),
my_body(parent.my_body),
my_partition(parent.my_partition,split())
{
my_partition.set_affinity(*this);
}
//! Update affinity info, if any.
/*override*/ void note_affinity( affinity_id id ) {
my_partition.note_affinity( id );
}
public:
static void run( const Range& range, const Body& body, const Partitioner& partitioner ) {
if( !range.empty() ) {
#if !__TBB_EXCEPTIONS || TBB_JOIN_OUTER_TASK_GROUP
start_for& a = *new(task::allocate_root()) start_for(range,body,const_cast
#else
// Bound context prevents exceptions from body to affect nesting or sibling algorithms,
// and allows users to handle exceptions safely by wrapping parallel_for in the try-block.
task_group_context context;
start_for& a = *new(task::allocate_root(context)) start_for(range,body,const_cast
#endif /* __TBB_EXCEPTIONS && !TBB_JOIN_OUTER_TASK_GROUP */
task::spawn_root_and_wait(a);
}
}
#if __TBB_EXCEPTIONS
static void run( const Range& range, const Body& body, const Partitioner& partitioner, task_group_context& context ) {
if( !range.empty() ) {
start_for& a = *new(task::allocate_root(context)) start_for(range,body,const_cast
task::spawn_root_and_wait(a);
}
}
#endif /* __TBB_EXCEPTIONS */
};
task* start_for
if( !my_range.is_divisible() || my_partition.should_execute_range(*this) ) {
my_body( my_range );
return my_partition.continue_after_execute_range(*this);
} else {
empty_task& c = *new( this->allocate_continuation() ) empty_task;
recycle_as_child_of(c);
c.set_ref_count(2);
bool delay = my_partition.decide_whether_to_delay();
start_for& b = *new( c.allocate_child() ) start_for(*this,split());
my_partition.spawn_or_delay(delay,*this,b);
return this;
}
}
} // namespace internal
//! @endcond
// Requirements on Range concept are documented in blocked_range.h
Class \c Body implementing the concept of parallel_for body must define:
- \code Body::Body( const Body& ); \endcode Copy constructor
- \code Body::~Body(); \endcode Destructor
- \code void Body::operator()( Range& r ) const; \endcode Function call operator applying the body to range \c r.
**/
See also requirements on \ref range_req "Range" and \ref parallel_for_body_req "parallel_for Body". **/
//@{
/** @ingroup algorithms **/
template
void parallel_for( const Range& range, const Body& body, const simple_partitioner& partitioner=simple_partitioner() ) {
internal::start_for
}
/** @ingroup algorithms **/
template
void parallel_for( const Range& range, const Body& body, const auto_partitioner& partitioner ) {
internal::start_for
}
/** @ingroup algorithms **/
template
void parallel_for( const Range& range, const Body& body, affinity_partitioner& partitioner ) {
internal::start_for
}
//! Parallel iteration over range with simple partitioner and user-supplied context.
/** @ingroup algorithms **/
template
void parallel_for( const Range& range, const Body& body, const simple_partitioner& partitioner, task_group_context& context ) {
internal::start_for
}
/** @ingroup algorithms **/
template
void parallel_for( const Range& range, const Body& body, const auto_partitioner& partitioner, task_group_context& context ) {
internal::start_for
}
/** @ingroup algorithms **/
template
void parallel_for( const Range& range, const Body& body, affinity_partitioner& partitioner, task_group_context& context ) {
internal::start_for
}
#endif /* __TBB_EXCEPTIONS */
//@}
body:指定对block应用的操作,Body可以看成是一个操作子functor,它的operator(...)会以blocked_range为参数进行调用,当然如果我们传过来的是一个函数指针也是可以的,只要它能以blocked_range为参数进行调用
partitioner:指定划分器,可选的两种simple_partitioner和auto_partitioner
其实从parallel_for的prototype declaration和definition中我们可以明显地看到generic programming的意思,这里Range、Body、Partitioner其实都是GP里的concept,它们要求满足一定的policy,因此是典型的基于policy的design,当然这里的policy比起STL,有过之而无不及了。