2008-04-14 16:25:44

The most commonly used low-level classes in graphics are the vector (Vector) and matrix (Matrix) classes. Many implementations already exist, including inline function libraries written in assembly, but are they optimal? The matrix class introduced below uses SIMD instruction-set optimization to achieve data alignment and parallel processing, greatly speeding up matrix operations: it is two to three times as fast as the class in Microsoft's d3dmatrix.h, and more than twice as fast as a class written with inline assembly techniques.

Introduction

This article is about one of the most common elements in 3D graphics: 3D transformation matrices and vectors. As we all know, 3D graphics uses [x,y,z] or [x,y,z,w] vectors to represent points in three-dimensional space or in homogeneous space, respectively. You can manipulate these vectors by multiplying them with 4x4 transformation matrices. The standard manipulation matrices are translation, rotation and scaling.

For example, to rotate a vector by 90° around the X axis and then move it 20 units toward the positive infinity of the Y axis, create a rotation matrix and a translation matrix, and multiply the vector by both of them. It is simpler and faster to multiply the two matrices in advance; later, you can multiply any vector by the product when needed.
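The composition above can be sketched with plain scalar code (hypothetical Mat4, RotationX, Translation and Mul helpers for illustration, not the library described later). Note the row-vector convention, where a point is transformed as v' = v * R * T:

```cpp
#include <cmath>

struct Mat4 { float m[4][4]; };

// Rotation about the X axis (row-vector convention).
Mat4 RotationX(float angle) {
    float c = std::cos(angle), s = std::sin(angle);
    Mat4 r = {{{1,0,0,0}, {0,c,s,0}, {0,-s,c,0}, {0,0,0,1}}};
    return r;
}

// Translation by (x, y, z); the offsets sit in the fourth row.
Mat4 Translation(float x, float y, float z) {
    Mat4 t = {{{1,0,0,0}, {0,1,0,0}, {0,0,1,0}, {x,y,z,1}}};
    return t;
}

// Standard scalar matrix product: 64 multiplications, 48 additions.
Mat4 Mul(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}
```

Computing `Mul(RotationX(pi/2), Translation(0, 20, 0))` once yields a single matrix that both rotates and translates every vector multiplied by it.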

However, standard multiplication of one matrix by another takes 64 products and 48 sums. A standard multiplication of a vector by a matrix takes 16 products and 12 sums.

This article presents a better way to manipulate vectors and matrices using Intel’s Streaming SIMD Extensions, to enable the calculation of 4 products or 4 additions at once. The SIMD register can hold 4 floating-point numbers with single precision. Hence one register can hold a full vector, and four registers can hold a full 4x4 matrix.
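A minimal sketch of the packing described above, using raw SSE intrinsics: one `__m128` register holds a whole [x,y,z,w] vector, and a single SSE operation processes all four lanes at once.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (__m128)

// Pack four floats into one register, operate on all lanes at once,
// and store the result back to memory.
void addDemo(float out[4]) {
    __m128 v = _mm_setr_ps(1.f, 2.f, 3.f, 4.f); // lanes: x=1, y=2, z=3, w=4
    __m128 s = _mm_add_ps(v, v);                // four additions in one op
    _mm_storeu_ps(out, s);                      // out = {2, 4, 6, 8}
}
```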

The Library

Using the Streaming SIMD Extensions, you can complete a matrix multiplication with only 16 products and 12 additions. The library provided in this article was written with the goal to get the most out of the Streaming SIMD Extensions, and to reduce the amount of time needed for matrix/vector operations. Most of the library is written in C++, except for one small section in assembly. The library's functions are about twice as fast as the equivalent scalar versions of those functions. The library includes three classes:

The GPMatrix class

The GPMatrix class is a 4x4 matrix of floats.

 

class GPMatrix {
    union {
       struct {
           __m128 _L1, _L2, _L3, _L4;
       };
       struct {
           float _11, _12, _13, _14;
           float _21, _22, _23, _24;
           float _31, _32, _33, _34;
           float _41, _42, _43, _44;
       };
    };
    // ...
};
Figure 1. The GPMatrix Class

The class contains 16 float elements (_11 to _44). The elements are arranged in four rows, where each row is represented as one SIMD variable (_L1 to _L4).
Data elements can be referenced by their row and column: Mat[0][1] or Mat(0,1).
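The two access styles might be exposed like this (a sketch with an assumed member layout, not the library's actual code): `operator[]` returns a pointer to a row, so `Mat[r][c]` chains naturally, while `operator()` indexes by (row, column) directly.

```cpp
// Sketch of row/column element access for a 4x4 float matrix.
class GPMatrixSketch {
    float m[4][4];
public:
    float*       operator[](int row)       { return m[row]; }  // Mat[r][c]
    const float* operator[](int row) const { return m[row]; }
    float&       operator()(int r, int c)  { return m[r][c]; } // Mat(r,c)
};
```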

There is one limitation on the use of this class:
The class must be 16-byte aligned, so that it does not split across too many cache lines, and to enable faster reading and writing of the class's elements. However, the class is aligned automatically.
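In modern C++ the 16-byte requirement can be expressed with `alignas` (a sketch; code of this era would have used a compiler-specific attribute such as `__declspec(align(16))` on MSVC or `__attribute__((aligned(16)))` on GCC):

```cpp
#include <cstdint>

// Every instance starts on a 16-byte boundary, so each of the four
// rows maps cleanly onto an aligned SSE load/store.
struct alignas(16) AlignedMat {
    float m[4][4];   // 64 bytes total
};
```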

The GPVector class

The GPVector class is a vector of 4 floats.

 

class GPVector {
    union {
        __m128 vec;
        struct {
            float x,y,z,w;
        };
    };
    // ...
};
Figure 2. The GPVector Class

The GPVector class holds the x, y, z and w elements of the vector as floats, also represented as one SIMD variable. As with the GPMatrix class, the GPVector class must be 16-byte aligned.

A variant of the GPVector class is the GPVector3 class, which does not have the w element. This class holds "pure" 3D vectors. However, for alignment and for other reasons, the w element, which is not used, is replaced with a spacer.
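The spacer idea might look like this (an assumed layout for illustration, not the library's exact code): the fourth SSE lane is kept but unused, so the type still fills 16 bytes and fits in one `__m128`.

```cpp
#include <xmmintrin.h>

// A "pure" 3D vector padded to a full SIMD slot.
class Vec3Padded {
public:
    union {
        __m128 vec;                            // whole 16-byte register
        struct { float x, y, z, spacer; };     // spacer occupies w's lane
    };
    Vec3Padded(float X, float Y, float Z) {
        x = X; y = Y; z = Z; spacer = 0.f;     // keep the unused lane defined
    }
};
```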

Constructors & Operators

Operators on GPMatrix:

A * B   matrix multiplication
A ± B   matrix addition/subtraction
±A      matrix unary minus/plus
A * s   matrix multiplication by a scalar
A *= B  in-place multiplication by a matrix
A *= s  in-place multiplication by a scalar
A ±= B  in-place addition/subtraction of a matrix
        matrix transpose
        matrix inverse
        matrix determinant
        matrix min/max element

GPMatrix Constructors:

  Identity matrix
  Zero matrix
  Rotation matrices (around the X axis, Y axis and Z axis)
  Translation matrices
  Scaling matrices

Operators on GPVector:

v * M   vector multiplication by a matrix
v * s   vector multiplication by a scalar
v * w   dot (inner) product
v % w   cross product (in 3D)
v ± w   vector addition/subtraction
±v      vector unary minus/plus
~v      vector normalization
v *= M  in-place multiplication by a matrix
v *= s  in-place multiplication by a scalar
v ±= w  in-place addition/subtraction of a vector
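A minimal scalar sketch (not the SIMD library itself) of the less conventional operator choices above: `*` between two vectors as dot product, `%` as cross product, and unary `~` as normalization.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// '*' between vectors: dot (inner) product.
float operator*(const Vec3& a, const Vec3& b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

// '%' between vectors: 3D cross product.
Vec3 operator%(const Vec3& a, const Vec3& b) {
    return { a.y*b.z - a.z*b.y,
             a.z*b.x - a.x*b.z,
             a.x*b.y - a.y*b.x };
}

// unary '~': normalization to unit length.
Vec3 operator~(const Vec3& a) {
    float len = std::sqrt(a * a);
    return { a.x/len, a.y/len, a.z/len };
}
```

Overloading `%` and `~` this way is a taste decision; it keeps expressions like `~(u % v)` (unit normal of a plane) compact at the cost of unfamiliar syntax.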

 

Examples

This section provides some examples of Streaming SIMD Extensions usage, compared with equivalent scalar code.

Using SIMD code

First, let's take a look at one of the most obvious and fundamental operators in the library: multiplying a vector by a matrix.

Scalar Code

The scalar and standard way is to calculate each element of the destination vector by multiplying the source vector with the appropriate column of the matrix. The computation takes 16 multiplications and 12 sums.
 

void scalarVectorMult (const GPVector &Vec, const GPMatrix &Mat, GPVector &Res) {
    Res.x = Vec.x*Mat._11 + Vec.y*Mat._21 + Vec.z*Mat._31 + Vec.w*Mat._41;
    Res.y = Vec.x*Mat._12 + Vec.y*Mat._22 + Vec.z*Mat._32 + Vec.w*Mat._42;
    Res.z = Vec.x*Mat._13 + Vec.y*Mat._23 + Vec.z*Mat._33 + Vec.w*Mat._43;
    Res.w = Vec.x*Mat._14 + Vec.y*Mat._24 + Vec.z*Mat._34 + Vec.w*Mat._44;
}
Figure 3. Vector Multiplication – Scalar code


SIMD Code

void VectorMult(const GPVector &Vec, const GPMatrix &Mat, GPVector &Res) {
    F32vec4 Result;
    Result = F32vec4(Vec.x) * Mat._L1;
    Result += F32vec4(Vec.y) * Mat._L2;
    Result += F32vec4(Vec.z) * Mat._L3;
    Result += F32vec4(Vec.w) * Mat._L4;
    Res = Result;
}
Figure 4. Vector Multiplication – SIMDified code

If we take a closer look at the scalar multiplication process, we can see that we can calculate the whole vector at once: in the scalar code, Vec.x is multiplied with the first four elements of the matrix. Those four elements form the first line of the matrix, and are already placed in one SIMD variable. So we only need to expand the X element of the vector and multiply it with the first line of the matrix. This is done in the first assignment (third line) of the SIMD code. Next we multiply the expanded Y, Z and W elements of the vector with the second, third and fourth lines of the matrix, respectively, just as for the first element. Note that the X element of the result vector is the first element in the sum of the four vectors calculated before, the Y element is the second, and so on. Therefore, the final result is just the sum of all four vectors, and you don't even need to rearrange the results.
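In raw SSE intrinsics, the same splat-multiply-accumulate pattern looks like this (a sketch of what the F32vec4 wrappers roughly expand to; `vecMatMul` is an assumed name). `F32vec4(Vec.x)` corresponds to a shuffle that broadcasts one lane across the register:

```cpp
#include <xmmintrin.h>

// Multiply a row vector v by a 4x4 matrix stored as four row registers.
__m128 vecMatMul(__m128 v, const __m128 row[4]) {
    __m128 x = _mm_shuffle_ps(v, v, 0x00);  // splat lane 0 (x) to all lanes
    __m128 y = _mm_shuffle_ps(v, v, 0x55);  // splat lane 1 (y)
    __m128 z = _mm_shuffle_ps(v, v, 0xAA);  // splat lane 2 (z)
    __m128 w = _mm_shuffle_ps(v, v, 0xFF);  // splat lane 3 (w)
    __m128 r = _mm_mul_ps(x, row[0]);       // x * line 1
    r = _mm_add_ps(r, _mm_mul_ps(y, row[1]));
    r = _mm_add_ps(r, _mm_mul_ps(z, row[2]));
    r = _mm_add_ps(r, _mm_mul_ps(w, row[3]));
    return r;                               // [x', y', z', w'] already in order
}
```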

This equivalent SIMD code takes only 4 multiplications and 3 sums. Since one SIMD calculation takes about half the time of four scalar calculations, the SIMD code runs more than twice as fast as the scalar code.

Some Numbers

This table shows the performance gain of using the Matrix Library, compared to Microsoft's D3DXMATRIX class from D3DXMath.h in DirectX 7, which implements the same functions the scalar way.

Please note that this is a synthetic test; these ratios may change when measured in a real application.
 

Function                 Scalar Code (cycles)    Matrix Library (cycles)    Ratio
Vector Multiplication          60 I                      26                 2.30
Matrix Multiplication         282 I                      87                 3.26
Inverse Matrix                328                       170                 1.92
Make Rotation Matrix          169                       143 II              1.18

I  The scalar version of the function is not inlined, so accurate numbers should be ~10 cycles less.
II There is another version of this function that uses a fast sine approximation, and takes only 112 cycles.

And Finally – A Real Example

The last example presents code for calculating the exponent of a matrix. The exponent of a real number can be calculated using its Taylor series; the exponent of a matrix is defined in the same way:

e^M = I + M + M^2/2! + M^3/3! + ...
There are three versions for calculating the series:

  1. Using scalar code.
  2. Using the GPMatrix class.
  3. All the functions as inlined assembly (with no cross optimizations between the functions).

The first and second versions are written using the classes' operators; I wrote them in about ten minutes.
The third version, written in inlined assembly, took much more time to write, about an hour or two. In addition, this version is not fully optimized: the optimized functions are just placed end to end, without cross-procedural optimizations.

The first two versions are quite readable and can be modified easily. The third version is written in assembly, and is therefore much harder to modify. See the source file Exponent.cpp in the accompanying sources.

The next table shows the average time an iteration takes for each version:

Version             Average Time (cycles)
Scalar code                  371
GPMatrix code                130
Inlined assembly             144

Note that using GPMatrix instead of scalar code gives an improvement of 2.85x! Even if the scalar functions were fully inlined, the GPMatrix code would still be more than twice as fast.

The GPMatrix version even beats the assembly version, because the compiler did a better job of optimizing the inlined functions. However, even fully optimized assembly code (that is, with hand optimizations across function boundaries) would not give much better results.

Conclusion

The source files of the library and of the last example are available for download.

This library demonstrates how a performance library can be written with hardly any assembly. It is usually assumed that well-written assembly is more efficient than C/C++; in this case, however, writing the library functions without inlined assembly allows the compiler to perform inter-procedural optimizations on your code.

Links

Calculating a rotation matrix using a fast approximation is done with the sine function from the Approximate Math (AM) library; download it from http://developer.intel.com/design/pentiumiii/devtools/.

Other Resources
Two related application notes:
AP-928 - Inverse of 4x4 Matrix
AP-930 - Matrix Multiplication

Some Useful Links:

Download an evaluation copy of Intel C/C++ Compiler at http://developer.intel.com/vtune/compilers/cpp/demo.htm.

The VTune Analyzer can be used to measure the time consumed by functions, and more. Download an evaluation copy at http://developer.intel.com/vtune/.

Haim Barad has a Ph.D. in Electrical Engineering (1987) from the University of Southern California. His areas of concentration are in 3D graphics, video and image processing. Haim was on the Electrical Engineering faculty at Tulane University before joining Intel in 1995. Haim is a staff engineer and currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.
