
Category: LINUX

2010-09-07 14:06:23

Example

For example, consider the 2×4 matrix:

    0 1 2 3
    4 5 6 7

In row-major order, this is stored in computer memory as the sequence (0,1,2,3,4,5,6,7), i.e. the two rows stored consecutively. If we transpose this, we obtain the 4×2 matrix:

    0 4
    1 5
    2 6
    3 7

which is stored in computer memory as the sequence (0,4,1,5,2,6,3,7).

If we number the storage locations 0 to 7, from left to right, then this permutation consists of four cycles:

(0), (1 2 4), (3 6 5), (7)

That is, the value in position 0 stays in position 0 (a cycle of length 1, requiring no data motion). The value in position 1 of the original storage moves to position 2 of the transposed storage; the value in position 2 moves to position 4; and the value in position 4 moves back to position 1, closing the cycle. Similarly, position 7 is fixed, and positions 3, 6, and 5 form the remaining 3-cycle.
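The cycle decomposition above can be checked with a short script. This is an illustrative sketch (not from the cited references); it uses the Cate & Twigg formula for the transposition permutation, P(a) = Na mod (MN−1), which is derived in the next section:

```python
# Verify the cycle decomposition (0)(1 2 4)(3 6 5)(7) for the 2x4 example.
N, M = 2, 4  # N rows, M columns, stored in row-major order

def transpose_perm(a, N, M):
    """Destination index of storage location a under transposition
    (Cate & Twigg formula; the last element is a special case)."""
    if a == M * N - 1:
        return a
    return (N * a) % (M * N - 1)

# Recover the cycles by following each unvisited index around its orbit.
seen, cycles = set(), []
for start in range(M * N):
    if start in seen:
        continue
    cycle, a = [], start
    while a not in cycle:
        cycle.append(a)
        seen.add(a)
        a = transpose_perm(a, N, M)
    cycles.append(tuple(cycle))

print(cycles)  # [(0,), (1, 2, 4), (3, 6, 5), (7,)]
```

Iterating `start` in ascending order guarantees that each cycle is recorded beginning at its smallest element.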

Properties of the permutation

In the following, we assume that the N×M matrix is stored in row-major order with zero-based indices. This means that the (n,m) element, for n = 0,…,N−1 and m = 0,…,M−1, is stored at an address a = Mn + m (plus some offset in memory, which we ignore). In the transposed M×N matrix, the corresponding (m,n) element is stored at the address a' = Nm + n, again in row-major order. We define the transposition permutation to be the function a' = P(a) such that:

    Nm + n = P(Mn + m)

for all n = 0,…,N−1 and m = 0,…,M−1.

This defines a permutation on the numbers a = 0, 1, …, MN − 1.

It turns out that one can define simple formulas for P and its inverse (Cate & Twigg, 1977). First:

    P(a) = Na mod (MN − 1)    for 0 ≤ a < MN − 1,
    P(MN − 1) = MN − 1,

where "mod" is the modulo operation. Proof: if 0 ≤ a = Mn + m < MN − 1, then Na mod (MN − 1) = (MNn + Nm) mod (MN − 1) = n + Nm. [Note that MNx mod (MN − 1) = ((MN − 1)x + x) mod (MN − 1) = x for 0 ≤ x < MN − 1.] Note that the first (a = 0) and last (a = MN − 1) elements are always left invariant under transposition. Second, the inverse permutation is given by:

    P⁻¹(a') = Ma' mod (MN − 1)    for 0 ≤ a' < MN − 1,
    P⁻¹(MN − 1) = MN − 1.

(This is just a consequence of the fact that the inverse of an N×M transpose is an M×N transpose, although it is also easy to show explicitly that P−1 composed with P gives the identity.)
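Both formulas are easy to check numerically. A minimal sketch (the helper names `P` and `P_inv` are ours, for illustration):

```python
def P(a, N, M):
    """Map storage index a of the NxM row-major array to its index
    in the transposed MxN row-major array (Cate & Twigg, 1977)."""
    MN = M * N
    return a if a == MN - 1 else (N * a) % (MN - 1)

def P_inv(a, N, M):
    """Inverse permutation: the same formula with M in place of N."""
    MN = M * N
    return a if a == MN - 1 else (M * a) % (MN - 1)

N, M = 3, 5
# P agrees with the definition a' = Nm + n for a = Mn + m ...
for n in range(N):
    for m in range(M):
        assert P(M * n + m, N, M) == N * m + n
# ... and P_inv composed with P is the identity.
for a in range(M * N):
    assert P_inv(P(a, N, M), N, M) == a
print("ok")
```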

As proved by Cate & Twigg (1977), the number of fixed points (cycles of length 1) of the permutation is precisely 1 + gcd(N−1, M−1), where gcd is the greatest common divisor. For example, with N = M the number of fixed points is simply N (the diagonal of the matrix). If N − 1 and M − 1 are coprime, on the other hand, the only two fixed points are the upper-left and lower-right corners of the matrix.
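The fixed-point count can be confirmed by brute force against the formula (an illustrative sketch; `count_fixed_points` is our own helper name):

```python
from math import gcd

def count_fixed_points(N, M):
    """Brute-force count of the fixed points of the transposition
    permutation for an NxM row-major matrix."""
    MN = M * N
    P = lambda a: a if a == MN - 1 else (N * a) % (MN - 1)
    return sum(1 for a in range(MN) if P(a) == a)

# Agrees with 1 + gcd(N-1, M-1) for a range of shapes.
for N in range(2, 8):
    for M in range(2, 8):
        assert count_fixed_points(N, M) == 1 + gcd(N - 1, M - 1)
print("ok")
```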

The number of cycles of any length k > 1 is given by (Cate & Twigg, 1977):

    (1/k) ∑_{d | k} μ(d) · gcd(N^{k/d} − 1, MN − 1),

where μ is the Möbius function and the sum is over the divisors d of k.

Furthermore, the cycle containing a=1 (i.e. the second element of the first row of the matrix) is always a cycle of maximum length L, and the lengths k of all other cycles must be divisors of L (Cate & Twigg, 1977).
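The maximal-length property of the cycle containing a = 1, and the divisibility of all other cycle lengths, can be verified directly (a sketch; `cycle_lengths` is our own helper name):

```python
def cycle_lengths(N, M):
    """Cycle lengths of the transposition permutation for an NxM matrix,
    keyed by the smallest element of each cycle."""
    MN = M * N
    P = lambda a: a if a == MN - 1 else (N * a) % (MN - 1)
    seen, lengths = set(), {}
    for start in range(MN):  # ascending, so start is the cycle minimum
        if start in seen:
            continue
        a, k = start, 0
        while a not in seen:
            seen.add(a)
            a = P(a)
            k += 1
        lengths[start] = k
    return lengths

for N, M in [(2, 4), (3, 5), (4, 6), (5, 7)]:
    lengths = cycle_lengths(N, M)
    L = lengths[1]  # length of the cycle containing a = 1
    assert L == max(lengths.values())            # maximal length
    assert all(L % k == 0 for k in lengths.values())  # all lengths divide L
print("ok")
```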

For a given cycle C, every element has the same greatest common divisor d = gcd(x,MN − 1). Proof (Brenner, 1973): Let s be the smallest element of the cycle, and d = gcd(s,MN − 1). From the definition of the permutation P above, every other element x of the cycle is obtained by repeatedly multiplying s by N modulo MN−1, and therefore every other element is divisible by d. But, since N and MN − 1 are coprime, x cannot be divisible by any factor of MN − 1 larger than d, and hence d = gcd(x,MN − 1). This theorem is useful in searching for cycles of the permutation, since an efficient search can look only at multiples of divisors of MN−1 (Brenner, 1973).

Laflin & Brebner (1970) pointed out that the cycles often come in pairs, which is exploited by several algorithms that permute pairs of cycles at a time. In particular, let s be the smallest element of some cycle C of length k. It follows that MN−1−s is also an element of a cycle of length k (possibly the same cycle). Proof: by the definition of P above, the length k of the cycle containing s is the smallest k > 0 such that N^k · s = s (mod MN−1). Clearly, this is the same as the smallest k > 0 such that N^k · (MN−1−s) = MN−1−s (mod MN−1), since we are just multiplying both sides by −1, and MN−1−s = −s (mod MN−1).
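The pairing of s with MN−1−s can also be checked numerically (an illustrative sketch; `cycle_len` is our own helper name):

```python
def cycle_len(a, N, M):
    """Length of the transposition-permutation cycle containing index a."""
    MN = M * N
    P = lambda x: x if x == MN - 1 else (N * x) % (MN - 1)
    k, x = 1, P(a)
    while x != a:
        x = P(x)
        k += 1
    return k

# s and MN-1-s always lie in cycles of the same length.
for N, M in [(3, 5), (4, 7), (5, 9)]:
    MN = M * N
    for s in range(MN):
        assert cycle_len(s, N, M) == cycle_len(MN - 1 - s, N, M)
print("ok")
```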

Algorithms

The following briefly summarizes the published algorithms to perform in-place matrix transposition. Source code implementing some of these algorithms can be found in the references, below.

Square matrices

For a square N×N matrix A(n,m), in-place transposition is easy because all of the cycles have length 1 (the diagonal elements A(n,n)) or length 2 (the upper triangle is swapped with the lower triangle). Pseudocode to accomplish this (assuming zero-based indices) is:

for n = 0 to N - 2
    for m = n + 1 to N - 1
        swap A(n,m) with A(m,n)
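The pseudocode translates directly into, for example, Python (a minimal sketch, with the matrix represented as a list of lists):

```python
def transpose_square_inplace(A):
    """In-place transpose of a square matrix stored as a list of lists:
    swap each upper-triangle element with its lower-triangle mirror."""
    n = len(A)
    for i in range(n - 1):
        for j in range(i + 1, n):
            A[i][j], A[j][i] = A[j][i], A[i][j]

A = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
transpose_square_inplace(A)
print(A)  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```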

This type of implementation, while simple, can exhibit poor performance due to poor cache-line utilization, especially when N is a power of two (due to cache-line conflicts in a CPU cache with limited associativity). The reason for this is that, as m is incremented in the inner loop, the memory address corresponding to A(n,m) or A(m,n) jumps discontiguously by N in memory (depending on whether the array is in column-major or row-major format, respectively). That is, the algorithm does not exploit the possibility of locality of reference.

One solution to improve the cache utilization is to "block" the algorithm to operate on several numbers at once, in blocks given by the cache-line size; unfortunately, this means that the algorithm depends on the size of the cache line (it is "cache-aware"), and on a modern computer with multiple levels of cache it requires multiple levels of machine-dependent blocking. Instead, it has been suggested (Frigo et al., 1999) that better performance can be obtained by a recursive algorithm: divide the matrix into four submatrices of roughly equal size, transpose the two submatrices along the diagonal recursively, and transpose and swap the two submatrices above and below the diagonal. (When N is sufficiently small, the simple algorithm above is used as a base case, as naively recursing all the way down to N=1 would have excessive function-call overhead.) This is a cache-oblivious algorithm, in the sense that it can exploit the cache line without the cache-line size being an explicit parameter.
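The recursive scheme can be sketched as follows. This is a simplified illustration, not the Frigo et al. implementation: for brevity it assumes N is a power of two, and the off-diagonal transpose-and-swap is written as a plain loop (a fully cache-oblivious version would recurse there as well):

```python
def transpose_rec(A, d=0, n=None, base=8):
    """Recursively transpose, in place, the n x n diagonal block of the
    square matrix A whose upper-left corner is (d, d).
    Simplifying assumption: n is a power of two."""
    if n is None:
        n = len(A)
    if n <= base:
        # Base case: the simple swap loop shown earlier.
        for i in range(n - 1):
            for j in range(i + 1, n):
                A[d + i][d + j], A[d + j][d + i] = A[d + j][d + i], A[d + i][d + j]
        return
    h = n // 2
    transpose_rec(A, d, h, base)        # transpose top-left block
    transpose_rec(A, d + h, h, base)    # transpose bottom-right block
    # Transpose-and-swap the off-diagonal blocks: TR(i,j) <-> BL(j,i).
    for i in range(h):
        for j in range(h):
            A[d + i][d + h + j], A[d + h + j][d + i] = \
                A[d + h + j][d + i], A[d + i][d + h + j]

n = 16
A = [[n * i + j for j in range(n)] for i in range(n)]
transpose_rec(A)
assert all(A[i][j] == n * j + i for i in range(n) for j in range(n))
```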

Non-square matrices: Following the cycles

For non-square matrices, the algorithms are more complicated. Many of the algorithms prior to 1980 could be described as "follow-the-cycles" algorithms. That is, they loop over the cycles, moving the data from one location to the next in the cycle. In pseudocode form:

for each length>1 cycle C of the permutation
    pick a starting address s in C
    let D = data at s
    let x = predecessor of s in the cycle
    while x ≠ s
        move data from x to successor of x
        let x = predecessor of x
    move data from D to successor of s

The differences between the algorithms lie mainly in how they locate the cycles, how they find the starting addresses in each cycle, and how they ensure that each cycle is moved exactly once. Typically, as discussed above, the cycles are moved in pairs, since s and MN−1−s are in cycles of the same length (possibly the same cycle). Sometimes, a small scratch array, typically of length M+N (e.g. Brenner, 1973; Cate & Twigg, 1977) is used to keep track of a subset of locations in the array that have been visited, to accelerate the algorithm.
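As a concrete illustration of the follow-the-cycles idea, here is a minimal Python sketch (ours, not from the cited papers). It walks each cycle forward, carrying the displaced value along, rather than walking predecessors as in the pseudocode above, and it uses one visited flag per element, i.e. the simple O(MN) bookkeeping scheme rather than the cleverer O(M+N) schemes of the published algorithms:

```python
def transpose_inplace_nonsquare(A, N, M):
    """Follow-the-cycles in-place transpose of an N x M matrix stored
    row-major in the flat list A of length N*M; the result is the M x N
    transpose, also row-major."""
    MN = N * M
    if MN <= 1:
        return
    visited = [False] * MN
    for s in range(MN):
        if visited[s]:
            continue
        # Walk the cycle containing s, pushing each value to its destination.
        a, val = s, A[s]
        while True:
            visited[a] = True
            dest = a if a == MN - 1 else (N * a) % (MN - 1)
            A[dest], val = val, A[dest]   # place val, pick up displaced value
            a = dest
            if a == s:
                break

A = list(range(8))
transpose_inplace_nonsquare(A, 2, 4)
print(A)  # [0, 4, 1, 5, 2, 6, 3, 7]
```

On the 2×4 example this reproduces the transposed storage sequence (0,4,1,5,2,6,3,7) from the beginning of the article.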

In order to determine whether a given cycle has been moved already, the simplest scheme would be to use O(MN) auxiliary storage, one bit per element, to indicate whether a given element has been moved. To use only O(M+N) or even O(log MN) auxiliary storage, more complicated algorithms are required, and the known algorithms have a worst-case computational cost of O(MN log MN) at best, as first proved by Fich et al. (1995; see also Gustavson & Swirszcz, 2007).

Such algorithms are designed to move each data element exactly once. However, they also involve a considerable amount of arithmetic to compute the cycles, and require heavily non-consecutive memory accesses since the adjacent elements of the cycles differ by multiplicative factors of N, as discussed above.
