Implementing a blend algorithm with the ARM v6 SIMD instructions - loughsky - ChinaUnix Blog

Implementing a blend algorithm with the ARM v6 SIMD instructions

Category: LINUX

2009-06-05 08:55:12

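The ARM v6 SIMD instructions are relatively weak, and I had assumed they could not handle a complex blend algorithm. Then I found one in pixman, the pixel library used by cairo, and it is very neatly written.

The limitation of this code is that, as built here (with inner_branch left undefined), it has no dedicated fast paths for a = 0 or a = 255; every pixel goes through the general path. If a large share of the alpha values are 0 or 255 during blending, performance will be somewhat worse than it could be.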
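To make the a = 0 / a = 255 point concrete, here is a minimal scalar sketch in C of what such fast paths could look like. It assumes premultiplied ARGB pixels, where a fully transparent pixel is the all-zero word. The names mul_div255, add_sat8, over_general and blend_row_fastpath are hypothetical helpers made up for this illustration; they are not part of pixman.

```c
#include <stdint.h>

/* Rounded x * a / 255 for x, a in 0..255. */
static uint8_t mul_div255(uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t)x * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Saturating 8-bit add: clamps the sum at 255. */
static uint8_t add_sat8(uint8_t a, uint8_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

/* General path: result = src + dst * (255 - alpha) / 255, per byte. */
static uint32_t over_general(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        out |= (uint32_t)add_sat8(s, mul_div255(d, inv_alpha)) << shift;
    }
    return out;
}

/* Blend one row, skipping the arithmetic for the two trivial cases. */
void blend_row_fastpath(uint32_t *dst, const uint32_t *src, int width)
{
    for (int i = 0; i < width; i++) {
        uint32_t s = src[i];
        if ((s >> 24) == 0xff) {
            dst[i] = s;                 /* opaque: source replaces dest */
        } else if (s == 0) {
            /* fully transparent: dest is unchanged, no load or store */
        } else {
            dst[i] = over_general(s, dst[i]);
        }
    }
}
```

Both shortcuts produce the same result as the general path, so they only pay off when such pixels are common; otherwise the extra compares and branches are pure overhead, which is presumably why the assembly below keeps its zero check behind a disabled macro.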

void
fbCompositeSrc_8888x8888arm (pixman_op_t op,
    pixman_image_t * pSrc,
    pixman_image_t * pMask,
    pixman_image_t * pDst,
    int16_t      xSrc,
    int16_t      ySrc,
    int16_t      xMask,
    int16_t      yMask,
    int16_t      xDst,
    int16_t      yDst,
    uint16_t     width,
    uint16_t     height)
{
    uint32_t *dstLine, *dst;
    uint32_t *srcLine, *src;
    int dstStride, srcStride;
    uint16_t w;
    /* rounding bias of 0x80 for each of the two 16-bit lanes */
    uint32_t component_half = 0x800080;
    uint32_t upper_component_mask = 0xff00ff00;
    uint32_t alpha_mask = 0xff;

    fbComposeGetStart (pDst, xDst, yDst, uint32_t, dstStride, dstLine, 1);
    fbComposeGetStart (pSrc, xSrc, ySrc, uint32_t, srcStride, srcLine, 1);

    while (height--)
    {
        /* set up the pointers for this row and advance to the next one */
        dst = dstLine;
        dstLine += dstStride;
        src = srcLine;
        srcLine += srcStride;
        w = width;

        //#define inner_branch
        asm volatile (
            "cmp %[w], #0\n\t"
            "beq 2f\n\t"
            "1:\n\t"
            /* load src */
            "ldr r5, [%[src]], #4\n\t"
#ifdef inner_branch
            /* We can avoid doing the multiplication in two cases: 0x0 or 0xff.
             * The 0x0 case also allows us to avoid doing an unnecessary data
             * write which is more valuable so we only check for that */
            "cmp r5, #0\n\t"
            "beq 3f\n\t"

            /* = 255 - alpha */
            "sub r8, %[alpha_mask], r5, lsr #24\n\t"

            "ldr r4, [%[dest]] \n\t"

#else
            "ldr r4, [%[dest]] \n\t"

            /* = 255 - alpha */
            "sub r8, %[alpha_mask], r5, lsr #24\n\t"
#endif
            /* split dest into 16-bit lanes: bytes 0,2 in r6 and bytes 1,3 in r7 */
            "uxtb16 r6, r4\n\t"
            "uxtb16 r7, r4, ror #8\n\t"

            /* multiply by 257 and divide by 65536:
             * first t = c * (255 - alpha) + 0x80 in each lane */
            "mla r6, r6, r8, %[component_half]\n\t"
            "mla r7, r7, r8, %[component_half]\n\t"

            /* then t += t >> 8; the result is the high byte of each lane */
            "uxtab16 r6, r6, r6, ror #8\n\t"
            "uxtab16 r7, r7, r7, ror #8\n\t"

            /* recombine the 0xff00ff00 bytes of r6 and r7 */
            "and r7, r7, %[upper_component_mask]\n\t"
            "uxtab16 r6, r7, r6, ror #8\n\t"

            /* per-byte saturating add of src */
            "uqadd8 r5, r6, r5\n\t"

#ifdef inner_branch
            /* note: as written, the str below still runs after the branch
             * to 3 and stores r5 (zero) to dest */
            "3:\n\t"

#endif
            "str r5, [%[dest]], #4\n\t"
            /* increment counter and jmp to top */
            "subs %[w], %[w], #1\n\t"
            "bne 1b\n\t"
            "2:\n\t"
            : [w] "+r" (w), [dest] "+r" (dst), [src] "+r" (src)
            : [component_half] "r" (component_half), [upper_component_mask] "r" (upper_component_mask),
              [alpha_mask] "r" (alpha_mask)
            : "r4", "r5", "r6", "r7", "r8", "cc", "memory"
            );
    }
}
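
For readers without an ARM v6 target at hand, the per-pixel arithmetic of the loop above can be followed in portable C. The sketch below emulates each SIMD instruction with ordinary 32-bit operations and mirrors the register names of the assembly. The helper functions are named after the instructions they imitate, but they are illustrative stand-ins written for this post, not compiler intrinsics and not pixman code; over_8888_emulated and check_div255_trick are likewise hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right; r is 0 or 8 in this sketch. */
static uint32_t ror32(uint32_t x, unsigned r)
{
    return r ? (x >> r) | (x << (32 - r)) : x;
}

/* UXTB16: zero-extend bytes 0 and 2 of (x ror rot) into two 16-bit lanes. */
static uint32_t uxtb16(uint32_t x, unsigned rot)
{
    return ror32(x, rot) & 0x00ff00ffu;
}

/* UXTAB16: lane-wise 16-bit add of rn and uxtb16(rm ror rot). */
static uint32_t uxtab16(uint32_t rn, uint32_t rm, unsigned rot)
{
    uint32_t e = uxtb16(rm, rot);
    uint32_t lo = (rn + e) & 0xffffu;
    uint32_t hi = ((rn >> 16) + (e >> 16)) & 0xffffu;
    return (hi << 16) | lo;
}

/* UQADD8: per-byte unsigned saturating add. */
static uint32_t uqadd8(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (unsigned s = 0; s < 32; s += 8) {
        uint32_t sum = ((a >> s) & 0xffu) + ((b >> s) & 0xffu);
        out |= (sum > 0xffu ? 0xffu : sum) << s;
    }
    return out;
}

/* One pixel of the loop body, step by step. */
uint32_t over_8888_emulated(uint32_t src, uint32_t dst)
{
    uint32_t r8 = 0xffu - (src >> 24);   /* 255 - alpha */
    uint32_t r6 = uxtb16(dst, 0);        /* bytes 0 and 2 of dest */
    uint32_t r7 = uxtb16(dst, 8);        /* bytes 1 and 3 of dest */
    r6 = r6 * r8 + 0x800080u;            /* mla: no carry between lanes */
    r7 = r7 * r8 + 0x800080u;
    r6 = uxtab16(r6, r6, 8);             /* t += t >> 8 */
    r7 = uxtab16(r7, r7, 8);
    r7 &= 0xff00ff00u;                   /* results for bytes 1 and 3 */
    r6 = uxtab16(r7, r6, 8);             /* merge in bytes 0 and 2 */
    return uqadd8(r6, src);              /* saturating add of src */
}

/* Exhaustive check that the shift trick equals rounded x * a / 255. */
void check_div255_trick(void)
{
    for (uint32_t x = 0; x < 256; x++) {
        for (uint32_t a = 0; a < 256; a++) {
            uint32_t t = x * a + 0x80;
            assert(((t + (t >> 8)) >> 8) == (x * a + 127) / 255);
        }
    }
}
```

Each lane holds at most 255 * 255 + 0x80 = 65153, which fits in 16 bits, so a single 32-bit mla can process two components at once without one lane spilling into the other. That is the trick that lets the modest ARM v6 instruction set handle the blend.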
