This time I got results with out-of-cache data. To eliminate cache
effects, I used a 1MB source buffer and a 1MB destination buffer, and
repeated the memcpy*()'s 1MB / datasize times in the inner loop and 1024
times in the outer loop. The total data size was the same 1GB as before,
but the results were quite different from those with in-cache data.

In this test, the non-temporal movntq instruction was obviously a big win.
Since it doesn't pollute cache lines, you can get about 2x the performance
when copying data that is not in cache.
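
(As a rough sanity check against the attached numbers: each test moves
1 GB in total, so libc memcpy's ~2.77 s works out to roughly 370 MB/s,
while arjanv's MOVNTQ at ~1.58 s is roughly 650 MB/s, i.e. about 1.75x,
in the same ballpark as the 2x above.)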

Also, I found that my MMX-optimized i686_copyin() is faster than plain
old memcpy only for data > 2~3 KB. It seems that saving/restoring the FP
state to/from the stack is quite expensive for small copies (it means
moving the 108-byte FPU state between the processor and memory, plus some
overhead).
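
For illustration, here is a rough userland sketch of the cutoff idea. This
is not i686_copyin itself: the 512-byte cutoff is just the value used in
the patch later in this thread, and the 8-byte MOVQ loop is deliberately
simpler than the unrolled one (assumes gcc on i386 with MMX):

/*
 * Sketch only: copy with a size cutoff so that small copies never pay
 * for saving/restoring the 108-byte x87/MMX state.
 */
#include <stddef.h>
#include <string.h>

#define MMX_CUTOFF	512	/* assumed threshold, not a measured optimum */

void *
copy_with_cutoff(void *dst, const void *src, size_t len)
{
	struct { unsigned char b[108]; } fpstate;	/* fnsave image */
	const char *s = src;
	char *d = dst;
	size_t i;

	if (len < MMX_CUTOFF)
		return memcpy(dst, src, len);	/* cheap path, FPU untouched */

	__asm__ __volatile__("fnsave %0" : "=m" (fpstate));
	for (i = 0; i < len / 8; i++) {
		__asm__ __volatile__("movq (%0),%%mm0\n\t"
				     "movq %%mm0,(%1)\n\t"
				     : : "r" (s), "r" (d) : "memory");
		s += 8;
		d += 8;
	}
	/* no emms needed: frstor also restores the x87 tag word */
	__asm__ __volatile__("frstor %0" : : "m" (fpstate));

	if (len & 7)
		memcpy(d, s, len & 7);	/* odd tail */
	return dst;
}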

I'll come up with a finalized i686_copyin/out() soon.

Jun-Young

--
Bang Jun-Young

--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memcpy_bench.uncached.txt"

addr1=0x804c000 addr2=0x814c000

memcpy 64B -- 16384 loops
  aligned blocks
      libc memcpy                                        2.893993 s
      rep movsw                                          2.859771 s
      asm loop                                           2.669005 s
      i686_copyin                                        2.910439 s
      i686_copyin2                                       2.885610 s
      MMX memcpy using MOVQ                              2.675665 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.949940 s
      with simple MOVUSB (no prefetch)                   2.719580 s
      arjanv's MOVQ (with prefetch)                      2.938366 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.552954 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.545507 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.723010 s
      MMX memcpy using MOVQ                              2.893861 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.093558 s
      with simple MOVUSB (no prefetch)                   2.973506 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        3.125790 s
      MMX memcpy using MOVQ                              2.661766 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.740727 s
      with simple MOVUSB (no prefetch)                   2.715262 s

addr1=0x804c000 addr2=0x814c000
memcpy 1024B -- 1024 loops
  aligned blocks
      libc memcpy                                        2.761827 s
      rep movsw                                          2.764354 s
      asm loop                                           2.820187 s
      i686_copyin                                        2.647857 s
      i686_copyin2                                       2.647648 s
      MMX memcpy using MOVQ                              2.574933 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.870815 s
      with simple MOVUSB (no prefetch)                   2.684049 s
      arjanv's MOVQ (with prefetch)                      2.518789 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.588186 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.698439 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.800100 s
      MMX memcpy using MOVQ                              2.588999 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.852392 s
      with simple MOVUSB (no prefetch)                   2.723908 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.749374 s
      MMX memcpy using MOVQ                              2.683349 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.203756 s
      with simple MOVUSB (no prefetch)                   2.750306 s

addr1=0x804c000 addr2=0x814c000
memcpy 4kB -- 256 loops
  aligned blocks
      libc memcpy                                        2.758545 s
      rep movsw                                          2.759825 s
      asm loop                                           2.818919 s
      i686_copyin                                        2.633134 s
      i686_copyin2                                       2.641534 s
      MMX memcpy using MOVQ                              2.571201 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.795929 s
      with simple MOVUSB (no prefetch)                   2.681924 s
      arjanv's MOVQ (with prefetch)                      2.512153 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.577637 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.688840 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.828267 s
      MMX memcpy using MOVQ                              2.584795 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.773777 s
      with simple MOVUSB (no prefetch)                   2.691957 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.711029 s
      MMX memcpy using MOVQ                              2.690554 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.047554 s
      with simple MOVUSB (no prefetch)                   2.782641 s

addr1=0x804c000 addr2=0x814c000
memcpy 64kB -- 16 loops
  aligned blocks
      libc memcpy                                        2.764299 s
      rep movsw                                          2.767497 s
      asm loop                                           2.826478 s
      i686_copyin                                        2.626365 s
      i686_copyin2                                       2.625997 s
      MMX memcpy using MOVQ                              2.570352 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.767928 s
      with simple MOVUSB (no prefetch)                   2.685339 s
      arjanv's MOVQ (with prefetch)                      2.521904 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.575878 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.682403 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.823552 s
      MMX memcpy using MOVQ                              2.580810 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.767096 s
      with simple MOVUSB (no prefetch)                   2.707592 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.713003 s
      MMX memcpy using MOVQ                              2.668149 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.975933 s
      with simple MOVUSB (no prefetch)                   2.779886 s

addr1=0x804c000 addr2=0x814c000
memcpy 128kB -- 8 loops
  aligned blocks
      libc memcpy                                        2.766495 s
      rep movsw                                          2.767812 s
      asm loop                                           2.827207 s
      i686_copyin                                        2.626962 s
      i686_copyin2                                       2.618238 s
      MMX memcpy using MOVQ                              2.570613 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.775084 s
      with simple MOVUSB (no prefetch)                   2.684980 s
      arjanv's MOVQ (with prefetch)                      2.521927 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.575982 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.682593 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.817080 s
      MMX memcpy using MOVQ                              2.588906 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.766316 s
      with simple MOVUSB (no prefetch)                   2.706869 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.711935 s
      MMX memcpy using MOVQ                              2.674179 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.963451 s
      with simple MOVUSB (no prefetch)                   2.780192 s

addr1=0x804c000 addr2=0x814c000
memcpy 256kB -- 4 loops
  aligned blocks
      libc memcpy                                        2.766599 s
      rep movsw                                          2.767784 s
      asm loop                                           2.828783 s
      i686_copyin                                        2.619552 s
      i686_copyin2                                       2.627876 s
      MMX memcpy using MOVQ                              2.571837 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.776927 s
      with simple MOVUSB (no prefetch)                   2.686435 s
      arjanv's MOVQ (with prefetch)                      2.523016 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.577187 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.675317 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.827427 s
      MMX memcpy using MOVQ                              2.590171 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.769825 s
      with simple MOVUSB (no prefetch)                   2.708104 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.710984 s
      MMX memcpy using MOVQ                              2.674800 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.972209 s
      with simple MOVUSB (no prefetch)                   2.787717 s

addr1=0x804c000 addr2=0x814c000
memcpy 512kB -- 2 loops
  aligned blocks
      libc memcpy                                        2.766847 s
      rep movsw                                          2.767707 s
      asm loop                                           2.811354 s
      i686_copyin                                        2.626655 s
      i686_copyin2                                       2.626876 s
      MMX memcpy using MOVQ                              2.571146 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.775052 s
      with simple MOVUSB (no prefetch)                   2.684812 s
      arjanv's MOVQ (with prefetch)                      2.513970 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.576279 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.683077 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.827907 s
      MMX memcpy using MOVQ                              2.589284 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.767601 s
      with simple MOVUSB (no prefetch)                   2.706929 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.702820 s
      MMX memcpy using MOVQ                              2.675799 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.969484 s
      with simple MOVUSB (no prefetch)                   2.785175 s


--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memcpy_bench.c"

/* -*- c-file-style: "linux" -*- */

/* memcpy speed benchmark using different i86-specific routines.
*
* Framework (C) 2001 by Martin Pool , based on speed.c
* by tridge.
*
* Routines lifted from all kinds of places.
*
* You must not use floating-point code anywhere in this application
* because it scribbles on the FP state and does not reset it.  */


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void *memcpy_rep_movsl(void *to, const void *from, size_t len);
void *memcpy_words(void *to, const void *from, size_t len);
void *i686_copyin(void *to, const void *from, size_t len);
void *i686_copyin2(void *to, const void *from, size_t len);

#define MAX(a,b) ((a)>(b)?(a):(b))
#define MIN(a,b) ((a)<(b)?(a):(b))

#include <sys/time.h>
#include <sys/resource.h>
struct rusage tp1,tp2;

static void start_timer()
{
getrusage(RUSAGE_SELF,&tp1);
}


static long end_timer()
{
getrusage(RUSAGE_SELF,&tp2);
#if 0
printf ("tp1 = %ld.%05ld, tp2 = %ld.%05ld\n",
(long) tp1.ru_utime.tv_sec, (long) tp1.ru_utime.tv_usec,
(long) tp2.ru_utime.tv_sec, (long) tp2.ru_utime.tv_usec);
#endif

return ((tp2.ru_utime.tv_sec - tp1.ru_utime.tv_sec) * 1000000 +
(tp2.ru_utime.tv_usec - tp1.ru_utime.tv_usec));
}




/*
* By Ingo Molnar and Doug Ledford; hacked up to remove
* kernel-specific stuff like saving/restoring float registers.
*
* */
void *
memcpy_movusb (void *to, const void *from, size_t n)
{
size_t size;

#define STEP 0x20
#define ALIGN 0x10
if ((unsigned long)to & (ALIGN-1)) {
size = ALIGN - ((unsigned long)to & (ALIGN-1));
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
     "movups %%xmm0,(%1)\n\t"
     :
     : "r" (from),
     "r" (to));
n -= size;
from += size;
to += size;
}
/*
* If the copy would have tailings, take care of them
* now instead of later
*/
if (n & (ALIGN-1)) {
size = n - ALIGN;
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
     "movups %%xmm0,(%1)\n\t"
     :
     : "r" (from + size),
     "r" (to + size));
n &= ~(ALIGN-1);
}
/*
* Prefetch the first two cachelines now.
*/
__asm__ __volatile__("prefetchnta 0x00(%0)\n\t"
     "prefetchnta 0x20(%0)\n\t"
     :
     : "r" (from));
 
while (n >= STEP) {
__asm__ __volatile__(
"movups 0x00(%0),%%xmm0\n\t"
"movups 0x10(%0),%%xmm1\n\t"
"movntps %%xmm0,0x00(%1)\n\t"
"movntps %%xmm1,0x10(%1)\n\t"
:
: "r" (from), "r" (to)
: "memory");
from += STEP;
/*
* Note: Intermixing the prefetch at *exactly* this point
* in time has been shown to be the fastest possible.
* Timing these prefetch instructions is a complete black
* art with nothing but trial and error showing the way.
* To that extent, this optimum version was found by using
* a userland version of this routine that we clocked for
* lots of runs.  We then fiddled with ordering until we
* settled on our highest speed routines.  So, the long
* and short of this is, don't mess with instruction ordering
* here or suffer the performance penalties you will.
*/
__asm__ __volatile__(
"prefetchnta 0x20(%0)\n\t"
:
: "r" (from));
to += STEP;
n -= STEP;
}

return to;
}

void *
memcpy_simple_movusb (void *to, const void *from, size_t n)
{
size_t size;

#define STEP 0x20
#define ALIGN 0x10
if ((unsigned long)to & (ALIGN-1)) {
size = ALIGN - ((unsigned long)to & (ALIGN-1));
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
     "movups %%xmm0,(%1)\n\t"
     :
     : "r" (from),
     "r" (to));
n -= size;
from += size;
to += size;
}
/*
* If the copy would have tailings, take care of them
* now instead of later
*/
if (n & (ALIGN-1)) {
size = n - ALIGN;
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
     "movups %%xmm0,(%1)\n\t"
     :
     : "r" (from + size),
     "r" (to + size));
n &= ~(ALIGN-1);
}

while (n >= STEP) {
__asm__ __volatile__(
"movups 0x00(%0),%%xmm0\n\t"
"movups 0x10(%0),%%xmm1\n\t"
"movups %%xmm0,0x00(%1)\n\t"
"movups %%xmm1,0x10(%1)\n\t"
:
: "r" (from), "r" (to)
: "memory");
from += STEP;
to += STEP;
n -= STEP;
}

return to;
}


/* From Linux 2.4.8.  I think this must be aligned. */
void *
memcpy_mmx (void *to, const void *from, size_t len)
{
int i;

for(i = 0; i < len / 64; i++) {
      __asm__ __volatile__ (
   "movq (%0), %%mm0\n"
   "\tmovq 8(%0), %%mm1\n"
   "\tmovq 16(%0), %%mm2\n"
   "\tmovq 24(%0), %%mm3\n"
   "\tmovq %%mm0, (%1)\n"
   "\tmovq %%mm1, 8(%1)\n"
   "\tmovq %%mm2, 16(%1)\n"
   "\tmovq %%mm3, 24(%1)\n"
   "\tmovq 32(%0), %%mm0\n"
   "\tmovq 40(%0), %%mm1\n"
   "\tmovq 48(%0), %%mm2\n"
   "\tmovq 56(%0), %%mm3\n"
   "\tmovq %%mm0, 32(%1)\n"
   "\tmovq %%mm1, 40(%1)\n"
   "\tmovq %%mm2, 48(%1)\n"
   "\tmovq %%mm3, 56(%1)\n"
   : : "r" (from), "r" (to) : "memory");
from += 64;
to += 64;
}

if (len & 63)
memcpy(to, from, len & 63);

return to;
}
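
/*
 * Note that memcpy_mmx (and the arjanv variants below) never executes
 * emms, which is why the comment at the top of this file warns against
 * using floating point anywhere in the program: after a copy the x87
 * tag word is left in MMX state and later FP code would misbehave.
 */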

static void print_time (char const *msg,
long long loops,
long t)
{
printf("      %-50s %ld.%06ld s\n", msg, t/1000000,
       t % 1000000);
}

void *
memcpy_arjanv (void *to, const void *from, size_t len)
{
int i;

__asm__ __volatile__ (
"1: prefetchnta (%0)\n"
"   prefetchnta 64(%0)\n"
"   prefetchnta 128(%0)\n"
"   prefetchnta 192(%0)\n"
"   prefetchnta 256(%0)\n"
: : "r" (from) );

for(i=0; i<len/64; i++) {
__asm__ __volatile__ (
"1: prefetchnta 320(%0)\n"
"2: movq (%0), %%mm0\n"
"   movq 8(%0), %%mm1\n"
"   movq 16(%0), %%mm2\n"
"   movq 24(%0), %%mm3\n"
"   movq %%mm0, (%1)\n"
"   movq %%mm1, 8(%1)\n"
"   movq %%mm2, 16(%1)\n"
"   movq %%mm3, 24(%1)\n"
"   movq 32(%0), %%mm0\n"
"   movq 40(%0), %%mm1\n"
"   movq 48(%0), %%mm2\n"
"   movq 56(%0), %%mm3\n"
"   movq %%mm0, 32(%1)\n"
"   movq %%mm1, 40(%1)\n"
"   movq %%mm2, 48(%1)\n"
"   movq %%mm3, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from+=64;
to+=64;
}

/*
*Now do the tail of the block
*/
if (len&63)
memcpy(to, from, len&63);

return to;
}

void *
memcpy_arjanv_movntq (void *to, const void *from, size_t len)
{
int i;

__asm__ __volatile__ (
"1: prefetchnta (%0)\n"
"   prefetchnta 64(%0)\n"
"   prefetchnta 128(%0)\n"
"   prefetchnta 192(%0)\n"
: : "r" (from) );

for(i=0; i<len/64; i++) {
__asm__ __volatile__ (
"   prefetchnta 200(%0)\n"
"   movq (%0), %%mm0\n"
"   movq 8(%0), %%mm1\n"
"   movq 16(%0), %%mm2\n"
"   movq 24(%0), %%mm3\n"
"   movq 32(%0), %%mm4\n"
"   movq 40(%0), %%mm5\n"
"   movq 48(%0), %%mm6\n"
"   movq 56(%0), %%mm7\n"
"   movntq %%mm0, (%1)\n"
"   movntq %%mm1, 8(%1)\n"
"   movntq %%mm2, 16(%1)\n"
"   movntq %%mm3, 24(%1)\n"
"   movntq %%mm4, 32(%1)\n"
"   movntq %%mm5, 40(%1)\n"
"   movntq %%mm6, 48(%1)\n"
"   movntq %%mm7, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from+=64;
to+=64;
}
/*
*Now do the tail of the block
*/
if (len&63)
memcpy(to, from, len&63);

return to;
}

void *
memcpy_arjanv_interleave (void *to, const void *from, size_t len)
{
int i;

__asm__ __volatile__ (
"1: prefetchnta (%0)\n"
"   prefetchnta 64(%0)\n"
"   prefetchnta 128(%0)\n"
"   prefetchnta 192(%0)\n"
: : "r" (from) );


for(i=0; i<len/64; i++) {
__asm__ __volatile__ (
"   prefetchnta 168(%0)\n"
"   movq (%0), %%mm0\n"
"   movntq %%mm0, (%1)\n"
"   movq 8(%0), %%mm1\n"
"   movntq %%mm1, 8(%1)\n"
"   movq 16(%0), %%mm2\n"
"   movntq %%mm2, 16(%1)\n"
"   movq 24(%0), %%mm3\n"
"   movntq %%mm3, 24(%1)\n"
"   movq 32(%0), %%mm4\n"
"   movntq %%mm4, 32(%1)\n"
"   movq 40(%0), %%mm5\n"
"   movntq %%mm5, 40(%1)\n"
"   movq 48(%0), %%mm6\n"
"   movntq %%mm6, 48(%1)\n"
"   movq 56(%0), %%mm7\n"
"   movntq %%mm7, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from+=64;
to+=64;
}
/*
*Now do the tail of the block
*/
if (len&63)
memcpy(to, from, len&63);

return to;
}

static void wrap (char *p1,
  char *p2,
  size_t size,
  long loops,
  void *(*bfn) (void *, const void *, size_t),
  const char *msg)
{
long t;
int i, j;
char *tmp1, *tmp2;


memset(p2,42,size);

tmp1 = p1;
tmp2 = p2;

start_timer();

for (j = 0; j < 1024; j++) {
for (i=0; i<loops; i++) {
bfn (tmp1, tmp2, size);
tmp1 += size;
tmp2 += size;
}
tmp1 = p1;
tmp2 = p2;
}

t = end_timer();

print_time (msg, loops, t);
}
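
/*
 * Each call to wrap() therefore moves 1024 (outer) * loops (inner) * size
 * bytes = 1024 * 1 MB = 1 GB regardless of block size, so the timings for
 * different block sizes are directly comparable.
 */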

static void memcpy_test(size_t size)
{
long loops = 1024*1024 / size;

/* We need to make sure the blocks are *VERY* aligned, because
   MMX is potentially pretty fussy. */

char *p1 = (char *) malloc (1024 * 1024);
char *p2 = (char *) malloc (1024 * 1024);

printf("addr1=%p addr2=%p\n", p1, p2);

if (size > 2048)
printf ("memcpy %dkB -- %ld loops\n", size>>10, loops);
else
printf ("memcpy %dB -- %ld loops\n", size, loops);


printf ("  aligned blocks\n");

wrap (p1, p2, size, loops, memcpy, "libc memcpy");
wrap (p1, p2, size, loops, memcpy_rep_movsl, "rep movsw");
wrap (p1, p2, size, loops, memcpy_words, "asm loop");
wrap (p1, p2, size, loops, i686_copyin, "i686_copyin");
wrap (p1, p2, size, loops, i686_copyin2, "i686_copyin2");
wrap (p1, p2, size, loops, memcpy_mmx,
"MMX memcpy using MOVQ");
wrap(p1, p2, size, loops, memcpy_movusb,
"with mingo's MOVUSB (prefetch, non-temporal)");
wrap (p1, p2, size, loops, memcpy_simple_movusb,
      "with simple MOVUSB (no prefetch)");
wrap (p1, p2, size, loops, memcpy_arjanv,
      "arjanv's MOVQ (with prefetch)");
wrap (p1, p2, size, loops, memcpy_arjanv_movntq,
      "arjanv's MOVNTQ (with prefetch, for Athlon)");
wrap (p1, p2, size, loops, memcpy_arjanv_interleave,
      "arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA");

printf ("  +0/+4 moderately unaligned blocks\n");

wrap (p1, p2+4, size, loops, memcpy, "libc memcpy");
wrap (p1, p2+4, size, loops, memcpy_mmx,
"MMX memcpy using MOVQ");
wrap(p1, p2+4, size, loops, memcpy_movusb,
"with mingo's MOVUSB (prefetch, non-temporal)");
wrap (p1, p2+4, size, loops, memcpy_simple_movusb,
      "with simple MOVUSB (no prefetch)");

printf ("  +10/+13 cruelly unaligned blocks\n");

wrap (p1+10, p2+13, size, loops, memcpy, "libc memcpy");
wrap (p1+10, p2+13, size, loops, memcpy_mmx,
"MMX memcpy using MOVQ");
wrap(p1+10, p2+13, size, loops, memcpy_movusb,
"with mingo's MOVUSB (prefetch, non-temporal)");
wrap (p1+10, p2+13, size, loops, memcpy_simple_movusb,
      "with simple MOVUSB (no prefetch)");

puts("");

free(p1); free(p2);
}


int main (void)
{
memcpy_test(64);
#if 0
memcpy_test(1<<7);
memcpy_test(1<<8);
memcpy_test(1<<9);
#endif
memcpy_test(1024);
#if 0
memcpy_test(1<<11);
#endif
memcpy_test(4096);
#if 0
memcpy_test(1<<13);
memcpy_test(1<<14);
memcpy_test(1<<15);
#endif
memcpy_test(1<<16);
memcpy_test(1<<17);
memcpy_test(1<<18);
memcpy_test(1<<19);
#if 0
memcpy_test(1<<20);
#endif
return 0;
}


--------------------next---------------------
On Wed, Oct 16, 2002 at 04:18:30AM +0900, Bang Jun-Young wrote:
> Another attached patch is i686 version of copyin(9) that makes use
> of MMX insns. It works well with intops-only programs, but doesn't
> with ones like XFree86 that uses FP ops. In this case, it would be
> helpful if NPX handling code was imported from FreeBSD (they have
> i586 optimized version of copyin/out(9)). Can anybody give me some
> comments wrt this?
Yup, there's a lot to be had by using SSE(2) instructions, copying
in 128bit quantities is quite a useful thing to do. It's been
on my todo list for a while.
I've been playing with a few SSE memcpy functions myself, but
did not get around to adding the extra checks to the FP
save/restore code yet. There are some checks that need to
be done. It comes down to:
* Don't mess up the current process' FP state, so save it if necessary.
* Don't bother if there aren't enough bytes to copy, since you're
paying the price of an entire FP save if someone was using the FPU.
* If you're going all the way, and are using memcpy with SSE in
the kernel too, be careful about interrupts. If an interrupt comes in
during the FP save path, it will mess things up. And maybe
you don't want to use FP in an interrupt at all; it'll
cause a ton of FP save/restore actions.
It's not overly complicated to do, but it's important to take all
scenarios into account. copyin/out is the simplest case, since
you should be in a process context when doing those.
I'll probably have some time to spend on this soon (next month).
If you're going to work on it before then, please let me review
the changes.
- Frank

--------------------next---------------------
Here is a new version of i686_copyin(). By saving the FPU state on the
stack, I could make it work with programs that use FP operations,
including XFree86, xmms, mozilla, etc.
In this version, I set the minimum length for using the MMX bcopy to 512
bytes. Since I don't know of a kernel profiling tool or a method to
measure copyin performance at the kernel level, the number may be too
small or too large.
Possible todo:
- i686_copyout(), i686_kcopy(), i686_memcpy(), ...
- use prefetch and movntq instructions for PIII/P4 or Athlon.
- use npxproc to eliminate the overhead of saving the FPU state, as
FreeBSD does.
Index: locore.s
===================================================================
RCS file: /cvsroot/syssrc/sys/arch/i386/i386/locore.s,v
retrieving revision 1.265
diff -u -r1.265 locore.s
--- locore.s 2002/10/05 21:20:00 1.265
+++ locore.s 2002/10/22 16:42:17
@@ -951,7 +951,7 @@
#define DEFAULT_COPYIN _C_LABEL(i386_copyin) /* XXX */
#elif defined(I686_CPU)
#define DEFAULT_COPYOUT _C_LABEL(i486_copyout) /* XXX */
-#define DEFAULT_COPYIN _C_LABEL(i386_copyin) /* XXX */
+#define DEFAULT_COPYIN _C_LABEL(i686_copyin) /* XXX */
#endif

.data
@@ -1159,6 +1159,114 @@
xorl %eax,%eax
ret
#endif /* I386_CPU || I486_CPU || I586_CPU || I686_CPU */
+
+#if defined(I686_CPU)
+/* LINTSTUB: Func: int i686_copyin(const void *uaddr, void *kaddr, size_t len) */
+ENTRY(i686_copyin)
+ pushl %esi
+ pushl %edi
+ pushl %ebx
+ GET_CURPCB(%eax)
+ movl $_C_LABEL(i686_copy_fault),PCB_ONFAULT(%eax)
+
+ movl 16(%esp),%esi
+ movl 20(%esp),%edi
+ movl 24(%esp),%eax
+
+ /*
+ * We check that the end of the destination buffer is not past the end
+ * of the user's address space. If it's not, then we only need to
+ * check that each page is readable, and the CPU will do that for us.
+ */
+ movl %esi,%edx
+ addl %eax,%edx
+ jc _C_LABEL(i686_copy_efault)
+ cmpl $VM_MAXUSER_ADDRESS,%edx
+ ja _C_LABEL(i686_copy_efault)
+
+ cmpl $512,%eax
+ jb 2f
+
+ xorl %ebx,%ebx
+ movl %eax,%edx
+ shrl $6,%edx
+
+ /*
+ * Save FPU state in stack.
+ */
+ smsw %cx
+ clts
+ subl $108,%esp
+ fnsave 0(%esp)
+
+1:
+ movq (%esi),%mm0
+ movq 8(%esi),%mm1
+ movq 16(%esi),%mm2
+ movq 24(%esi),%mm3
+ movq 32(%esi),%mm4
+ movq 40(%esi),%mm5
+ movq 48(%esi),%mm6
+ movq 56(%esi),%mm7
+ movq %mm0,(%edi)
+ movq %mm1,8(%edi)
+ movq %mm2,16(%edi)
+ movq %mm3,24(%edi)
+ movq %mm4,32(%edi)
+ movq %mm5,40(%edi)
+ movq %mm6,48(%edi)
+ movq %mm7,56(%edi)
+
+ addl $64,%esi
+ addl $64,%edi
+ incl %ebx
+ cmpl %edx,%ebx
+ jb 1b
+
+ /*
+ * Restore FPU state.
+ */
+ frstor 0(%esp)
+ addl $108,%esp
+ lmsw %cx
+
+ andl $63,%eax
+ je 3f
+
+2:
+ /* bcopy(%esi, %edi, %eax); */
+ cld
+ movl %eax,%ecx
+ shrl $2,%ecx
+ rep
+ movsl
+ movb %al,%cl
+ andb $3,%cl
+ rep
+ movsb
+
+3:
+ GET_CURPCB(%edx)
+ xorl %eax,%eax
+ popl %ebx
+ popl %edi
+ popl %esi
+ movl %eax,PCB_ONFAULT(%edx)
+ ret
+
+/* LINTSTUB: Ignore */
+NENTRY(i686_copy_efault)
+ movl $EFAULT,%eax
+
+/* LINTSTUB: Ignore */
+NENTRY(i686_copy_fault)
+ GET_CURPCB(%edx)
+ movl %eax,PCB_ONFAULT(%edx)
+ popl %ebx
+ popl %edi
+ popl %esi
+ ret
+#endif /* I686_CPU */

/* LINTSTUB: Ignore */
NENTRY(copy_efault)
Jun-Young
--
Bang Jun-Young

--------------------next---------------------
A few things:
* i686_copyout() is actually pretty important, because e.g.
we don't have zero-copy socket reads yet (only writes), so
a fast copy routine matters there.
* Same for i686_kcopy() - it's used in the NFS path, at least,
and could significantly improve performance there.
* i686_memcpy() - be careful, because you have the whole
"memcpy() is allowed in interrupts" thing. It's probably
not worth bothering with this one, because there's a
potential to spend a LOT of time saving/restoring FPU
context.
* Yes, only save/restore the FP state if npxproc != NULL.
In the MULTIPROCESSOR case, you also need to be careful
because you could get an IPI from another CPU requesting
the FP state, so you'll need to make sure to provide the
correct one!
In fact, it's probably best to save to the npxproc's PCB,
and restore it back from there, rather than the stack.
(Cuts down on potentially large stack usage, too.)
* You have to handle the fxsave/fxrstor case, i.e. if the CPU
has SSE/SSE2; a rough sketch of that distinction follows.
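
A minimal userland sketch of that last point, assuming gcc on x86; the
buffer sizes and the CPUID bit come from the architecture manuals, not
from NetBSD's pcb layout, and the function names are made up for
illustration:

/*
 * Pick fxsave when CPUID reports FXSR, otherwise fall back to fnsave.
 * fxsave needs a 512-byte, 16-byte-aligned area and also captures the
 * SSE registers; the legacy fnsave image is 108 bytes.
 */
#include <stdint.h>

static int
cpu_has_fxsr(void)
{
	uint32_t eax, ebx, ecx, edx;

	__asm__ __volatile__("cpuid"
	    : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
	    : "a" (1));
	return (edx >> 24) & 1;		/* CPUID.1:EDX bit 24 = FXSR */
}

void
save_fp_state(void *fx_area_512_aligned16, void *fnsave_area_108)
{
	if (cpu_has_fxsr())
		__asm__ __volatile__("fxsave (%0)"
		    : : "r" (fx_area_512_aligned16) : "memory");
	else
		__asm__ __volatile__("fnsave (%0)"
		    : : "r" (fnsave_area_108) : "memory");
}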

--------------------next---------------------
I've done some experiments on my slot-A athlon 700.
The libc memcpy is slow (on modern CPUs) because of the setup cost of
executing 'rep movs' instructions; in particular, the one used to copy
the remaining (0-3) bytes is expensive.
'rep movsl' only starts to win for copies over about 200 bytes
(at which point the MMX copy is still 50% faster).
For small blocks (probably the commonest?) I get:
addr1=0x804c000 addr2=0x804c080
memcpy 64B -- 16777216 loops
  aligned blocks
      libc memcpy                                        1.721654 s
      rep movsw                                          1.310823 s
      asm loop                                           1.000972 s
      MMX memcpy using MOVQ                              0.762467 s
      arjanv's MOVQ (with prefetch)                      0.905702 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.559139 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.556865 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        1.715516 s
      rep movsw                                          1.310894 s
      asm loop                                           1.000683 s
      MMX memcpy using MOVQ                              0.881484 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        1.996214 s
      rep movsw                                          1.619813 s
      asm loop                                           1.190194 s
      MMX memcpy using MOVQ                              1.024688 s
where the 'rep movsl' and 'asm loop' are:

#include <machine/asm.h>
ENTRY(memcpy_rep_movsl)
	pushl	%esi
	pushl	%edi
	movl	20(%esp),%ecx
	movl	12(%esp),%edi
	movl	16(%esp),%esi
	movl	%edi,%eax	/* return value */
	movl	%ecx,%edx
	cld			/* copy forwards. */
	shrl	$2,%ecx		/* copy by words */
	rep
	movsl
	testl	$3,%edx
	jne	1f
2:	popl	%edi
	popl	%esi
	ret
1:
	movl	%edx,%ecx
	andl	$3,%ecx		/* copy only the remaining 0-3 bytes */
	rep
	movsb
	jmp	2b

ENTRY(memcpy_words)
	pushl	%esi
	pushl	%edi
	movl	12(%esp),%edi
	movl	16(%esp),%esi
	movl	20(%esp),%ecx
	pushl	%ebp
	pushl	%ebx
	shrl	$4,%ecx
1:
	movl	0(%esi),%eax
	movl	4(%esi),%edx
	movl	8(%esi),%ebx
	movl	12(%esi),%ebp
	addl	$16,%esi
	subl	$1,%ecx
	movl	%eax,0(%edi)
	movl	%edx,4(%edi)
	movl	%ebx,8(%edi)
	movl	%ebp,12(%edi)
	leal	16(%edi),%edi
	jne	1b
	/* We ought to do the remainder here... */
	popl	%ebx
	popl	%ebp
	movl	12(%esp),%eax
	popl	%edi
	popl	%esi
	ret
David

