分类:
2009-05-15 13:34:52
该文列出了几种方法来统计一个数的二进制格式里含有的1个数,蛮有意思的,记录一下
Here are C implementations of all the methods for counting 1 bits mentioned in that node. (Go read that first, if you haven't already.) All of the statistical information is purely anecdotal, but for what it's worth, it's based on my testing the code on a Pentium 3 and a Celeron 2, using the cl compiler of Microsoft Visual C++, and on a Sun Ultra 5, using gcc and Sun's own cc. For testing 64-bit code, I used __int64 on the Intel machines, and long long on the Sparc. It's worth noting that while Sun's compiler outputs faster executables than gcc, it doesn't change the relative performance of the different methods.
unsigned
or int
makes it a little slower (what with the extra casting and byte-loading the compiler is forced to add.)
unsigned numbits_lookup_table[256] = {
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2,
3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3,
3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3,
4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4,
3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5,
6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4,
4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5,
6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3,
4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6,
6, 7, 6, 7, 7, 8
};
unsigned numbits_lookup(unsigned i)
{
unsigned n;
n = numbits_lookup_table[i & 0xff];
n += numbits_lookup_table[i>>8 & 0xff];
n += numbits_lookup_table[i>>16 & 0xff];
n += numbits_lookup_table[i>>24 & 0xff];
return n;
}
unsigned numbits(unsigned int i)
{
unsigned int const MASK1 = 0x55555555;
unsigned int const MASK2 = 0x33333333;
unsigned int const MASK4 = 0x0f0f0f0f;
unsigned int const MASK8 = 0x00ff00ff;
unsigned int const MASK16 = 0x0000ffff;
i = (i&MASK1 ) + (i>>1 &MASK1 );
i = (i&MASK2 ) + (i>>2 &MASK2 );
i = (i&MASK4 ) + (i>>4 &MASK4 );
i = (i&MASK8 ) + (i>>8 &MASK8 );
i = (i&MASK16) + (i>>16&MASK16);
return i;
}
Some symbols that I'll use to represent what's going on:
0
: bits we know are zero from the previous stepo
: bits we know are zero due to masking-
: bits we know are zero due to shiftingX
: bits that might be 1 and we care about their valuesx
: bits that might be 1 but we don't care about their values
So a 0
plus a 0
is still a 0
, obviously; the tricky ones are the others, but they're not even so bad. 0
plus X
is X
, since if the X
is a 0
or a 1
, added to 0
it will pass through unchanged. However, X
plus X
is XX
, because the sum can range from 0
(0+0
), to 10
(1+1
). The same holds true with x
s, once those show up.
Step 1:
oXoXoXoXoXoXoXoXoXoXoXoXoXoXoXoXStep 2:
+ -XoXoXoXoXoXoXoXoXoXoXoXoXoXoXoX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ooXXooXXooXXooXXooXXooXXooXXooXXStep 3:
+ --XXooXXooXXooXXooXXooXXooXXooXX
0XXX0XXX0XXX0XXX0XXX0XXX0XXX0XXX
oooo0XXXoooo0XXXoooo0XXXoooo0XXXStep 4:
+ ----0XXXoooo0XXXoooo0XXXoooo0XXX
0000XXXX0000XXXX0000XXXX0000XXXX
oooooooo0000XXXXoooooooo0000XXXXStep 5:
+ --------0000XXXXoooooooo0000XXXX
00000000000XXXXX00000000000XXXXX
oooooooooooooooo00000000000XXXXXYou'll notice that the higher the step, the more known zeros (
+ ----------------00000000000XXXXX
00000000000000000000000000XXXXXX
0
) there are. call's suggestion was to change step 5 to this:ooooooooooooxxxx00000000000XXXXX(where "
+ ----------------00000000000XXXXX
000000000000xxxx0000000000XXXXXX
(mask) ooooooooooooooooooooooooooXXXXXX
(mask)
" means "after adding, apply a mask".)
However, you can go back even further and apply the same technique - all the way to step 3, in fact. The best I can think to optimize this changes the last three steps into the following: Step 3:
0xxx0XXX0xxx0XXX0xxx0XXX0xxx0XXXStep 4:
+ ----0XXX0xxx0XXX0xxx0XXX0xxx0XXX
0xxxXXXX0xxxXXXX0xxxXXXX0xxxXXXX
(mask) 0000XXXX0000XXXX0000XXXX0000XXXX
0000xxxx0000XXXX0000xxxx0000XXXXStep 5:
+ --------0000XXXX0000xxxx0000XXXX
0000xxxx000XXXXX000xxxxx000XXXXX
0000xxxx000xxxxx000xxxxx000XXXXXAnyway, that's all very lovely, but here's the C to do it:
+ ----------------000xxxxx000XXXXX
0000xxxx000xxxxx00xxxxxx00XXXXXX
(mask) ooooooooooooooooooooooooooXXXXXX
unsigned numbits(unsigned int i)The performance on this method is marginally worse than the lookup method in the 32 bit cases, slightly better than lookup on 64 bit Intel, and right about the same on 64 bit Sparc. Of note is the fact that loading one of these bitmasks into a register actually takes two instructions on RISC machines, and a longer-than-32-bit instruction on the Intel, because it's impossible to pack an instruction and 32 bits worth of data into a single 32 bit instruction. See the bottom of jamesc's writeup at MIPS for more details on that...
{
unsigned int const MASK1 = 0x55555555;
unsigned int const MASK2 = 0x33333333;
unsigned int const MASK4 = 0x0f0f0f0f;
unsigned int const MASK6 = 0x0000003f;
unsigned int const w = (v & MASK1) + ((v >> 1) & MASK1);
unsigned int const x = (w & MASK2) + ((w >> 2) & MASK2);
unsigned int const y = (x + (x >> 4) & MASK4);
unsigned int const z = (y + (y >> 8));
unsigned int const c = (z + (z >> 16)) & MASK6;
return c;
}
unsigned numbits(unsigned int i)This method is identical to the "Optimized Counters" method, with two tricks applied:
{
unsigned int const MASK1 = 0x55555555;
unsigned int const MASK2 = 0x33333333;
unsigned int const MASK4 = 0x0f0f0f0f;
unsigned int const w = v - ((v >> 1) & MASK1);
unsigned int const x = (w & MASK2) + ((w >> 2) & MASK2);
unsigned int const y = (x + (x >> 4) & MASK4);
unsigned int const c = (y * 0x01010101) >> 24;
return c;
}
00 - 0 = 0
, 01 - 0 = 01
, 10 - 1 = 01
, 11 - 1 = 10
unsigned numbits_subtraction(unsigned i)
{
unsigned n;
for(n=0; i; n++)
i &= i-1;
return n;
}
if (i & 1) n++;
, which uses a branch instruction, just add the actual value of the least-significant bit to the counter ( n += (i & 1);
), as it will be a 1 when you want to add 1, and 0 when you don't. (We're just twiddling bits
anyway, so why not?) This actually makes the processor do more adds,
but adding is fast, and branching is slow, on modern processors, so it
turns out to be about twice as fast. However, even "twice as fast" is
still four to five times slower than the lookup method, again, on all
architectures.
unsigned numbits(unsigned int i)Now, why does this all matter? It doesn't, really, but it was sure a good way to waste some time, and maybe someone learned some optimizing tricks from it... (Well, I did, actually - so I hope someone else did as well.)
{
unsigned n;
for(n=0; i; i >>= 1)
#ifndef MORE_OPTIMAL
if (i & 1) n++;
#else
n += (i & 1);
#endif
return n;
}
chinaunix网友2009-05-15 16:16:38
#include
chinaunix网友2009-05-15 14:36:04
n = ((n >> 1) & 0x55555555) | ((n << 1) & 0xaaaaaaaa); n = ((n >> 2) & 0x33333333) | ((n << 2) & 0xcccccccc); n = ((n >> 4) & 0x0f0f0f0f) | ((n << 4) & 0xf0f0f0f0); n = ((n >> 8) & 0x00ff00ff) | ((n << 8) & 0xff00ff00); n = ((n >> 16) & 0x0000ffff) | ((n << 16) & 0xffff0000); 此为反转位序的方法