~oz/hash.html
Hash Functions
A comprehensive collection of hash functions, a hash visualiser and some
test
results [see Mckenzie et al. Selecting a Hashing Algorithm,
SP&E
20(2):209-224, Feb 1990] will be available someday. If you just want to
have a good hash function, and cannot wait, djb2 is one of the
best
string hash functions i know. it has excellent distribution and speed on
many
different sets of keys and table sizes. you are not likely to do better
with
one of the "well known" functions such as PJW, K&R[1], etc.
Also see pp. 126
for
graphing hash functions.
this algorithm (k=33) was first reported by dan bernstein many years
ago in comp.lang.c. another version of this algorithm (now favored
by bernstein) uses xor:
hash(i) = hash(i - 1) * 33 ^ str[i];
the magic of number 33 (why it works better than many other
constants, prime or not) has never been adequately explained.
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
this algorithm was created for sdbm (a public-domain reimplementation of
ndbm)
database library. it was found to do well in scrambling bits, causing
better
distribution of the keys and fewer splits. it also happens to be a good
general hashing function with good distribution. the actual function is
hash(i) = hash(i - 1) * 65599 + str[i]; what is included below
is the
faster version used in gawk. [there is even a faster, duff-device
version] the
magic constant 65599 was picked out of thin air while experimenting with
different constants, and turns out to be a prime. this is one of the
algorithms used in berkeley db (see
This hash function appeared in K&R (1st ed) but at least the reader
was
warned: "
This is not the best possible algorithm, but it has the
merit of
extreme simplicity." This is an understatement; It is a
terrible
hashing algorithm, and it could have been much better without
sacrificing its
"extreme simplicity." [see the second edition!] Many C programmers
use this function without actually testing it, or checking something
like
Knuth's
Sorting and Searching, so it stuck. It is now found
mixed
with otherwise respectable code, eg. cnews. sigh. [see also:
]
unsigned long
hash(unsigned char *str)
{
unsigned int hash = 0;
int c;
while (c = *str++)
hash += c;
return hash;
}
---------------------------------------------------------
Hash Functions
As mentioned briefly in the previous section, there are multiple ways
for
constructing a function. Remember that hash
function takes the data as
input (often a string), and return s an integer in the range of possible
indices into the hash table. Every hash function must do that,
including
the bad ones. So what makes for a good hash function?
Characteristics of a Good Hash Function
There are four main characteristics of a good hash function:
1) The hash value is fully determined by the data being hashed.
2) The hash function uses all the input data.
3) The hash function "uniformly" distributes the data across the entire
set
of possible hash values.
4) The hash function generates very different hash values for similar
strings.
Let's examine why each of these is important:
Rule 1: If something else besides the input data is used to determine
the
hash, then the hash value is not as dependent upon the input data, thus
allowing for a worse distribution of the hash values.
Rule 2: If the hash function doesn't use all the input data, then slight
variations to the input data would cause an inappropriate number of
similar
hash values resulting in too many .
Rule 3: If the hash function does not uniformly distribute the data
across
the entire set of possible hash values, a large number of collisions
will
result, cutting down on the efficiency of the hash table.
Rule 4: In real world applications, many data sets contain very similar
data elements. We would like these data elements to still be
distributable
over a hash table.
So let's take as an example the hash function used in the last section:
int hash(char *str, int table_size)
{
int sum;
// Make sure a valid string passed in
if (str==NULL) return -1;
// Sum up all the characters in the string
for( ; *str; str++) sum += *str;
// Return the sum mod the table size
return sum % table_size;
}
Which rules does it break and satisfy?
Rule 1: Satisfies. The hash value is fully determined by the data being
hashed. The hash value is just the sum of all the input characters.
Rule 2: Satisfies. Every character is summed.
Rule 3: Breaks. From looking at it, it isn't obvious that it doesn't
uniformly distribute the strings, but if you were to analyze this
function
for a large input you would see certain statistical properties bad for a
hash function.
Rule 4: Breaks. Hash the string "bog". Now hash the string "gob".
They're
the same. Slight variations in the string should result in different
hash
values, but with this function they often don't.
So this hash function isn't so good. It's a good introductory example
but
not so good in the long run.
There are many possible ways to construct a better hash function (doing a
web search will turn up hundreds) so we won't cover too many here except
to present a few decent examples of hash functions:
/* Peter Weinberger's */
int hashpjw(char *s)
{
char *p;
unsigned int h, g;
h = 0;
for(p=s; *p!='\0'; p++){
h = (h<<4) + *p;
if (g = h&0xF0000000) {
h ^= g>>24;
h ^= g;
}
}
return h % 211;
}
Another one:
/* UNIX ELF hash
* Published hash algorithm used in the UNIX ELF format for object files
*/
unsigned long hash(char *name)
{
unsigned long h = 0, g;
while ( *name ) {
h = ( h << 4 ) + *name++;
if ( g = h & 0xF0000000 )
h ^= g >> 24;
h &= ~g;
}
return h;
}
or possibly:
/* This algorithm was created for the sdbm (a reimplementation of ndbm)
* database library and seems to work relatively well in scrambling bits
*/
static unsigned long sdbm(unsigned char *str)
{
unsigned long hash = 0;
int c;
while (c = *str++) hash = c + (hash << 6) + (hash << 16) - hash;
return hash;
}
or possibly:
/* djb2
* This algorithm was first reported by Dan Bernstein
* many years ago in comp.lang.c
*/
unsigned long hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++) hash = ((hash << 5) + hash) + c; // hash*33 + c
return hash;
}
or another:
char XORhash( char *key, int len)
{
char hash;
int i;
for (hash=0, i=0; i return (hash%101); /* 101 is prime */
}
You get the idea... there are many possible hash functions. For coding up
a hash function quickly, djb2 is usually a good candidate as it is easily
implemented and has relatively good statistical properties.