About Index Cardinality-r_a-ChinaUnix博客

Filed under: , — richard @ 9:14 am

Cardinality refers to how many unique values a particular column contains with respect to the number of rows in that table. This is useful to know when planning a new index or normalising a table.

Consider the following table of data:

user_id	user_name	department	building
1	bob	IT	Head office
2	frank	IT	Head office
3	sam	Marketing	Head office
4	dave	Marketing	Head office
5	steve	Marketing	Head office
6	james	Sales	Head office
7	ruth	Sales	Head office
8	lisa	Sales	Head office
9	nigel	Tech support	Mobile
10	kelly	Tech support	Mobile

We will, for the purposes of the example, completely ignore the fact that this data is not normalised.

The ‘user_id’ column has ‘high’ cardinality - there are ten different values in it, out of a total of ten rows. This is also true of the ‘user_name’ column. The ‘department’ column has ‘normal’ cardinality - there are four different values in it, and the ‘building’ column has ‘low’ cardinality - there are only two different values in it.

There are no hard and fast rules about what constitutes high, normal or low cardinality - this largely depends on the data you’re working with, although a cardinality of 1 or 2 would always be considered low cardinality. For small tables with few rows, only having a cardinality of 3 might be considered a normal cardinality, as would a high number in a table with many rows.

Personally, I divide the number of unique values by the total number of rows to decide on the level of cardinality. Anything with a score of 0.8 or more I consider to be high cardinality, a score of 0.2 or less as low cardinality, and everything else being normal.

High cardinality tables are generally unavoidable, and there’s nothing wrong with them, unless you are duplicating data or have created an unnecessary primary key.

A column with very low cardinality may be a candidate for normalisation, especially if the column is textual in nature. In these cases, replacing the largely identical data held in the column with an integer reference to another table would definitely decrease the amount of storage required for the table (which can only improve query performance, even after the new lookup table has been joined), and it may also allow you to create new multi-column indexes which may further improve performance. Putting the data into another table will not affect the cardinality of the column - there are still the number of different values in the column, it’s just the format that has changed.

There are a few situations where indexing a high cardinality column makes sense, but normally indexes are made on normal and low cardinality columns. Indexes on low cardinality columns may be ignored by the query optimiser, but they be useful for specific queries.