Chinaunix首页 | 论坛 | 博客
  • 博客访问: 246850
  • 博文数量: 61
  • 博客积分: 2510
  • 博客等级: 少校
  • 技术积分: 800
  • 用 户 组: 普通用户
  • 注册时间: 2008-02-14 15:18
文章分类

全部博文(61)

文章存档

2011年(4)

2010年(5)

2009年(10)

2008年(42)

我的朋友

分类: Mysql/postgreSQL

2008-05-13 18:23:45

Filed under: , — richard @ 9:14 am

Cardinality refers to how many unique values a particular column contains with respect to the number of rows in that table. This is useful to know when planning a new index or normalising a table.

Consider the following table of data:

user_id user_name department building
1 bob IT Head office
2 frank IT Head office
3 sam Marketing Head office
4 dave Marketing Head office
5 steve Marketing Head office
6 james Sales Head office
7 ruth Sales Head office
8 lisa Sales Head office
9 nigel Tech support Mobile
10 kelly Tech support Mobile

We will, for the purposes of the example, completely ignore the fact that this data is not normalised.

The ‘user_id’ column has ‘high’ cardinality - there are ten different values in it, out of a total of ten rows. This is also true of the ‘user_name’ column. The ‘department’ column has ‘normal’ cardinality - there are four different values in it, and the ‘building’ column has ‘low’ cardinality - there are only two different values in it.

There are no hard and fast rules about what constitutes high, normal or low cardinality - this largely depends on the data you’re working with, although a cardinality of 1 or 2 would always be considered low cardinality. For small tables with few rows, only having a cardinality of 3 might be considered a normal cardinality, as would a high number in a table with many rows.

Personally, I divide the number of unique values by the total number of rows to decide on the level of cardinality. Anything with a score of 0.8 or more I consider to be high cardinality, a score of 0.2 or less as low cardinality, and everything else being normal.

High cardinality tables are generally unavoidable, and there’s nothing wrong with them, unless you are duplicating data or have created an unnecessary primary key.

A column with very low cardinality may be a candidate for normalisation, especially if the column is textual in nature. In these cases, replacing the largely identical data held in the column with an integer reference to another table would definitely decrease the amount of storage required for the table (which can only improve query performance, even after the new lookup table has been joined), and it may also allow you to create new multi-column indexes which may further improve performance. Putting the data into another table will not affect the cardinality of the column - there are still the number of different values in the column, it’s just the format that has changed.

There are a few situations where indexing a high cardinality column makes sense, but normally indexes are made on normal and low cardinality columns. Indexes on low cardinality columns may be ignored by the query optimiser, but they be useful for specific queries.

阅读(834) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~