首页　| 　博文目录　| 　关于我

博客访问： 2495184
博文数量： 392
博客积分： 7040
博客等级：少将
技术积分： 4138
用户组：普通用户
注册时间： 2009-06-17 13:03

个人简介

范德萨发而为

文章分类

全部博文（392）

nosql（1）
c/c++（7）
machine lea（67）
设计模式（1）
web架构（35）
关系型database（23）
distributed（11）
fuckingwindows（1）
SE（24）
life（9）
berkeleyDB（4）
beauty of math（3）
Java_study（11）
algorithm（77）
kernel（16）
hadoop（13）
programming（8）
network（9）
linux operation（14）
bash（12）
reading（5）
STL using（8）
intern（0）
job_hunter（29）
未分配的博文（4）

文章存档

2017年（5）

2016年（19）

2015年（34）

2014年（14）

2013年（47）

2012年（40）

2011年（51）

2010年（137）

2009年（45）

我的朋友

Evaluation of clustering

Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity (documents within a cluster are similar) and low inter-cluster similarity (documents from different clusters are dissimilar). This is an internal criterion for the quality of a clustering. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest. For search result clustering, we may want to measure the time it takes users to find an answer with different clustering algorithms. This is the most direct evaluation, but it is expensive, especially if large user studies are necessary.

As a surrogate for user judgments, we can use a set of classes in an evaluation benchmark or gold standard (see Section , page , and Section , page ). The gold standard is ideally produced by human judges with a good level of inter-judge agreement (see Chapter , page ). We can then compute an external criterion that evaluates how well the clustering matches the gold standard classes. For example, we may want to say that the optimal clustering of the search results for jaguar in Figure consists of three classes corresponding to the three senses car, animal, and operating system. In this type of evaluation, we only use the partition provided by the gold standard, not the class labels.

This section introduces four external criteria of clustering quality. Purity is a simple and transparent evaluation measure. Normalized mutual information can be information-theoretically interpreted. The Rand index penalizes both false positive and false negative decisions during clustering. The F measure in addition supports differential weighting of these two types of errors.

.begin{figure}
% latex2html id marker 24270
.par
.psset{unit=0.75cm}
.par
.begin...
...$,
3
(cluster 3).
Purity is $(1/17) .times (5+4+3)
.approx 0.71$.
}
.end{figure}

To compute purity , each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by . Formally:

$.begin{displaymath} .mbox{purity}( .Omega,.mathbb{C} ) = .frac{1}{N} .sum_k .max_j .vert.omega_k .cap c_j.vert .end{displaymath}$

(182)

where $.Omega = .{ .omega_1, .omega_2, .ldots, .omega_K .}$ is the set of clusters and $.mathbb{C} = .{ c_1,c_2,.ldots,c_J .}$ is the set of classes. We interpret

as the set of documents in

and

as the set of documents in

in Equation .

We present an example of how to compute purity in Figure . Bad clusterings have purity values close to 0, a perfect clustering has a purity of 1 . Purity is compared with the other three measures discussed in this chapter in Table .

Table 16.2: The four external evaluation measures applied to the clustering in Figure .

	purity	NMI	RI
lower bound	0.0	0.0	0.0	0.0
maximum	1	1	1	1
value for Figure	0.71	0.36	0.68	0.46

High purity is easy to achieve when the number of clusters is large - in particular, purity is 1 if each document gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.

A measure that allows us to make this tradeoff is normalized mutual information or NMI :

$.begin{displaymath} .mbox{NMI}(.Omega , .mathbb{C}) = .frac{ I(.Omega ; .mathbb{C}) } { [H(.Omega)+ H(.mathbb{C} )]/2 } .end{displaymath}$

(183)

is mutual information (cf. Chapter , page ):

		$.displaystyle .sum_k .sum_j P(.omega_k .cap c_j) .log .frac{P(.omega_k .cap c_j)}{P(.omega_k)P(c_j)}$	(184)
		$.displaystyle .sum_k .sum_j .frac{.vert.omega_k .cap c_j.vert}{N} .log .frac{N.vert.omega_k .cap c_j.vert}{.vert.omega_k.vert.vert c_j.vert}$	(185)

where

, and

are the probabilities of a document being in cluster

, class

, and in the intersection of

and

, respectively. Equation is equivalent to Equation for maximum likelihood estimates of the probabilities (i.e., the estimate of each probability is the corresponding relative frequency).

is entropy as defined in Chapter (page ):

			(186)
		$.displaystyle -.sum_k .frac{.vert.omega_k.vert}{N} .log .frac{.vert.omega_k.vert}{N}$	(187)

where, again, the second equation is based on maximum likelihood estimates of the probabilities.

in Equation measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are. The minimum of is 0 if the clustering is random with respect to class membership. In that case, knowing that a document is in a particular cluster does not give us any new information about what its class might be. Maximum mutual information is reached for a clustering $.Omega_{exact}$ that perfectly recreates the classes - but also if clusters in $.Omega_{exact}$ are further subdivided into smaller clusters (Exercise ). In particular, a clustering with one-document clusters has maximum MI. So MI has the same problem as purity: it does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better.

The normalization by the denominator in Equation fixes this problem since entropy tends to increase with the number of clusters. For example, reaches its maximum for , which

ensures that NMI is low for . Because NMI is normalized, we can use it to compare clusterings with different numbers of clusters. The particular form of the denominator is chosen because is a tight upper bound on (Exercise ). Thus, NMI is always a number between 0 and 1.

An alternative to this information-theoretic interpretation of clustering is to view it as a series of decisions, one for each of the pairs of documents in the collection. We want to assign two documents to the same cluster if and only if they are similar. A true positive (TP) decision assigns two similar documents to the same cluster, a true negative (TN) decision assigns two dissimilar documents to different clusters. There are two types of errors we can commit. A (FP) decision assigns two dissimilar documents to the same cluster. A (FN) decision assigns two similar documents to different clusters. The Rand index ( ) measures the percentage of decisions that are correct. That is, it is simply accuracy (Section , page 8.3 ).

$.begin{eqnarray*} .mbox{RI} = .frac{.mbox{TP}+.mbox{TN}}{.mbox{TP}+.mbox{FP}+.mbox{FN}+.mbox{TN}} .end{eqnarray*}$

As an example, we compute RI for Figure . We first compute . The three clusters contain 6, 6, and 5 points, respectively, so the total number of ``positives'' or pairs of documents that are in the same cluster is:

(188)

Of these, the x pairs in cluster 1, the o pairs in cluster 2, the

pairs in cluster 3, and the x pair in cluster 3 are true positives:

(189)

Thus,

and are computed similarly, resulting in the following contingency table:

Same cluster Different clusters

Same class

Different classes

is then

The Rand index gives equal weight to false positives and false negatives. Separating similar documents is sometimes worse than putting pairs of dissimilar documents in the same cluster. We can use the F measure measuresperf to penalize false negatives more strongly than false positives by selecting a value , thus giving more weight to recall.

$.begin{eqnarray*} P = .frac{.mbox{TP}}{.mbox{TP}+.mbox{FP}} .qquad R = .frac{.mb... ...+.mbox{FN}} .qquad F_{.beta} = .frac{(.beta^2+1)PR}{.beta^2 P+R} .end{eqnarray*}$

Based on the numbers in the contingency table,

and

. This gives us

for

and

for

. In information retrieval, evaluating clustering with

has the advantage that the measure is already familiar to the research community.

Exercises.

Replace every point in Figure with two identical copies of in the same class. (i) Is it less difficult, equally difficult or more difficult to cluster this set of 34 points as opposed to the 17 points in Figure ? (ii) Compute purity, NMI, RI, and for the clustering with 34 points. Which measures increase and which stay the same after doubling the number of points? (iii) Given your assessment in (i) and the results in (ii), which measures are best suited to compare the quality of the two clusterings?

Next:  Up:  Previous:        © 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the  of the book.
2009-04-07

阅读(6235) | 评论(0) | 转发(0) |

上一篇：active learning资料

下一篇：聚类相关资料

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6