coolshell有一篇讲K-Mean算法的博文,讲的非常好,研究生阶段学过这个算法。记得研究生阶段学过ISODATA算法,当时的家庭作业是这个算法的实现。K-Means比较简单,简单的讲,就是距离比较进的点,应该聚成一类。 K-Means的算法如下(陈皓大牛总结的):
- 随机在图中取K(这里K=2)个种子点。
- 然后对图中的所有点求到这K个种子点的距离,假如点Pi离种子点Si最近,那么Pi属于Si点群。(上图中,我们可以看到A,B属于上面的种子点,C,D,E属于下面中部的种子点)
- 接下来,我们要移动种子点到属于他的“点群”的中心。(见图上的第三步)
- 然后重复第2)和第3)步,直到,种子点没有移动(我们可以看到图中的第四步上面的种子点聚合了A,B,C,下面的种子点聚合了D,E)。
K-Means算法有两个缺点:
1 K 是指定的,给你一堆样本,其实是很难知道应该分成几类的,比如上图,明显是分成四类,但是如果K 取值5,那么结果肯定很蛋疼;
2 种子点的选择是随机的,如果初始点的选择不同,最后的结果就可能不同。
对于第二个缺点,有个K-Means++算法,可以解决,很有效的选择初始种子点。这个算法又称为Lloyd算法,算法步骤如下,也是陈皓总结的:
- 先从我们的数据库随机挑个随机点当“种子点”。
- 对于每个点,我们都计算其和最近的一个“种子点”的距离D(x)并保存在一个数组里,然后把这些距离加起来得到Sum(D(x))。
- 然后,再取一个随机值,用权重的方式来取计算下一个“种子点”。这个算法的实现是,先取一个能落在Sum(D(x))中的随机值Random,然后用Random -= D(x),直到其<=0,此时的点就是下一个“种子点”。
- 重复第(2)和第(3)步直到所有的K个种子点都被选出来。
- 进行K-Means算法。
我是一个NBA球迷,hoopchina是我常逛的论坛,最近林书豪的水平如何是大家口水的话题,我最初的想法是将NBA所有的后卫球员做个聚类分析,看下林书豪属于那个层次的球员。我去NBA官网取下了100名后卫球员的统计数据,格式如下:
- 球员 命中率 三分命中率 罚球命中率 篮板 助攻 抢断 盖帽 失误 犯规 得分
- 科比-布莱恩特 0.551 0.441 0.92 5.5 4.6 1.4 0 3.8 2.5 26.4
- 詹姆斯-哈登 0.441 0.24 0.823 4.6 4.5 1.4 0.8 4.1 2 26.4
- 拉塞尔-威斯布鲁克 0.388 0.279 0.754 4.9 8.5 1.3 0.1 3 2.9 19.6
- 克里斯-保罗 0.478 0.4 0.913 3.3 10.3 2.3 0 2.1 2.4 17
数据太多,我选择了篮板/助攻/得分三个维度来作为衡量后卫球员的指标,我用awk选择了我关心的指标,存成了文件,文件如下:- 科比-布莱恩特 5.5000 4.6000 26.4000
- 詹姆斯-哈登 4.6000 4.5000 26.4000
- 拉塞尔-威斯布鲁克 4.9000 8.5000 19.6000
- 克里斯-保罗 3.3000 10.3000 17.0000
- 尼古拉斯-巴通姆 6.4000 3.1000 19.0000
- Damian-Lillard 3.1000 6.6000 18.4000
- O.J.-梅奥 3.6000 2.9000 21.5000
- 。。。。。。
格式如下:姓名 篮板 助攻 得分。有一个问题是这样的,助攻和得分是不同的,一个球员10个助攻的难度,不低于得20分的,如果不将衡量指标归一化,聚类就会不公平,得分就会变成主要的指标,而篮板和助攻就会变成次要的指标。但是如何归一化?
我采取的算法,计算所有的球员的平均值,比如2个篮板 4个助攻 8分,那么我就会将篮板*4,主攻*2,得分*1,
这样来提高篮板的比重,提高助攻的比重。
下面我们开始看下代码,代码流程包括,加载数据---归一化数据-----K-Meas++聚类。 - #include<stdio.h>
- #include<stdlib.h>
- #include<unistd.h>
- #include<math.h>
- #define NORMALIZE
- #define FILEPATH "./data_main"
- #define MEASURE_MAX
- #define BUFSIZE 4096
- #define PLAYER_NUM 100
- #define NAME_MAX 256
- #define MEASURE_DIMENSION 3
- typedef enum
- {
- HIT_RATE =1,
- HIT_RATE_3P,
- HIT_RATE_FREE,
- BOARD,
- ASSIST,
- STEAL,
- BLOCK,
- TURNOVER,
- FAULT,
- SCORE
- };
- typedef struct player
- {
- char name[NAME_MAX];
- double measure[MEASURE_DIMENSION];
- int group;
- }player;
- double ratio[MEASURE_DIMENSION] = {0.0};
- double ratio_sqrt[MEASURE_DIMENSION] = {0.0};
- double randf(double m)
- {
- return m * rand() / (RAND_MAX - 1.);
- }
- struct player* load_data(const char* path,int player_num)
- {
- FILE* fp= NULL;
- char buf[BUFSIZE] = {0};
- struct player* players = (struct player*) malloc(sizeof(player)*player_num);
- if(players == NULL)
- {
- fprintf(stderr,"malloc failed for players\n");
- return NULL;
- }
- if(access(FILEPATH,R_OK) < 0)
- {
- fprintf(stderr,"can not find the file %s\n",FILEPATH);
- goto err_out;
- }
- fp = fopen(FILEPATH,"rb");
- if(fp == NULL)
- {
- fprintf(stderr,"open file(%s) failed\n",FILEPATH);
- goto err_out;
- }
- int player_index = 0;
- while(fgets(buf,BUFSIZE,fp))
- {
- char *delimit = "\t";
- char *save_ptr;
- char *token=strtok_r(buf,delimit,&save_ptr);
- int field_index = 0;
- while(token != NULL)
- {
- if(field_index == 0)
- {
- strncpy(players[player_index].name,token,NAME_MAX-1);
- players[player_index].name[NAME_MAX-1] = '\0';
- }
- else
- {
- players[player_index].measure[field_index-1] = atof(token);
- }
- token = strtok_r(NULL,delimit,&save_ptr);
- field_index++;
- }
- if(field_index != MEASURE_DIMENSION + 1 )
- {
- fprintf(stderr,"data file have err format ,exit\n" );
- goto err_out;
- }
- player_index++;
- if(player_index == player_num)
- {
- fprintf(stderr,"more than %d players existed in data file\n",player_num);
- break;
- }
- }
- fprintf(stderr,"%d player record got\n",player_index);
- return players;
- err_out:
- if(players)
- {
- free(players);
- return NULL;
- }
- }
- int calc_ratio(struct player* players,int player_num,double* ratio)
- {
- int i ;
- double average[MEASURE_DIMENSION] ;
- for (i = 0;i<MEASURE_DIMENSION;i++)
- {
- average[i] = 0.0;
- }
- int j = 0;
- for(i = 0;i<player_num;i++)
- {
- for(j = 0;j<MEASURE_DIMENSION;j++)
- {
- average[j] +=players[i].measure[j];
- }
- }
- for (i = 0;i<MEASURE_DIMENSION;i++)
- {
- average[i]/=player_num;
- ratio_sqrt[i] = 10/average[i];
- }
- for(i = 0; i<player_num;i++)
- {
- for(j = 0;j<MEASURE_DIMENSION;j++)
- {
- players[i].measure[j] = ratio_sqrt[j]*players[i].measure[j];
- }
- }
- return 0;
- }
- double distance(struct player* player_A,struct player* player_B)
- {
- int i = 0;
- double distance = 0.0;
- for(i = 0 ;i<MEASURE_DIMENSION;i++ )
- {
- distance +=pow( (player_A->measure[i] - player_B->measure[i]),2);
- }
- return distance;
- }
- int nearest(struct player* player, struct player* cent, int n_cluster, double *d2)
- {
- int i, min_i;
- struct player* c;
- double d, min_d;
- //for (c = cent, i = 0; i < n_cluster; i++, c++)
- {
- min_d = HUGE_VAL;
- min_i = player->group;
-
- for (c = cent, i = 0; i < n_cluster; i++, c++)
- {
- if (min_d > (d = distance(c, player))) {
- min_d = d; min_i = i;
- }
- }
- }
- if (d2) *d2 = min_d;
- return min_i;
- }
- void K_findseed(struct player* players, int player_num, struct player* cent, int n_cent)
- {
- int i, j;
- int n_cluster;
- double sum, *d = malloc(sizeof(double) * player_num);
- struct player* p;
- cent[0] = players[ rand() % player_num ];
- for (n_cluster = 1; n_cluster < n_cent; n_cluster++) {
- sum = 0;
-
- for (j = 0, p = players; j < player_num; j++, p++)
- {
- nearest(p, cent, n_cluster, d + j);
- sum += d[j];
- }
- sum = randf(sum);
- for (j = 0, p = players; j < player_num; j++, p++)
- {
- if ((sum -= d[j]) > 0) continue;
- cent[n_cluster] = players[j];
- break;
- }
- }
- for (j = 0, p = players; j < player_num; j++, p++)
- {
- p->group = nearest(p, cent, n_cluster, 0);
- }
- free(d);
- }
- int K_mean_plus(struct player* players,int player_num,int cluster_num)
- {
- struct player* center = malloc(sizeof(player)*cluster_num);
- struct player *p,*c;
- K_findseed(players,player_num,center,cluster_num);
- output_result(players,player_num,cluster_num);
- int i ,j ,min_i;
- int changed;
- do {
-
- for (c = center, i = 0; i < cluster_num; i++, c++)
- {
- c->group = 0;
- for(j = 0;j<MEASURE_DIMENSION;j++)
- {
- c->measure[j] = 0.0;
- }
- }
-
- for (j = 0, p = players; j < player_num; j++, p++)
- {
- c = center+p->group;
- c->group++;
- for(i = 0;i<MEASURE_DIMENSION;i++)
- {
- c->measure[i] += p->measure[i];
- }
- }
-
- for (c = center, i = 0; i < cluster_num; i++, c++)
- {
- for(j = 0;j<MEASURE_DIMENSION;j++)
- c->measure[j]/=c->group;
- }
- changed = 0;
- for (j = 0, p = players; j < player_num; j++, p++)
- {
- min_i = nearest(p, center, cluster_num, 0);
- if (min_i != p->group)
- {
- changed++;
- p->group = min_i;
- }
- }
- fprintf(stderr,"%d changed \n",changed);
- } while (changed > 2);
- for (c = center, i = 0; i < cluster_num; i++, c++)
- {
- fprintf(stderr,"\ncenter %d\n",i);
- for(j = 0;j<MEASURE_DIMENSION;j++)
- #ifdef NORMALIZE
- fprintf(stderr," %lf\t",c->measure[j]/ratio_sqrt[j]);
- #else
- fprintf(stderr," %lf\t",c->measure[j]);
- #endif
- }
- return 0;
- }
- int output_result(struct player* players,int player_num,int cluster_num)
- {
- int i ,j;
- char cmd[256] = {0};
- struct player *p =players;
- for(i =0 ; i< cluster_num;i++)
- {
- fprintf(stderr,"\nthe group %d\n",i);
- for(j=0,p=players;j<player_num;j++,p++)
- {
- if(p->group == i)
- {
- snprintf(cmd,256,"cat %s |sed -n \"%dp\"",FILEPATH,j+1);
- system(cmd);
- // fprintf(stderr,"%s\t",p->name);
- }
- }
- }
- fprintf(stderr,"\n");
- }
- int main()
- {
- struct player* players = load_data(FILEPATH,100);
- if(players == NULL)
- {
- fprintf(stderr,"load data failed\n");
- return -1;
- }
- int ret = 0;
- #ifdef NORMALIZE
- ret = calc_ratio(players,100,ratio);
- #endif
- if(ret < 0 )
- {
- fprintf(stderr,"calc ratio failed \n");
- return -2;
- }
-
- ret = K_mean_plus(players,100,8);
-
- ret = output_result(players,100,8);
- }
呵呵可以看下执行结果:
- ......
- center 0
- 4.187500 1.137500 8.687500
- center 1
- 2.982353 4.223529 8.600000
- center 2
- 6.900000 3.420000 13.860000
- center 3
- 4.360000 4.413333 19.300000
- center 4
- 1.650133 1.893333 8.166667
- center 5
- 4.075000 9.037500 14.775000
- center 6
- 2.794444 2.122222 13.694444
- center 7
- 2.857143 6.850000 14.557143
- 。。。。。。
看一下我们最后的聚类中心,center 3场均19.3分 4.4助攻这是得分狂人组织能力有限型的后卫,也就是说他们更像一个得分后卫。- the group 3
- 科比-布莱恩特 5.5000 4.6000 26.4000
- 詹姆斯-哈登 4.6000 4.5000 26.4000
- O.J.-梅奥 3.6000 2.9000 21.5000
- 斯蒂芬-库里 4.4000 5.4000 17.0000
- 凯里-欧文 4.1000 6.5000 24.3000
- 凯尔-洛瑞 5.8000 6.3000 18.3000
- 蒙塔-艾利斯 3.4000 5.4000 20.0000
- 肯巴-沃克 3.7000 5.1000 19.0000
- DeMar-DeRozan 5.1000 2.1000 20.0000
- 德怀恩-韦德 4.1000 4.9000 16.9000
- J.R.-史密斯 5.1000 3.0000 16.7000
- 雷蒙-塞申斯 3.4000 4.6000 16.3000
- 乔-约翰逊 3.9000 3.6000 16.0000
- 阿隆-阿夫拉罗 4.5000 2.4000 16.5000
- 乔治-希尔 4.2000 4.9000 14.2000
在分析下center 5,center5的聚类中心在9.03个助攻,场均得分是14.7分,这群后卫的特点是助攻狂人,得分能力不错。- center 5
- 4.075000 9.037500 14.775000
- the group 5
- 拉塞尔-威斯布鲁克 4.9000 8.5000 19.6000
- 克里斯-保罗 3.3000 10.3000 17.0000
- 托尼-帕克 3.3000 8.1000 14.3000
- 格雷维斯-瓦斯奎兹 3.9000 8.6000 11.6000
- 拉简-朗多 4.9000 12.6000 14.3000
- 朱-霍利迪 4.0000 8.6000 19.1000
- 贾米尔-尼尔森 4.0000 8.5000 11.0000
看下NBA官网的后卫助攻榜的名单,除了布兰顿詹宁斯助攻不逆天,篮板太少,被分到了group7。
- 布兰登-詹宁斯 3.1000 7.9000 16.9000
- center 5
- 4.075000 9.037500 14.775000
- center 7
- 2.857143 6.850000 14.557143
center 2也比较有意思,场均篮板6.9个,这些后卫简直就是中锋型后卫。我立刻就想到了拉简-朗多,但是很不幸,朗多的篮板没有想象的那么多,只有4.9个,因为助攻狂人被分到了group 5.
- center 2
- 6.900000 3.420000 13.860000
- the group 2
- 尼古拉斯-巴通姆 6.4000 3.1000 19.0000
- 安德鲁-伊格达拉 7.3000 4.0000 14.8000
- 泰瑞克-埃文斯 5.6000 3.4000 11.3000
- 保罗-乔治 7.6000 3.2000 13.5000
- 埃文-特纳 7.6000 3.4000 10.7000
本文似乎没有解决林书豪是什么层次的组织后卫的问题,但是主要原因是我们的原始数据太过庞杂,像科比这种球员就不应该纳入分类的范畴,因为他的数据不是典型的组织后卫。如果我能拿到联盟首发的组织后卫的数据,我们能分出3或者5个聚类,看到我们的林书豪在那个聚类中。
参考文献
1
2
阅读(986) | 评论(0) | 转发(0) |