Category: Python/Ruby
2012-05-11 20:15:36
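These are the first two days since the World Cup kicked off with no matches to watch, so with no football on I might as well write a blog post. The title is a bit of a gimmick: what follows is really just the item-based CF (collaborative filtering) recommendation algorithm implemented in R.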
    # Read the data; the raw data are (user, subject) "collected" pairs
    data = read.table('data.dat', sep=',', header=TRUE)

    # Map each user and subject to a row/column index
    user = unique(data$user_id)
    subject = unique(data$subject_id)
    uidx = match(data$user_id, user)
    iidx = match(data$subject_id, subject)

    # Build the user x subject collection matrix from the pairs
    M = matrix(0, length(user), length(subject))
    i = cbind(uidx, iidx)
    M[i] = 1

    # Normalize the column (subject) vectors; %*% is matrix multiplication
    mod = colSums(M^2)^0.5  # Euclidean norm of each column
    MM = M %*% diag(1/mod)  # multiplying by the diagonal matrix of 1/mod divides each column by its norm

    # crossprod computes t(MM) %*% MM, i.e. the inner products of the columns; S is the subject similarity matrix
    S = crossprod(MM)

    # user-subject recommendation scores
    R = M %*% S
    R = apply(R, 1, FUN=sort, decreasing=TRUE, index.return=TRUE)
    k = 5

    # Take the 5 subjects with the highest scores for each user
    res = lapply(R, FUN=function(r) return(subject[r$ix[1:k]]))

    # Write the output
    write.table(paste(user, res, sep=':'), file='result.dat', quote=FALSE, row.names=FALSE, col.names=FALSE)
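To see why the normalize-then-crossprod step yields cosine similarity, here is a minimal sketch on a made-up 3 x 3 matrix (the values are purely illustrative and are not taken from data.dat). It compares one entry of S against the cosine formula written out by hand.

```r
# Toy collection matrix: 3 users (rows) x 3 subjects (columns); values are made up
M <- matrix(c(1, 1, 0,
              1, 0, 1,
              0, 1, 1), nrow = 3, byrow = TRUE)

# Same normalization as in the script: divide each column by its Euclidean norm
mod <- sqrt(colSums(M^2))
MM <- M %*% diag(1 / mod)
S <- crossprod(MM)  # equivalent to t(MM) %*% MM

# Cosine similarity between subject columns 1 and 2, written out explicitly
cos12 <- sum(M[, 1] * M[, 2]) /
  (sqrt(sum(M[, 1]^2)) * sqrt(sum(M[, 2]^2)))

# Both checks should report TRUE: S[1, 2] matches the hand-written cosine,
# and every subject has similarity 1 with itself
print(isTRUE(all.equal(S[1, 2], cos12)))
print(isTRUE(all.equal(diag(S), rep(1, 3))))
```

Because every column of MM has unit length, the inner product of two columns is exactly their cosine, which is why a single crossprod call is enough.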
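Excluding comments, there are only 16 lines of actual code. The script leans heavily on vectorized functions and operations, so it contains no explicit loops at all. For a more detailed discussion of vectorization, see here.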
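As a small illustration of what vectorization buys here, the sketch below fills a collection matrix twice: once with an explicit loop and once by indexing with a two-column matrix of (row, column) pairs, which is the trick the script uses. The index vectors are hypothetical toy values, not the real data.

```r
# Hypothetical (user, subject) pairs, already converted to row/column indices
uidx <- c(1, 1, 2, 3, 3)
iidx <- c(1, 3, 2, 1, 2)

# Explicit loop: set one cell per iteration
M_loop <- matrix(0, 3, 3)
for (n in seq_along(uidx)) {
  M_loop[uidx[n], iidx[n]] <- 1
}

# Vectorized: a two-column index matrix addresses all (row, column) cells at once
M_vec <- matrix(0, 3, 3)
M_vec[cbind(uidx, iidx)] <- 1

# The two matrices should be identical
print(identical(M_loop, M_vec))
```

The vectorized form is shorter and hands the iteration to R's internals instead of the interpreter.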
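Note: this code implements only the most basic form of the algorithm and is for reference only; I make no promises about how practical it is on large-scale or complex data.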
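One thing the basic version does not do is skip subjects a user has already collected, so those can show up in the top k. The sketch below shows one possible way to mask them, on made-up toy data. The masking rule, the toy subject ids and the helper `top_k` are my own assumptions for illustration and are not part of the script above.

```r
# Toy data (made up): 3 users (rows) x 4 subjects (columns)
M <- matrix(c(1, 1, 0, 0,
              0, 1, 1, 0,
              1, 0, 1, 1), nrow = 3, byrow = TRUE)
subject <- c(101, 102, 103, 104)  # hypothetical subject ids

# Same similarity and score computation as in the script
mod <- sqrt(colSums(M^2))
MM <- M %*% diag(1 / mod)
S <- crossprod(MM)
R <- M %*% S

# Assumption: a subject the user already collected should not be recommended,
# so its score is pushed to -Inf before ranking
R[M == 1] <- -Inf

k <- 2
# Hypothetical helper: ids of the k best-scoring, not-yet-collected subjects
top_k <- function(scores) {
  ord <- order(scores, decreasing = TRUE)
  ord <- ord[is.finite(scores[ord])]  # drop masked (already collected) entries
  subject[head(ord, k)]
}

res <- lapply(seq_len(nrow(R)), function(u) top_k(R[u, ]))
print(res)
```

A user with fewer than k uncollected subjects simply gets a shorter list, rather than padding the result with items they already have.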
The contents of the source data file data.dat are listed below:
    user_id,subject_id
    1,1
    1,3
    1,7
    1,13
    2,2
    2,5
    2,6
    2,7
    2,9
    2,10
    2,11
    3,1
    3,2
    3,3
    3,4
    3,7
    3,9
    3,10
    5,13
    6,1
    6,3
    6,4
    6,5
    6,8
    6,10
    8,1
    8,2
    8,3
    8,5
    8,6
    8,7
    8,8
    9,13
    10,12
    11,2
    11,3
    11,4
    11,6
    11,8
    11,9
    11,13
    12,12
    13,3
    13,6
    13,7
    15,4
    15,12
    15,13
    16,2
    16,3
    16,4
    16,7
    16,8
    17,2
    17,3
    17,4
    17,5
    17,6
    17,7
    17,8
    17,9
    17,10
    17,11
    18,2
    18,3
    19,2
    19,3
    19,5
    19,6
    19,9
    19,10
    19,11
    19,12
    20,1
    20,3
    20,4
    20,7
    20,13
    21,1
    21,6
    21,8
    21,9
    21,11
    21,12
    21,13
    22,6
    23,2
    23,4
    23,9
    23,12
    24,1
    24,5
    24,9
    25,2
    25,6
    25,10
    25,11
    26,2
    26,3
    26,8
    27,3
    27,6
    27,12
    27,13
    28,1
    28,2
    28,3
    28,5
    28,7
    28,9
    28,10
    28,11
    28,12
    28,13
    29,1
    29,2
    29,3
    29,4
    29,5
    29,6
    29,7
    29,8
    29,9
    29,10
    30,6
    30,7
    30,9
    30,13
    31,6
    31,11
    32,1
    32,5
    33,2
    33,13
    34,3
    34,7
    34,8
    34,9
    34,10
    34,13
    35,3
    35,4
    35,5
    35,6
    35,7
    36,2
    36,3
    36,4
    36,6
    36,7
    36,8
    36,9
    36,11
    36,12
    36,13
    38,5
    41,1
    41,3
    41,4
    41,5
    41,6
    41,7
    41,11
    42,2
    42,3
    42,7
    42,8
    42,9
    42,10
    42,11
    43,2
    43,6
    43,10
    43,11
    43,12