Category: IT industry

2016-12-29 16:04:26

  • K-means clustering. Goal: minimize the sum of the variances within all clusters
    • Within-cluster sum of squared errors (WCSS)
    • Fuzzy K-means
  • Hierarchical clustering
    • Agglomerative clustering
    • Divisive clustering
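
To make the K-means objective concrete, here is a minimal sketch of WCSS in plain Scala, without Spark. It assumes squared Euclidean distance and that each point is assigned to its nearest centroid. The names `Wcss`, `squaredDistance`, `nearest` and `wcss`, as well as the sample points, are made up for illustration and are not part of MLlib.

```scala
// Minimal, Spark-free sketch of the K-means objective (WCSS).
// All names and data here are illustrative assumptions, not MLlib API.
object Wcss {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to point p.
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

  // WCSS: sum over all points of the squared distance to the nearest centroid.
  def wcss(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => squaredDistance(p, centroids(nearest(p, centroids)))).sum
}

// Four toy points forming two obvious clusters, plus one centroid per cluster.
val points = Seq(Array(0.0, 0.0), Array(0.0, 2.0), Array(10.0, 0.0), Array(10.0, 2.0))
val centroids = Seq(Array(0.0, 1.0), Array(10.0, 1.0))
println(Wcss.wcss(points, centroids)) // each point is at squared distance 1.0, so this prints 4.0
```

K-means looks for the centroids that make this value as small as possible. Moving either centroid away from the middle of its cluster in the toy data increases the result.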

0 Runtime environment

cd $SPARK_HOME
bin/spark-shell --name my_mlib --packages org.jblas:jblas:1.2.4 --driver-memory 4G --executor-memory 4G --driver-cores 2

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.clustering.KMeans
import breeze.linalg._
import breeze.numerics.pow

1 Extracting features

val PATH = "/Users/erichan/sourcecode/book/Spark机器学习"
val movies = sc.textFile(PATH+"/ml-100k/u.item")
println(movies.first)

1|Toy Story (1995)|01-Jan-1995||

Extracting the labels

val genres = sc.textFile(PATH+"/ml-100k/u.genre")
genres.take(5).foreach(println) 

unknown|0
Action|1
Adventure|2
Animation|3
Children's|4

val genreMap = genres.filter(!_.isEmpty).map(line => line.split("\\|")).map(array => (array(1), array(0))).collectAsMap
println(genreMap)

Map(2 -> Adventure, 5 -> Comedy, 12 -> Musical, 15 -> Sci-Fi, 8 -> Drama, 18 -> Western, 7 -> Documentary, 17 -> War, 1 -> Action, 4 -> Children's, 11 -> Horror, 14 -> Romance, 6 -> Crime, 0 -> unknown, 9 -> Fantasy, 16 -> Thriller, 3 -> Animation, 10 -> Film-Noir, 13 -> Mystery)

val titlesAndGenres = movies.map(_.split("\\|")).map { array =>
    // Fields from index 5 onward are the 0/1 genre flags
    val genres = array.toSeq.slice(5, array.size)
    // Keep the positions flagged "1" and look up their genre names
    val genresAssigned = genres.zipWithIndex.filter { case (g, idx) =>
        g == "1"
    }.map { case (g, idx) =>
        genreMap(idx.toString)
    }
    (array(0).toInt, (array(1), genresAssigned))
}
println(titlesAndGenres.first)

(1,(Toy Story (1995),ArrayBuffer(Animation, Children's, Comedy)))
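
The decoding step can also be tried on a single line without Spark. The sketch below is only an illustration under stated assumptions: `GenreDecode`, `decode` and `firstFlagColumn` are hypothetical names, the genre table is cut down to the five entries printed earlier, and the sample record (including its placeholder URL) is made up rather than taken from `u.item`.

```scala
// Spark-free sketch of turning trailing 0/1 flags into genre names.
// GenreDecode, decode and firstFlagColumn are hypothetical names for illustration.
object GenreDecode {
  // Shortened genre table (assumption: only the first five u.genre entries).
  val genreMap: Map[String, String] = Map(
    "0" -> "unknown", "1" -> "Action", "2" -> "Adventure",
    "3" -> "Animation", "4" -> "Children's")

  // Parse one pipe-separated line into (id, (title, genre names)).
  def decode(line: String, firstFlagColumn: Int): (Int, (String, Seq[String])) = {
    val fields = line.split("\\|")
    // Everything from firstFlagColumn onward is treated as a genre flag.
    val flags = fields.toSeq.slice(firstFlagColumn, fields.length)
    // Keep the positions whose flag is "1" and map each position to its name.
    val names = flags.zipWithIndex.collect { case ("1", idx) => genreMap(idx.toString) }
    (fields(0).toInt, (fields(1), names))
  }
}

// Made-up record: id | title | release date | (empty) | URL | five genre flags
val sample = "1|Toy Story (1995)|01-Jan-1995||http://example.invalid|0|0|0|1|1"
println(GenreDecode.decode(sample, 5))
```

Here the flags at positions 3 and 4 are set, so the record is decoded to Animation and Children's. Using `collect` with a pattern is just a compact alternative to the `filter` followed by `map` used in the Spark version.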

Training the recommendation model

val rawData = sc.textFile(PATH+"/ml-100k/u.data")
val rawRatings = rawData.map(_.split("\t").take(3))
val ratings = rawRatings.map{ case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }
ratings.cache
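// ALS.train(ratings, rank, iterations, lambda): 50 latent factors, 10 iterations, regularization 0.1
val alsModel = ALS.train(ratings, 50, 10, 0.1)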

val movieFactors = alsModel.productFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
val movieVectors = movieFactors.map(_._2)
val userFactors = alsModel.userFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
val userVectors = userFactors.map(_._2) 

Normalization

