Clustering of sparse data using Python with scikit-learn (laoliulaoliu, ChinaUnix blog)

Requirements:

scikit-learn (0.10+)
numpy (refer to scikit-learn version requirements)
scipy (refer to scikit-learn version requirements)

Sparse Matrix Types:

There are seven types of sparse matrices implemented under scipy:

bsr_matrix -- block sparse row matrix
coo_matrix -- sparse matrix in coordinate format
csc_matrix -- compressed sparse column matrix
csr_matrix -- compressed sparse row matrix
dia_matrix -- sparse matrix with diagonal storage
dok_matrix -- dictionary of keys based sparse matrix
lil_matrix -- row-based linked list sparse matrix

For more info, see the scipy.sparse documentation.

When to use which matrix:

The following are scenarios in which you would choose one sparse matrix type over another:

Fast Arithmetic Operations: csc_matrix, csr_matrix
Fast Column Slicing (e.g., A[:, 1:2]): csc_matrix
Fast Row Slicing (e.g., A[1:2, :]): csr_matrix
Fast Matrix-Vector Products: csr_matrix, bsr_matrix, csc_matrix
Fast Changing of Sparsity (e.g., adding entries to the matrix): lil_matrix, dok_matrix
Fast Conversion to Other Sparse Formats: coo_matrix
Constructing Large Sparse Matrices: coo_matrix
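
The scenarios above can be sketched with a small toy matrix. This is only an illustration under assumed data: the triplets in rows, cols and vals are made up, and the variable names are not from the original post. It builds the matrix in coo format, then converts it to whichever format suits the operation.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: three non-zero entries of a 4x5 matrix,
# given as (row, column, value) triplets.
rows = np.array([0, 1, 3])
cols = np.array([2, 0, 4])
vals = np.array([1.5, 2.0, 3.0])

# coo_matrix is convenient for construction from triplets.
A = coo_matrix((vals, (rows, cols)), shape=(4, 5))

# Convert to CSR for row slicing and matrix-vector products.
A_csr = A.tocsr()
first_two_rows = A_csr[0:2, :]
product = A_csr.dot(np.ones(5))

# Convert to CSC for column slicing.
A_csc = A.tocsc()
last_column = A_csc[:, 4:5]

# Convert to LIL when entries need to be added incrementally.
A_lil = A.tolil()
A_lil[2, 1] = 7.0
```

Converting once and then working in the target format is usually cheaper than repeatedly slicing or modifying a matrix in a format that is not suited to the operation.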

Clustering with scikit-learn:

With the release of scikit-learn 0.10, one of the useful new features is support for sparse matrices in the k-means algorithm. The following shows how to use sparse matrices with k-means:

Example-1: Full Matrix to Sparse Matrix

from numpy.random import random
from scipy.sparse import coo_matrix
from sklearn.cluster import KMeans

# create a 30x1000 dense random matrix.
D = random((30, 1000))
# keep entries with value < 0.10 (about 10% of entries in the matrix will be non-zero)
# X is a "full" matrix that is intrinsically sparse.
X = D * (D < 0.10)  # note: element-wise multiplication
# convert X into a sparse matrix (type coo_matrix)
# note: we can initialize any type of sparse matrix.
# There is no particular motivation behind using
# coo_matrix for this example.
S = coo_matrix(X)
# note: recent scikit-learn releases name this parameter n_clusters
# (the original post used k)
labeler = KMeans(n_clusters=3)
# convert coo to csr format
# note: KMeans works with CSR type sparse matrices
labeler.fit(S.tocsr())
# print cluster assignments for each row
for (row, label) in enumerate(labeler.labels_):
    print("row %d has label %d" % (row, label))

One of the issues with Example-1 is that we are constructing a sparse matrix from a full matrix. It will often be the case that a full (although intrinsically sparse) matrix does not fit in memory. For example, if X were a 100000x1000000000 full matrix of 8-byte floats, it would need about 8x10^14 bytes (roughly 800 TB). One solution is to extract only the non-zero entries of X and to use a smarter constructor for the sparse matrix.

In Example-2, we will assume that we have X's data stored on some file on disk. In particular, we will assume that X is stored in a csv file and that we are able to extract out the non-zero data efficiently.

import csv
from scipy.sparse import lil_matrix
from sklearn.cluster import KMeans

def extract_nonzero(fname):
    """
    extracts nonzero entries from a csv file
    input: fname (str) -- path to csv file
    output: generator<(int, int, float)> -- generator
            producing 3-tuple containing (row-index, column-index, data)
    """
    with open(fname, newline="") as f:
        for (rindex, row) in enumerate(csv.reader(f)):
            for (cindex, data) in enumerate(row):
                # compare numerically so that "0" and "0.0" are both skipped
                value = float(data)
                if value != 0.0:
                    yield (rindex, cindex, value)

def get_dimensions(fname):
    """
    determines the dimension of a csv file
    input: fname (str) -- path to csv file
    output: (nrows, ncols) -- tuple containing row x col data
    """
    with open(fname, newline="") as f:
        rowgen = csv.reader(f)
        # compute col size from the first row
        colsize = len(next(rowgen))
        # compute row size (the first row was already consumed)
        rowsize = 1 + sum(1 for row in rowgen)
    return (rowsize, colsize)

# obtain dimensions of data
(rdim, cdim) = get_dimensions("X.csv")
# allocate a lil_matrix of size (rdim by cdim)
# note: lil_matrix is used since we will be modifying
# the matrix a lot.
S = lil_matrix((rdim, cdim))
# add data to S
for (i, j, d) in extract_nonzero("X.csv"):
    S[i, j] = d
# perform clustering
# note: recent scikit-learn releases name this parameter n_clusters
# (the original post used k)
labeler = KMeans(n_clusters=3)
# convert lil to csr format
# note: KMeans works with CSR type sparse matrices
labeler.fit(S.tocsr())
# print cluster assignments for each row
for (row, label) in enumerate(labeler.labels_):
    print("row %d has label %d" % (row, label))
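
As a possible alternative to filling a lil_matrix entry by entry, the non-zero triplets can be collected first and handed to the coo_matrix constructor in one call, which matches the earlier note that coo_matrix suits construction of large sparse matrices. The sketch below is not from the original post: csv_to_coo is a hypothetical helper, and the in-memory sample text stands in for X.csv so the example runs on its own.

```python
import csv
import io
from scipy.sparse import coo_matrix

def csv_to_coo(lines):
    """Build a coo_matrix from an iterable of CSV text lines (hypothetical helper)."""
    rows, cols, vals = [], [], []
    nrows = ncols = 0
    for rindex, row in enumerate(csv.reader(lines)):
        nrows = rindex + 1
        ncols = max(ncols, len(row))
        for cindex, field in enumerate(row):
            value = float(field)
            if value != 0.0:
                # remember only the non-zero entries as triplets
                rows.append(rindex)
                cols.append(cindex)
                vals.append(value)
    # build the whole matrix from the triplets in a single call
    return coo_matrix((vals, (rows, cols)), shape=(nrows, ncols))

# Toy stand-in for X.csv; an open file object can be passed the same way.
sample = io.StringIO("0,1.5,0\n0,0,0\n2,0,3\n")
S = csv_to_coo(sample).tocsr()
```

This reads the file once and determines the dimensions along the way, so no separate pass for the row and column counts is needed. The resulting CSR matrix can then be passed to the fit method as in Example-2.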

What to do when Sparse Matrices aren't supported:

When sparse matrices aren't supported, one solution is to convert the matrix to a full matrix, provided the full matrix fits in memory. To do this, simply invoke the todense() method, or toarray() if a plain numpy array is wanted.
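
A minimal sketch of the conversion, using a made-up 2x2 matrix. The memory estimate assumes the dense result stores every entry at the sparse matrix's dtype size; est_bytes is an illustrative name, not part of any library.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small sparse matrix.
S = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

# Rough size of the dense result in bytes; worth checking before converting.
est_bytes = S.shape[0] * S.shape[1] * S.dtype.itemsize

# todense() typically returns a numpy.matrix for csr_matrix input,
# while toarray() returns a plain numpy.ndarray.
dense_matrix = S.todense()
dense_array = S.toarray()
```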
