文章来源:
http://dev.bizo.com/2012/01/clustering-of-sparse-data-using-python.html
Coming from a Matlab background, I found sparse matrices to be easy to use and well integrated into the language. However, when transitioning to python’s scientific computing ecosystem, I had a harder time using sparse matrices. This post is intended to help Matlab refugees and those interested in using sparse matricies in python (in particular, for clustering)
Requirements:
-
scikit-learn (2.10+)
-
numpy (refer to scikit-learn version requirements)
-
scipy (refer to scikit-learn version requirements)
Sparse Matrix Types:
There are six types of sparse matrices implemented under scipy:*bsr_matrix -- block sparse row matrix
-
bsr_matrix -- block sparse row matrix
-
coo_matrix -- sparse matrix in coordinate format
-
csc_matrix -- compressed sparse column matrix
-
csr_matrix -- compressed sparse row matrix
-
dia_matrix -- sparse matrix with diagonal storage
-
dok_matrix -- dictionary of keys based sparse matrix
-
lil_matrix -- row-based linked list sparse matrix
For more info see: ()
When to use which matrix:
The following are scenarios when you would want to choose one sparse matrix type over the another:
-
Fast Arithmetic Operation: cscmatrix, csrmatrix
-
Fast Column Slicing (e.g., A[:, 1:2]): csc_matrix
-
Fast Row Slicing (e.g., A[1:2, :]) csr_matrix
-
Fast Matrix vector products: csrmatrix, bsrmatrix, csc_matrix
-
Fast Changing of sparsity (e.g., adding entries to matrix): lilmatrix, dokmatrix
-
Fast conversion to other sparse formats: coo_matrix
-
Constructing Large Sparse Matrices: coo_matrix
Clustering with scikit-learn:
With the release of scikit-learn 2.10, one of the useful new features is the support for sparse matrices with the k-means algorithm. The following is how you would use sparse matrices with k-means:
Full Matrix to Sparse Matrix
-
from numpy.random import random
-
from scipy.sparse import *
-
from sklearn.cluster import KMeans
-
-
# create a 30x1000 dense matrix random matrix.
-
D = random((30,1000))
-
# keep entries with value < 0.10 (10% of entries in matrix will be non-zero)
-
# X is a "full" matrix that is intrinsically sparse.
-
X = D*(D<0.10) # note: element wise mult
-
-
# convert D into a sparse matrix (type coo_matrix)
-
# note: we can initialize any type of sparse matrix.
-
# There is no particular motivation behind using
-
# coo_matrix for this example.
-
S = coo_matrix(X)
-
-
labeler = KMeans(k=3)
-
# convert coo to csr format
-
# note: Kmeans currently only works with CSR type sparse matrix
-
labeler.fit(S.tocsr())
-
-
# print cluster assignments for each row
-
for (row, label) in enumerate(labeler.labels_):
-
print "row %d has label %d"%(row, label)
One of the issues with Example-1 is that we are constructing a sparse matrix from a full matrix. It will often be the case that we will not be able to fit a full (although intrinsically sparse) matrix in memory. For example, if the matrix X was a 100000x1000000000 full matrix, there could be some issues. One solution to this is to somehow extract out the non-zero entries of X and to use a smarter constructor for the sparse matrix.
Sparse Matrix Construction
In Example-2, we will assume that we have X's data stored on some file on disk. In particular, we will assume that X is stored in a csv file and that we are able to extract out the non-zero data efficiently.
-
import csv
-
from scipy.sparse import *
-
from sklearn.cluster import KMeans
-
-
def extract_nonzero(fname):
-
"""
-
extracts nonzero entries from a csv file
-
input: fname (str) -- path to csv file
-
output: generator<(int, int, float)> -- generator
-
producing 3-tuple containing (row-index, column-index, data)
-
"""
-
for (rindex,row) in enumerate(csv.reader(open(fname))):
-
for (cindex, data) in enumerate(row):
-
if data!="0":
-
yield (rindex, cindex, float(data))
-
-
def get_dimensions(fname):
-
"""
-
determines the dimension of a csv file
-
input: fname (str) -- path to csv file
-
output: (nrows, ncols) -- tuple containing row x col data
-
"""
-
rowgen = (row for row in csv.reader(open(fname)))
-
# compute col size
-
colsize = len(rowgen.next())
-
# compute row size
-
rowsize = 1 + sum(1 for row in rowgen)
-
return (rowsize, colsize)
-
-
# obtain dimensions of data
-
(rdim, cdim) = get_dimensions("X.csv")
-
-
# allocate a lil_matrix of size (rdim by cdim)
-
# note: lil_matrix is used since we be modifying
-
# the matrix a lot.
-
S = lil_matrix((rdim, cdim))
-
-
# add data to S
-
for (i,j,d) in extract_nonzero("X.csv"):
-
S[i,j] = d
-
-
# perform clustering
-
labeler = KMeans(k=3)
-
# convert lil to csr format
-
# note: Kmeans currently only works with CSR type sparse matrix
-
labeler.fit(S.tocsr())
-
-
# print cluster assignments for each row
-
for (row, label) in enumerate(labeler.labels_):
-
print "row %d has label %d"%(row, label)
What to do when Sparse Matrices aren't supported:
When sparse matrices aren't supported, one solution is to convert the matrix to a full matrix. To do this, simply invoke the todense() method.
阅读(1178) | 评论(0) | 转发(0) |