Category: IT industry

2015-10-09 14:04:41

Source: http://dev.bizo.com/2012/01/clustering-of-sparse-data-using-python.html

Coming from a Matlab background, I found sparse matrices easy to use and well integrated into the language. When I moved to Python's scientific computing ecosystem, however, I had a harder time using them. This post is intended to help Matlab refugees and anyone interested in using sparse matrices in Python (in particular, for clustering).

Requirements:

  • scikit-learn (0.10+)
  • numpy (refer to scikit-learn version requirements)
  • scipy (refer to scikit-learn version requirements)

Sparse Matrix Types:

There are seven types of sparse matrices implemented in scipy.sparse:

  • bsr_matrix -- block sparse row matrix
  • coo_matrix -- sparse matrix in coordinate format
  • csc_matrix -- compressed sparse column matrix
  • csr_matrix -- compressed sparse row matrix
  • dia_matrix -- sparse matrix with diagonal storage
  • dok_matrix -- dictionary of keys based sparse matrix
  • lil_matrix -- row-based linked list sparse matrix

For more info, see the scipy.sparse documentation.
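To make the storage formats a little more concrete, here is a minimal sketch that builds the same small matrix as a coo_matrix and as a csr_matrix and prints the arrays each format keeps internally. The 3x3 matrix is made up purely for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

# A small illustrative matrix (not from the original post).
dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 6]])

# COO keeps one (row, col, value) triplet per non-zero entry.
coo = coo_matrix(dense)
print(coo.row)   # [0 0 1 2 2 2]
print(coo.col)   # [0 2 2 0 1 2]
print(coo.data)  # [1 2 3 4 5 6]

# CSR keeps the values row by row; the entries of row i live in
# data[indptr[i]:indptr[i+1]], with their column numbers in indices.
csr = csr_matrix(dense)
print(csr.data)     # [1 2 3 4 5 6]
print(csr.indices)  # [0 2 2 0 1 2]
print(csr.indptr)   # [0 2 3 6]
```

Because CSR can locate a whole row from two indptr entries, row slicing and row-oriented products are cheap in that format; CSC is the same idea with rows and columns swapped.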

When to use which matrix:

The following are scenarios in which you would choose one sparse matrix type over another:

  • Fast Arithmetic Operations: csc_matrix, csr_matrix
  • Fast Column Slicing (e.g., A[:, 1:2]): csc_matrix
  • Fast Row Slicing (e.g., A[1:2, :]): csr_matrix
  • Fast Matrix-Vector Products: csr_matrix, bsr_matrix, csc_matrix
  • Fast Changing of Sparsity (e.g., adding entries to matrix): lil_matrix, dok_matrix
  • Fast conversion to other sparse formats: coo_matrix
  • Constructing Large Sparse Matrices: coo_matrix
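The list above can be sketched as a small workflow: build once in COO, then convert to whichever format suits the operation at hand. The triplet values and variable names below are made up for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Construct from (value, (row, col)) triplets; the data is illustrative only.
rows = np.array([0, 0, 1, 3])
cols = np.array([0, 2, 1, 3])
vals = np.array([1.0, 2.0, 3.0, 4.0])
A = coo_matrix((vals, (rows, cols)), shape=(4, 4))

# Convert to the format that matches the operation.
A_csr = A.tocsr()  # row slicing, arithmetic, matrix-vector products
A_csc = A.tocsc()  # column slicing
A_lil = A.tolil()  # changing the sparsity structure

first_row = A_csr[0:1, :]    # 1x4 sparse matrix
second_col = A_csc[:, 1:2]   # 4x1 sparse matrix

v = np.ones(4)
product = A_csr.dot(v)       # row sums: [3. 3. 0. 4.]

# Add a new entry in LIL, then convert back to CSR for computation.
A_lil[2, 2] = 5.0
A_updated = A_lil.tocsr()
print(A_updated.nnz)         # 5
```

Conversions are not free, so the usual pattern is to do all the structural edits in one format and convert once before the numerical work.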

Clustering with scikit-learn:

With the release of scikit-learn 0.10, one useful new feature is support for sparse matrices in the k-means algorithm. The following shows how to use sparse matrices with k-means:

Example-1: Full Matrix to Sparse Matrix


    from numpy.random import random
    from scipy.sparse import coo_matrix
    from sklearn.cluster import KMeans

    # create a 30x1000 dense random matrix.
    D = random((30, 1000))
    # keep entries with value < 0.10 (about 10% of entries will be non-zero)
    # X is a "full" matrix that is intrinsically sparse.
    X = D * (D < 0.10)  # note: element-wise multiplication

    # convert X into a sparse matrix (type coo_matrix)
    # note: we can initialize any type of sparse matrix.
    # There is no particular motivation behind using
    # coo_matrix for this example.
    S = coo_matrix(X)

    # note: early scikit-learn releases named this parameter k
    labeler = KMeans(n_clusters=3)
    # convert coo to csr format
    # note: at the time of writing, KMeans only worked with CSR sparse matrices
    labeler.fit(S.tocsr())

    # print cluster assignments for each row
    for (row, label) in enumerate(labeler.labels_):
        print("row %d has label %d" % (row, label))

One of the issues with Example-1 is that we construct the sparse matrix from a full matrix. Often we will not be able to fit a full (although intrinsically sparse) matrix in memory. For example, if X were a 100000x1000000000 full matrix, it would hold 10^14 entries, far more than a single machine can store. One solution is to extract only the non-zero entries of X and pass them to a smarter constructor for the sparse matrix.
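The memory argument can be checked with a rough back-of-the-envelope sketch. The 1000x10000 shape and 1% density below are assumptions chosen so the example runs quickly; the sparse estimate counts only the three arrays a CSR matrix stores.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Illustrative size and density (assumptions, not from the original post).
rows, cols, density = 1000, 10000, 0.01
S = sparse_random(rows, cols, density=density, format="csr", random_state=0)

# A dense float64 array needs 8 bytes for every entry, zero or not.
dense_bytes = rows * cols * np.dtype(np.float64).itemsize

# CSR needs space only for the non-zero values plus their index arrays.
sparse_bytes = S.data.nbytes + S.indices.nbytes + S.indptr.nbytes

print("dense : %d bytes" % dense_bytes)
print("sparse: %d bytes" % sparse_bytes)

# The 100000 x 1000000000 matrix from the text, stored as float64:
huge_bytes = 100000 * 1000000000 * 8
print("huge dense matrix: %.0f TB" % (huge_bytes / 1e12))  # 800 TB
```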

Example-2: Sparse Matrix Construction

In Example-2, we will assume that X's data is stored in a file on disk. In particular, we will assume that X is stored in a CSV file and that we can extract the non-zero data efficiently.

    import csv
    from scipy.sparse import lil_matrix
    from sklearn.cluster import KMeans

    def extract_nonzero(fname):
        """
        extracts nonzero entries from a csv file
        input: fname (str) -- path to csv file
        output: generator<(int, int, float)> -- generator
                producing 3-tuple containing (row-index, column-index, data)
        """
        with open(fname, newline="") as f:
            for (rindex, row) in enumerate(csv.reader(f)):
                for (cindex, data) in enumerate(row):
                    # compare numerically so that "0", "0.0" and " 0" are all skipped
                    value = float(data)
                    if value != 0.0:
                        yield (rindex, cindex, value)

    def get_dimensions(fname):
        """
        determines the dimension of a csv file
        input: fname (str) -- path to csv file
        output: (nrows, ncols) -- tuple containing row x col data
        """
        with open(fname, newline="") as f:
            rowgen = csv.reader(f)
            # compute col size from the first row
            colsize = len(next(rowgen))
            # compute row size (the first row was already consumed)
            rowsize = 1 + sum(1 for row in rowgen)
        return (rowsize, colsize)

    # obtain dimensions of data
    (rdim, cdim) = get_dimensions("X.csv")

    # allocate a lil_matrix of size (rdim by cdim)
    # note: lil_matrix is used since we will be modifying
    # the matrix a lot.
    S = lil_matrix((rdim, cdim))

    # add data to S
    for (i, j, d) in extract_nonzero("X.csv"):
        S[i, j] = d

    # perform clustering
    # note: early scikit-learn releases named this parameter k
    labeler = KMeans(n_clusters=3)
    # convert lil to csr format
    # note: at the time of writing, KMeans only worked with CSR sparse matrices
    labeler.fit(S.tocsr())

    # print cluster assignments for each row
    for (row, label) in enumerate(labeler.labels_):
        print("row %d has label %d" % (row, label))
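As a possible variation on Example-2, the non-zero entries can be collected as (row, column, value) triplets and handed to the coo_matrix constructor in one go, which matches the earlier advice that coo_matrix suits constructing large sparse matrices. This is only a sketch: the in-memory csv_text below is a hypothetical stand-in for X.csv so that the snippet is self-contained, and the dimensions are inferred during the same pass instead of a separate scan.

```python
import csv
import io
from scipy.sparse import coo_matrix

# Hypothetical stand-in for the contents of X.csv.
csv_text = "0,1.5,0\n0,0,0\n2.0,0,3.5\n"

rows, cols, vals = [], [], []
nrows = ncols = 0
for rindex, row in enumerate(csv.reader(io.StringIO(csv_text))):
    nrows = rindex + 1
    ncols = max(ncols, len(row))
    for cindex, field in enumerate(row):
        value = float(field)
        if value != 0.0:
            # record only the non-zero entries as triplets
            rows.append(rindex)
            cols.append(cindex)
            vals.append(value)

# Build the matrix from the triplets, then convert to CSR for clustering.
S = coo_matrix((vals, (rows, cols)), shape=(nrows, ncols)).tocsr()
print(S.shape, S.nnz)  # (3, 3) 3
print(S.toarray())
```

The resulting S can be passed to the k-means fit call exactly as in Example-2. The trade-off is that the triplet lists live in memory until the matrix is built, which is usually acceptable because they grow with the number of non-zeros, not with the full matrix size.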

What to do when Sparse Matrices aren't supported:

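When sparse matrices aren't supported, one solution is to convert the matrix to a full matrix, provided the full matrix fits in memory. To do this, simply invoke the todense() method (or toarray() if you want a plain NumPy array rather than a numpy.matrix).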
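A minimal sketch of the conversion, using a tiny made-up matrix. Estimating the dense size first is a precaution I would suggest, not something the original post prescribes.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny illustrative matrix; real data would be far larger.
S = csr_matrix(np.array([[0.0, 1.5],
                         [2.0, 0.0]]))

# Estimate how much memory the dense version will need before converting.
dense_bytes = S.shape[0] * S.shape[1] * S.dtype.itemsize
print("dense size: %d bytes" % dense_bytes)  # 32 bytes

M = S.todense()  # numpy.matrix
A = S.toarray()  # numpy.ndarray

print(type(M).__name__)  # matrix
print(type(A).__name__)  # ndarray
```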
