Machine Learning for Programmers: An Introduction to the SVM Algorithm - laoliulaoliu - ChinaUnix Blog

Machine Learning for Programmers: An Introduction to the SVM Algorithm

Category: Cloud Computing

2014-06-30 23:35:05

Source:

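Abstract: The support vector machine (SVM) has become a very popular algorithm. This article explains how an SVM works and gives several examples using the Python Scikits library. SVM is a supervised machine learning algorithm that can be used for both classification and regression. It uses a technique called the kernel trick to transform the data and then, based on those transformations, finds an optimal boundary between the possible outputs.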

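[CSDN report] The support vector machine (SVM) has become a very popular algorithm. In this article, Greg Lamp briefly explains how it works and gives several examples using the Python Scikits library. All of the code is available on GitHub, and Greg Lamp plans to cover the details of using Scikits and Sklearn in more depth in later posts. CSDN has compiled and edited this technical article: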

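What is SVM?

SVM is a supervised machine learning algorithm that can be used for both classification and regression. It uses a technique called the kernel trick to transform the data and then, based on those transformations, finds an optimal boundary between the possible outputs. Put simply, it performs some very complex data transformations and then works out how to separate your data according to the labels or outputs you have defined.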
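To make the fit-then-predict workflow concrete, here is a minimal sketch. The six training points and the two query points are made up for illustration and do not come from the article. Only `svm.SVC`, `fit` and `predict` are the real scikit-learn API.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data: two well-separated clusters in 2-D.
X = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # the predefined labels

clf = svm.SVC()   # default settings use the non-linear RBF kernel
clf.fit(X, y)     # learn a boundary between the two labels

# Classify two unseen points, one near each cluster.
new_points = np.array([[0.2, 0.3], [4.8, 4.4]])
predictions = clf.predict(new_points)
print(predictions)
```

Each new point should receive the label of the cluster it sits in. For regression, scikit-learn offers `svm.SVR` with the same `fit`/`predict` interface.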

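What makes it so powerful?

SVM can handle both classification and regression. In this article Greg Lamp focuses on classification, and in particular on non-linear SVM, that is, SVM with a non-linear kernel. Non-linear SVM means that the boundary the algorithm computes does not have to be a straight line. The benefit is that it can capture much more complex relationships between data points without the user having to perform difficult transformations by hand. The drawback is that training takes much longer, because it is far more computationally intensive.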
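The difference between a straight-line boundary and a non-linear one can be sketched on synthetic data. The example below is an illustration rather than part of the original article: it assumes scikit-learn's `make_circles` generator (one class forms a ring around the other) and compares `SVC` with a linear kernel against `SVC` with the RBF kernel.

```python
from sklearn import svm
from sklearn.datasets import make_circles

# Synthetic data (an assumption for this sketch): an inner blob of one class
# surrounded by a ring of the other, so no straight line can separate them.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_clf = svm.SVC(kernel="linear").fit(X, y)  # straight-line boundary
rbf_clf = svm.SVC(kernel="rbf").fit(X, y)        # non-linear boundary

linear_acc = linear_clf.score(X, y)
rbf_acc = rbf_clf.score(X, y)
print("linear kernel training accuracy: %.2f" % linear_acc)
print("RBF kernel training accuracy:    %.2f" % rbf_acc)
```

On data like this the RBF kernel should separate the classes almost perfectly, while the linear kernel cannot do much better than guessing.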

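What is the kernel trick?

The kernel trick transforms the data you give it: features that you think make for a good classifier go in, and out comes data you no longer recognize. It is a bit like unraveling a strand of DNA. You start with a harmless-looking vector of data, pass it through the kernel trick, and it unravels and compounds itself until it becomes a much larger set of data that is very hard to interpret by eye. This is where the magic lies: in the expanded data set the classes have a more obvious boundary, so the SVM algorithm can compute a much better separating hyperplane. In practice the expanded data is never built explicitly; the kernel function computes the inner products the SVM needs directly from the original vectors.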
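The idea of an "expanded" data set can be checked numerically. The sketch below is illustrative and not from the article. It uses a degree-2 polynomial kernel rather than the RBF kernel that `SVC` uses by default, because its expanded feature space is small enough to write out. The helper names `phi` and `poly_kernel` are hypothetical.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return float(np.dot(a, b)) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Inner product computed in the expanded 3-D space ...
explicit = float(np.dot(phi(a), phi(b)))
# ... and the same quantity computed directly from the 2-D inputs.
implicit = poly_kernel(a, b)

print(explicit, implicit)
```

Both values agree, which is the point of the trick: the SVM only ever needs inner products, so it can work in the larger space without constructing it.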

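Next, suppose you are a farmer with a problem: you need to build a fence to protect your cows from wolves. But where should the fence go? If you are a data-driven farmer, you can build a "classifier" from the positions of the cows and wolves in your pasture. Comparing several different classifiers (the code below plots each one), we can see that SVM does an excellent job of separating the cows from the wolves. Greg Lamp thinks this story nicely illustrates the advantage of a non-linear classifier: the logistic regression and decision tree models both draw their boundaries with straight lines.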

The code is as follows (farmer.py, Python):

    import numpy as np
    import pylab as pl
    import pandas as pd
    from sklearn import svm
    from sklearn import linear_model
    from sklearn import tree


    def plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr):
        x_min, x_max = df.x.min() - .5, df.x.max() + .5
        y_min, y_max = df.y.min() - .5, df.y.max() + .5

        # step between points, i.e. [0, 0.02, 0.04, ...]
        step = .02
        # to plot the boundary, we're going to create a matrix of every possible point
        # then label each point as a wolf or cow using our classifier
        xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                             np.arange(y_min, y_max, step))
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        # this gets our predictions back into a matrix
        ZZ = Z.reshape(xx.shape)

        # create a subplot (we're going to have more than 1 plot on a given image)
        pl.subplot(2, 2, plt_nmbr)
        # plot the boundaries (using the reshaped matrix ZZ, not the flat Z)
        pl.pcolormesh(xx, yy, ZZ, cmap=pl.cm.Paired)

        # plot the wolves and cows
        for animal in df.animal.unique():
            pl.scatter(df[df.animal == animal].x,
                       df[df.animal == animal].y,
                       marker=animal,
                       label="cows" if animal == "x" else "wolves",
                       color='black')
        pl.title(clf_name)
        pl.legend(loc="best")


    # read the pasture map: a tab-separated grid of 'x' and 'o' characters
    with open("cows_and_wolves.txt") as f:
        data = f.read()
    data = [row.split('\t') for row in data.strip().split('\n')]

    animals = []
    for y, row in enumerate(data):
        for x, item in enumerate(row):
            # x's are cows, o's are wolves
            if item in ['o', 'x']:
                animals.append([x, y, item])

    df = pd.DataFrame(animals, columns=["x", "y", "animal"])
    df['animal_type'] = df.animal.apply(lambda x: 0 if x == "x" else 1)

    # train using the x and y position coordinates
    train_cols = ["x", "y"]

    clfs = {
        "SVM": svm.SVC(),
        "Logistic": linear_model.LogisticRegression(),
        "Decision Tree": tree.DecisionTreeClassifier(),
    }

    plt_nmbr = 1
    for clf_name, clf in clfs.items():
        # fit on plain arrays so that predicting on the plotting grid matches
        clf.fit(df[train_cols].values, df.animal_type)
        plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr)
        plt_nmbr += 1
    pl.show()

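Let SVM do the hard work!

When the relationship between the independent and dependent variables is non-linear, other models struggle to come close to the accuracy of SVM. If that is still hard to picture, consider the following example. Suppose we have a data set of green and red points. Plotted by their coordinates, the points form a specific shape: a red circle surrounded by green (it looks like the flag of Bangladesh). Now suppose that for some reason we lose 1/3 of the data set. When we recover it, we want a way to reconstruct the missing 1/3 as faithfully as possible.

So how do we guess what the missing 1/3 most likely looks like? One approach is to build a model, using the remaining 2/3 of the data as a "training set" (in the code below, the 10,000 points with x < 10 out of 15,000). Greg Lamp tried three different models:

Logistic model (GLM)
Decision tree model (DT)
SVM

Greg Lamp trained each model and then used it to predict the missing 1/3 of the data set. Let's look at what each model predicted: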

The code is as follows (svmflag.py, Python):

    import numpy as np
    import pylab as pl
    import pandas as pd

    from sklearn import svm
    from sklearn import linear_model
    from sklearn import tree

    from sklearn.metrics import confusion_matrix

    x_min, x_max = 0, 15
    y_min, y_max = 0, 10
    step = .1
    # create a grid containing every point on the flag
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))

    df = pd.DataFrame(data={'x': xx.ravel(), 'y': yy.ravel()})

    # squared distance from the centre of the flag decides the color
    df['color_gauge'] = (df.x - 7.5)**2 + (df.y - 5)**2
    df['color'] = df.color_gauge.apply(lambda x: "red" if x <= 15 else "green")
    df['color_as_int'] = df.color.apply(lambda x: 0 if x == "red" else 1)

    print("Points on flag:")
    print(df.groupby('color').size())
    print()

    figure = 1

    # plot a figure for the entire dataset
    for color in df.color.unique():
        idx = df.color == color
        pl.subplot(2, 2, figure)
        pl.scatter(df[idx].x, df[idx].y, color=color)
        pl.title('Actual')

    train_idx = df.x < 10

    train = df[train_idx]
    # ~ inverts the boolean mask; copy so new columns can be added safely
    test = df[~train_idx].copy()

    print("Training Set Size: %d" % len(train))
    print("Test Set Size: %d" % len(test))

    # train using the x and y position coordinates
    cols = ["x", "y"]

    clfs = {
        # the original passed degree=0.5; degree only applies to kernel="poly"
        # and has no effect on the default RBF kernel, so it is omitted here
        "SVM": svm.SVC(),
        "Logistic": linear_model.LogisticRegression(),
        "Decision Tree": tree.DecisionTreeClassifier()
    }

    # racehorse different classifiers and plot the results
    for clf_name, clf in clfs.items():
        figure += 1

        # train the classifier
        clf.fit(train[cols], train.color_as_int)

        # get the predicted values from the test set
        test['predicted_color_as_int'] = clf.predict(test[cols])
        test['pred_color'] = test.predicted_color_as_int.apply(lambda x: "red" if x == 0 else "green")

        # create a new subplot on the plot
        pl.subplot(2, 2, figure)
        # plot each predicted color
        for color in test.pred_color.unique():
            # plot only rows where pred_color is equal to color
            idx = test.pred_color == color
            pl.scatter(test[idx].x, test[idx].y, color=color)

        # plot the training set as well
        for color in train.color.unique():
            idx = train.color == color
            pl.scatter(train[idx].x, train[idx].y, color=color)

        # add a dotted line to show the boundary between the training and test set
        # (everything to the right of the line is in the test set)
        # this plots a vertical line
        train_line_y = np.linspace(y_min, y_max)  # evenly spaced array from 0 to 10
        train_line_x = np.repeat(10, len(train_line_y))  # repeat 10 (threshold for training set) n times
        # add a black, dotted line to the subplot
        pl.plot(train_line_x, train_line_y, 'k--')

        pl.title(clf_name)

        print("Confusion Matrix for %s:" % clf_name)
        print(confusion_matrix(test.color, test.pred_color))
    pl.show()

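Conclusion:

Judging by these results, SVM is the clear winner in this experiment. To see why, look at the DT and GLM models: both clearly use straight-line boundaries. The input to the models did not include any transformations to account for the non-linear relationship between x, y, and the color. Given a specific set of transformations, Greg Lamp could certainly have made the GLM and DT models perform better, but why waste the time? With no complex transformations or scaling, SVM misclassified only 117 of the 5,000 test points (about 98% accuracy, compared with 51% for the DT model and only 12% for the GLM model!).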

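Where are the limitations?

Many people ask: if SVM is so powerful, why not use it for everything? Unfortunately, the most magical part of SVM is also its biggest weakness. The complex data transformations and the resulting boundary are very hard to interpret, which is why SVM is often called a "black box". GLM and DT models are the opposite: they are easy to understand. (Compiled by @CSDN Wang Peng; reviewed by Zhong Hao)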
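The interpretability gap can be seen by asking each fitted model to explain itself. This sketch is illustrative: the random data and the rule "label is 1 when x exceeds 0.5" are invented for the example. The calls `tree.export_text`, `coef_` and `support_vectors_` are part of scikit-learn.

```python
import numpy as np
from sklearn import linear_model, svm, tree

# Hypothetical data: the label is 1 whenever the first feature exceeds 0.5.
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = (X[:, 0] > 0.5).astype(int)

dt = tree.DecisionTreeClassifier(max_depth=2).fit(X, y)
glm = linear_model.LogisticRegression().fit(X, y)
svc = svm.SVC().fit(X, y)

# A decision tree can be printed as human-readable if/else rules.
rules = tree.export_text(dt, feature_names=["x", "y"])
print(rules)

# A logistic model exposes one weight per input feature.
print("logistic coefficients:", glm.coef_)

# An RBF SVM offers only its support vectors and their dual weights;
# there is no per-feature weight to read off.
print("number of support vectors:", svc.support_vectors_.shape[0])
```

The tree and the logistic model both point directly at the first feature, while the RBF SVM yields a list of training points that has to be interpreted through the kernel.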
This article was compiled and edited by CSDN and may not be reproduced without permission. To request permission, contact market@csdn.net
