Category: Big Data

2016-12-30 14:24:32

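I recently changed jobs, and before starting the new one I watched a lot of movies to pass the time. The connections between actors caught my attention: I kept wondering why I so often saw the same people working together. So I used the IMDB movie database to analyze how actors are linked.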

# Load up the useful libraries for building and visualizing networks:
library(reshape2)
library(network)
library(sna)
library(ggplot2)
library(GGally)
library(readr)

system("ls ../input")

# Read in the data, and strip out the unnecessary details:
data <- read_csv("../input/movie_metadata.csv")
network: Classes for Relational Data
Version 1.13.0 created on 2015-08-31.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
                    Mark S. Handcock, University of California -- Los Angeles
                    David R. Hunter, Penn State University
                    Martina Morris, University of Washington
                    Skye Bender-deMoll, University of Washington
 For citation information, type citation("network").
 Type help("network-package") to get started.

Loading required package: statnet.common
sna: Tools for Social Network Analysis
Version 2.4 created on 2016-07-23.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
 For citation information, type citation("sna").
 Type help(package="sna") to get started.

Parsed with column specification:
cols(
  .default = col_integer(),
  color = col_character(),
  director_name = col_character(),
  actor_2_name = col_character(),
  genres = col_character(),
  actor_1_name = col_character(),
  movie_title = col_character(),
  actor_3_name = col_character(),
  plot_keywords = col_character(),
  movie_imdb_link = col_character(),
  language = col_character(),
  country = col_character(),
  content_rating = col_character(),
  imdb_score = col_double(),
  aspect_ratio = col_double()
)
See spec(...) for full column specifications.
Warning message:
“4 parsing failures.
 row    col   expected      actual
2324 budget an integer  2400000000
2989 budget an integer 12215500000
3006 budget an integer  2500000000
3860 budget an integer  4200000000
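The four parsing failures reported for the budget column all involve values larger than the largest 32-bit integer R can store (2,147,483,647), so they do not fit in the integer column type that readr guessed. One possible workaround, sketched here and not part of the original analysis, is to override that one column with readr's `cols()` and `col_double()`. The temporary CSV and its two rows are made up for illustration and are not the real dataset.

```r
library(readr)

# Hypothetical two-row stand-in for movie_metadata.csv (illustration only).
path <- tempfile(fileext = ".csv")
writeLines(c("movie_title,budget",
             "Small Film,1000000",
             "Big Film,12215500000"), path)

# R integers are 32-bit, so values above this limit cannot be stored as integer.
int_max <- .Machine$integer.max

# Override only the budget column; the other columns are still guessed.
movies <- read_csv(path, col_types = cols(budget = col_double()))

# Flag the rows whose budget would have overflowed an integer column.
too_big <- movies$budget > int_max
```

The budget column is not used in the network analysis that follows, so the warning can also simply be ignored for this purpose.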

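The first thing to do is build an object from which a network graph can be generated. Many libraries can do this; I chose sna and network. I also recommend ggplot2 together with its GGally extension, which renders networks nicely. I decided to wrap these steps in a simple function that returns the object we need.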

getIMDBGraph <- function(data,
                         firstyear = 1,
                         lastyear = 3000,
                         genre = FALSE,
                         minscore = 0,
                         maxscore = 10) {
  # Optionally keep only movies whose genre string matches `genre`
  if (is.character(genre)) {
    data <- data[grep(genre, data$genres), ]
  }
  # Filter by release year and IMDB score
  data <- subset(data, data$title_year >= firstyear & data$title_year <= lastyear &
                   data$imdb_score <= maxscore & data$imdb_score >= minscore)
  # Reshape to one row per (movie, actor) pair
  df <- data.frame(data$movie_title, data$actor_1_name,
                   data$actor_2_name, data$actor_3_name)
  df <- melt(df, id.vars = 'data.movie_title')
  names(df) <- c('title', 'actornum', 'actor')
  df <- df[, c(1, 3)]
  # Self-join on title to pair each actor with every co-star in the same movie
  edges <- merge(x = df,
                 y = df,
                 by = 'title')
  edges <- subset(edges, edges$actor.x != edges$actor.y)
  edgelist <- as.matrix(edges[, c(2:3)])
  graph <- network(edgelist)
  return(graph)
}
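To see what the merge step inside getIMDBGraph produces, here is a minimal sketch on a made-up cast list, using only base R. The long-format data frame stands in for the output of melt, and the titles and actor names are hypothetical. Joining the table to itself on title pairs every actor with every actor in the same movie, and dropping the self-pairs leaves the co-star edge list.

```r
# Hypothetical long-format cast table: one row per (title, actor).
cast <- data.frame(
  title = c("Movie A", "Movie A", "Movie A", "Movie B", "Movie B"),
  actor = c("Ann", "Bob", "Cy", "Ann", "Dee"),
  stringsAsFactors = FALSE
)

# Self-join on title: every actor is paired with every actor in the same movie.
pairs <- merge(x = cast, y = cast, by = "title")

# Drop rows where an actor is paired with themselves.
edges <- subset(pairs, actor.x != actor.y)

# Two-column character matrix, the same shape the function passes to network().
edgelist <- as.matrix(edges[, c("actor.x", "actor.y")])

# 3 * 2 ordered pairs for Movie A plus 2 * 1 for Movie B.
n_edges <- nrow(edgelist)
```

Each pair of co-stars appears twice, once in each direction. This matters if the resulting graph is treated as directed, which is the default for network().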

Now we can pass in the data imported earlier, restricted to a chosen time period. In this example, I used the movies from 2000 to 2009.

