分类: 大数据
2016-12-30 14:24:32
我最近换了个工作,在入职之前,我看了很多电影来打发闲暇时光。然而,演员之间的联系引起了我的注意。我经常回想,为什么我总能看到他们在一起工作呢?然后,我就使用 IMDB 电影数据库来分析演员之间的关联。
# Load up the useful libraries for building and visualizing networks: library(reshape2) library(network) library(sna) library(ggplot2) library(GGally) library(readr)
system("ls ../input")
# Read in the data, and strip out the unnecessary details:
data <- read_csv("../input/movie_metadata.csv")
network: Classes for Relational Data
Version 1.13.0 created on 2015-08-31.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
Mark S. Handcock, University of California -- Los Angeles David R. Hunter, Penn State University
Martina Morris, University of Washington
Skye Bender-deMoll, University of Washington For citation information, type citation("network"). Type help("network-package") to get started.
Loading required package: statnet.common
sna: Tools for Social Network Analysis
Version 2.4 created on 2016-07-23.
copyright (c) 2005, Carter T. Butts, University of California-Irvine For citation information, type citation("sna"). Type help(package="sna") to get started.
Parsed with column specification:
cols(
.default = col_integer(),
color = col_character(),
director_name = col_character(),
actor_2_name = col_character(),
genres = col_character(),
actor_1_name = col_character(),
movie_title = col_character(),
actor_3_name = col_character(),
plot_keywords = col_character(),
movie_imdb_link = col_character(),
language = col_character(),
country = col_character(),
content_rating = col_character(),
imdb_score = col_double(),
aspect_ratio = col_double()
)
See spec(...) for full column specifications. Warning message:
“4 parsing failures.
row col expected actual 2324 budget an integer 2400000000 2989 budget an integer 12215500000 3006 budget an integer 2500000000 3860 budget an integer 4200000000 ”
需要做的第一件事就是构建可生成网络图的对象。我们有很多库可以完成这件事情,而我选择了sna和network两个库。另外,我推荐包含GGally的GGplot2库可以将网络完美地可视化。我决定对这些进行一层简单的封装来获得我们需要的对象。
getIMDBGraph<-function(data,
firstyear =1,
lastyear = 3000,
genre = FALSE,
minscore = 0,
maxscore = 10){ if(is.character(genre)){ data<-data[grep(genre, data$genres),]} data<-subset(data, data$title_year >= firstyear & data$title_year <= lastyear & data$imdb_score <= maxscore & data$imdb_score >= minscore)
df<- data.frame(data$movie_title, data$actor_1_name, data$actor_2_name, data$actor_3_name)
df<- melt(df, id.vars = 'data.movie_title')
names(df)<-c('title', 'actornum', 'actor')
df<-df[,c(1,3)]
edges<-merge(x = df,
y = df,
by = 'title')
edges<-subset(edges,edges$actor.x != edges$actor.y)
edgelist<-as.matrix(edges[,c(2:3)])
graph<-network(edgelist) return(graph)}
现在就可以传递之前导入的按时间排序的数据了。在这次案例中,我使用的数据是2000年到2009年的。