Spark RDD - OowarrioroO - ChinaUnix Blog

spark RDD

Category: Big Data

2015-07-17 20:09:04

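1. Basic workflow of RDD operations in Spark
    RDD creation -> RDD transformation -> RDD control -> RDD execution.
    RDD creation: the initial RDD is created through SparkContext, with an in-memory collection or an external file system as the data source.
    RDD transformation: an operation turns one RDD into another RDD.
    RDD control: persist the RDD, keeping it on disk or in memory so that it can be reused later.
    RDD execution: running an action on the RDD triggers a Spark job and outputs the computed result. Results fall into two kinds: one produces a Scala collection or scalar, the other is saved to an external file system.
    Sample program: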
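    // Placeholders: master = cluster address, appName = program identifier,
    // sparkHome = Spark install path, jars = JARs of the Spark program
    val sc = new SparkContext(master, appName, sparkHome, jars) // entry point for creating RDDs
    val file = sc.textFile(path)            // RDD creation (path = file path)
    val filterRDD = file.filter(predicate)  // RDD transformation (predicate = filter function)
    filterRDD.cache()                       // RDD control
    filterRDD.count()                       // RDD execution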
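
    The four stages above can be sketched as a self-contained program. This is only an illustration under assumptions that are not in the original post: it runs in local mode ("local[2]") instead of on a real cluster, uses SparkConf to configure the context, and reads a small temporary file whose contents are made up for the example. The object and method names are hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object RddLifecycleSketch {
  // Runs creation -> transformation -> control -> execution on a tiny
  // temporary file and returns the number of lines starting with "ERROR".
  def countErrorLines(): Long = {
    // Assumption: local mode with 2 threads stands in for a cluster address.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-lifecycle-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input data written to a temporary file.
      val path = Files.createTempFile("rdd-sketch", ".txt")
      Files.write(path, "ERROR disk full\nINFO started\nERROR timeout\n".getBytes("UTF-8"))

      val file = sc.textFile(path.toString)            // RDD creation
      val errors = file.filter(_.startsWith("ERROR"))  // RDD transformation (nothing runs yet)
      errors.cache()                                   // RDD control: keep in memory once computed
      errors.count()                                   // RDD execution: triggers the job
    } finally {
      sc.stop() // release the context even if the job fails
    }
  }

  def main(args: Array[String]): Unit =
    println(countErrorLines())
}
```

    Because filterRDD is cached, a second action on it (for example another count()) would reuse the in-memory data instead of rereading the file.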

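2. Introduction to RDDs
    An RDD can be produced in two ways: 1) from memory or an external storage system; 2) by transforming another RDD, for example with map, filter, join, ...
    RDDs have two kinds of operators: 1) transformations, 2) actions.
        1) A transformation is a lazy operator: it runs only when an action runs.
            The data these operators work on comes in two types:
                (1) Value: can be used directly.
                (2) Key-value pair: its operations are wrapped in PairRDDFunctions; to use them, the user must import org.apache.spark.SparkContext._ (required before Spark 1.3; later versions apply the implicit conversion automatically).
        2) An action triggers Spark to submit a job.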
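
    The difference between transformations and actions, and between value and key-value data, can be seen in a small word count. This is a minimal sketch under assumptions: local mode, an in-memory sample collection invented for the example, and hypothetical object and method names. reduceByKey comes from PairRDDFunctions, and nothing is computed until collect() is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Needed for pair RDD operations before Spark 1.3; on later versions the
// implicit conversion is found automatically and this import is redundant.
import org.apache.spark.SparkContext._

object TransformationVsAction {
  def wordCounts(): Map[String, Int] = {
    // Assumption: local mode instead of a real cluster.
    val conf = new SparkConf().setMaster("local[2]").setAppName("transformation-vs-action")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("a", "b", "a", "c", "a")) // value RDD created from memory
      val pairs = words.map(w => (w, 1))                        // transformation: key-value RDD
      val counts = pairs.reduceByKey(_ + _)                     // transformation from PairRDDFunctions (lazy)
      counts.collect().toMap                                    // action: submits the job
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(wordCounts())
}
```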
