
Category: Big Data

2017-11-16 18:38:32

The machines that AWS EMR launches are all expensive, so I wanted to build a Hadoop cluster on three t2.micro instances myself. Since a t2.micro's 1 GB of memory is not enough to run Spark, I switched to t2.small instances with 2 GB.

0. Download the JDK without Oracle authentication (the cookie header accepts the license agreement):
wget --no-check-certificate -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u151-b12/e758a0de34e24606bca991d704f6dcbf/jdk-8u151-linux-x64.tar.gz

1. Choose an AMI:
bitnami-hadoop-2.8.2-0-linux-debian-8-x86_64-hvm-ebs - ami-00654a65

2. Configure security group ports (https://docs.bitnami.com/aws/apps/hadoop/)
Open at least: 22 (SSH), 80 (HTTP), 443 (HTTPS)

Each daemon in Hadoop listens on a different port. The most relevant ones are:

  • ResourceManager:
    • Service: 8032
    • Web UI: 8088
  • NameNode:
    • Metadata: 9000
    • Web UI: 50070
  • Secondary NameNode:
    • Metadata: 50090
  • DataNode:
    • Data transfer: 50010
    • Metadata: 50020
    • Web UI: 50075
  • Timeline Server:
    • Service: 10200
    • Web UI: 8188
  • Hive:
    • Hiveserver2 binary: 10000
    • Hiveserver2 HTTP: 10001
    • Metastore: 9083
    • WebHCat: 50111
    • Derby DB: 1527


3. Hadoop configuration (SSH into the instance first):
ssh -i jameson-keypair.pem bitnami@13.59.230.131
/opt/bitnami/hadoop/etc/hadoop/core-site.xml — fs.defaultFS: hdfs://localhost:9000 (on a multi-node cluster, point this at the master's hostname instead of localhost)
/opt/bitnami/hadoop/etc/hadoop/hdfs-site.xml — dfs.replication: 1
/opt/bitnami/hadoop/etc/hadoop/yarn-site.xml — yarn.resourcemanager.hostname NEEDS TO BE SET (to the master node's hostname)
/opt/bitnami/hadoop/etc/hadoop/mapred-site.xml
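As a sketch, the property entries for the files above might look like the following on a single-node setup (the `MASTER_HOSTNAME` placeholder is an assumption — substitute your ResourceManager host, and verify against the Bitnami defaults):

```xml
<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <!-- use the master's hostname instead of localhost on a multi-node cluster -->
  <value>hdfs://localhost:9000</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <!-- placeholder: must be set to the master node's hostname -->
  <value>MASTER_HOSTNAME</value>
</property>
```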




Notes on Spark partitioning:

1. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster, but you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
2. The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128 MB by default in HDFS), but you can ask for a higher number of partitions by passing a larger value.
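Both points above can be sketched in plain Python without a Spark cluster. `slice_partitions` is a hypothetical helper that splits a collection into roughly equal slices, the way `sc.parallelize(data, 10)` would; the second snippet computes the default partition count `textFile` would use for a file, given the 128 MB HDFS block size:

```python
import math

def slice_partitions(data, num_slices):
    """Split data into num_slices roughly equal partitions
    (a sketch of how parallelize slices a collection)."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# e.g. sc.parallelize(range(10), 4) yields 4 partitions:
parts = slice_partitions(list(range(10)), 4)
print(parts)  # [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]

# Default partition count for textFile: one partition per HDFS block.
block_size = 128 * 1024 * 1024           # HDFS default block size (128 MB)
file_size = 1 * 1024 * 1024 * 1024       # e.g. a 1 GB file
default_partitions = math.ceil(file_size / block_size)
print(default_partitions)  # 8
```

Passing a larger second argument to textFile than this default asks Spark for more, smaller partitions; asking for fewer than the block count has no effect.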
