Category: LINUX

2010-03-19 11:04:53

One Click Clusters

 

Introduction

 

The three preceding parts have already given us the basic Nimbus functionality; see the figure below:

The deployment and testing above show that the chain cloud client -> WSRF -> Workspace service -> workspace resource manager -> workspace control is fully working.

 

However, Nimbus has one more distinctive feature: One Click Clusters. By deploying the context broker and installing the context agent inside the VM image, a whole cluster can be started from the cloud client with a single command. Refer to the figure and description below to understand it:

 

 

Here the "you" node can be understood as the cloud-client node; "cloud service" can be understood as the Workspace Service node; "ctx broker" is the context broker component; and the eight circles can be understood as the VM cluster.

The steps are as follows:

1. The cloud-client issues the cluster launch command (a single command; see the example after this list).

2. The cloud service passes the cluster information to the context broker, along with the (bootstrap) information that the VMs in the cluster will need.

3. The cloud service provides the hypervisor hosts with the required image files and with the information needed to contact the context broker. A host may load different VM images, and there is no limit on how many times the same image can be loaded; in other words, the VM cluster can be composed of VMs with different characteristics.

4. The context agent component is installed inside the image. Using it, when a host boots the image the VM connects to the context broker over https, fetches the required (bootstrap) information, and completes its own configuration.

5. A remote client can query information from the context broker, most importantly the SSHd public key of every cluster node; the cloud-client then installs these public keys.
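
With the broker and agent in place, the whole cluster really is launched from the cloud client with one command; this is exactly the command used in the verification section further down:

[nimbus@wang135 nimbus-cloud-client-011]$ ./bin/cloud-client.sh --run --hours 1 --cluster samples/rhel5-vm1-cluster.xml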

 

 

 

 

Installation and deployment

Operations on the nimbus server

1) Install nimbus-context-broker

Execute as the globus user:

[globus@wang136 ~]$pwd

/home/globus

[globus@wang136 ~]$ tar zxvf nimbus-context-broker-TP2.2.tar.gz

[globus@wang136 ~]$ cd nimbus-context-broker-TP2.2

[globus@wang136 nimbus-context-broker-TP2.2]$ ./deploy-broker.sh

 

2) Configure password-less CA certificates

Execute as the globus user:

[globus@wang136 ~]$pwd

/home/globus

[globus@wang136 ~]$cp .globus/simpleCA/private/cakey.pem ~/

[globus@wang136 ~]$cp .globus/simpleCA/cacert.pem ~/

[globus@wang136 ~]$ openssl rsa -in cakey.pem -out cakey-unencrypted.pem

Enter pass phrase for cakey.pem:

writing RSA key

[globus@wang136 ~]$chmod 400 ca*.pem

[globus@wang136 ~]$mv cakey-unencrypted.pem  .globus/simpleCA/    

 

Switch to the root user and execute:

[root@wang136 ~]$ mv /home/globus/cacert.pem /etc/grid-security/certificates/

Note: cacert.pem needs to be put into the trusted-certificates directory. You will find that cacert.pem and 2f982487.0 have exactly the same content.
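
To double-check that link, the hash-named file is just the CA certificate stored under its subject-hash name; standard openssl and diff commands are enough to verify it:

[root@wang136 ~]$ openssl x509 -hash -noout -in /etc/grid-security/certificates/cacert.pem
2f982487
[root@wang136 ~]$ diff /etc/grid-security/certificates/cacert.pem /etc/grid-security/certificates/2f982487.0
[root@wang136 ~]$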

 

3) Configure the jndi-config.xml file

Edit /usr/local/globus/etc/nimbus-context-broker/jndi-config.xml and point its "caCertPath" and "caKeyPath" parameters at the cacert.pem and cakey-unencrypted.pem files created above, as follows:

 

            

                

<parameter>
    <name>caCertPath</name>
    <value>/etc/grid-security/certificates/cacert.pem</value>
</parameter>
<parameter>
    <name>caKeyPath</name>
    <value>/home/globus/.globus/simpleCA/cakey-unencrypted.pem</value>
</parameter>

4) Restart the container

Here I simply rebooted the operating system and then started the container again, heh.
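
If you would rather not reboot the whole machine, stopping and starting just the container should be enough; a minimal sketch (globus-stop-container talks to the container's ShutdownService and needs valid credentials; for a container running in the foreground, Ctrl-C also works):

[globus@wang136 ~]$ globus-stop-container
[globus@wang136 ~]$ globus-start-container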

[globus@wang136 ~]$ globus-start-container

2009-06-10 14:14:23,030 INFO  defaults.DefaultAssociationAdapter [main,validate:191] MAC prefix: "A2:AA:BB"

2009-06-10 14:14:23,077 INFO  defaults.Util [main,loadDirectory:244] file modification time for network 'public' is not newer, using old configuration

2009-06-10 14:14:23,097 WARN  defaults.Util [main,loadDirectory:228] not a file: '/usr/local/globus-4.0.8/etc/nimbus/workspace-service/network-pools/.backups'

2009-06-10 14:14:23,190 INFO  defaults.DefaultAssociationAdapter [main,validate:243] Network 'public' loaded with 5 addresses.

... (output omitted) ...

Starting SOAP server at:

With the following services:

 

[1]: AdminService

[2]: AuthzCalloutTestService

[3]: ContainerRegistryEntryService

[4]: ContainerRegistryService

[5]: CounterService

[6]: ElasticNimbusService

[7]: JWSCoreVersion

[8]: ManagementService

[9]: NimbusContextBroker

[10]: NotificationConsumerFactoryService

[11]: NotificationConsumerService

[12]: NotificationTestService

[13]: PersistenceTestSubscriptionManager

[14]: SampleAuthzService

[15]: SecureCounterService

[16]: SecurityTestService

[17]: ShutdownService

[18]: SubscriptionManagerService

[19]: TestAuthzService

[20]: TestRPCService

[21]: TestService

[22]: TestServiceRequest

[23]: TestServiceWrongWSDL

[24]: Version

[25]: WidgetNotificationService

[26]: WidgetService

[27]: WorkspaceContextBroker

[28]: WorkspaceEnsembleService

[29]: WorkspaceFactoryService

[30]: WorkspaceGroupService

[31]: WorkspaceService

[32]: WorkspaceStatusService

[33]: gsi/AuthenticationService

 

 

Notice what is new in the list? NimbusContextBroker.

 

 

 

We need to install nimbus-ctx-agent-2.2.tar.gz inside the VM image, so boot the VM image with xm create and then log in with xm console.

First copy the nimbus-ctx-agent-2.2.tar.gz package into the VM, then run the following checks:

[root@localhost ~]# python -V

Python 2.4.3                         (2.3 or newer is required)

[root@localhost ~]# curl -V

curl 7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5

Protocols: tftp ftp telnet dict ldap http file https ftps          (make sure https is listed)

Features: GSS-Negotiate IDN IPv6 Largefile NTLM SSL libz
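
If you prefer to script these two checks inside the image, a couple of plain one-liners do the job:

[root@localhost ~]# curl -V | grep -iq https && echo "curl has https support"
[root@localhost ~]# python -c 'import sys; assert sys.version_info >= (2, 3)'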

 

[root@localhost ~]# tar zxvf nimbus-ctx-agent-2.2.tar.gz

[root@localhost ~]# mkdir -p /opt/nimbus

 

[root@localhost ~]# mv nimbus-ctx-agent-2.2/* /opt/nimbus/

[root@localhost ~]# /opt/nimbus/ctx/launch.sh

2009-06-10 10:55:17,740 DEBUG @448: [stdout logging enabled]

2009-06-10 10:55:17,742 DEBUG @3237: [file logging enabled @ '/opt/nimbus/ctxlog.txt']

2009-06-10 10:55:17,743 DEBUG @3240: action --tryall

2009-06-10 10:55:17,743 INFO @3285: First running regular instantiation action

2009-06-10 10:55:17,743 ERROR @3298: Problem with regular instantiation action: InvalidConfig: metadata server URL path '/var/nimbus-metadata-server-url' does not exist on filesystem

2009-06-10 10:55:17,744 INFO @3300: Second, running Amazon instantiation action

2009-06-10 10:55:17,745 DEBUG @576: program starting 'curl --silent --url 2007-01-19/meta-data/public-ipv4 -o /dev/stdout'

2009-06-10 10:55:20,972 DEBUG @607: program ended: 'curl --silent --url 2007-01-19/meta-data/public-ipv4 -o /dev/stdout'

2009-06-10 10:55:20,973 DEBUG @1835: 'curl --silent --url 2007-01-19/meta-data/public-ipv4 -o /dev/stdout': exit=1792, stdout='', stderr=''

2009-06-10 10:55:20,973 ERROR @1840: PROBLEM: curl command failed, result: 'curl --silent --url 2007-01-19/meta-data/public-ipv4 -o /dev/stdout': exit=1792, stdout='', stderr=''

2009-06-10 10:55:20,974 CRITICAL @3370: Problem executing: Couldn't obtain pub IP

No error reporting action was configured, cannot inform context broker of this problem.

 

InvalidConfig: metadata server URL path '/var/nimbus-metadata-server-url' does not exist on filesystem

This message is not a problem: when the VM is later started through the cloud, it will go to the context broker to obtain the nimbus-metadata-server-url.

The errors after it are not a problem either: we are on an internal-network IP, so the EC2 metadata site is of course unreachable.

 

OK, we can now shut the image down, rename it rhel5-vm1-agent, and upload it to the cloud.
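
For the upload, the cloud client's transfer mode can be used from the client node; a minimal sketch, assuming the renamed image file is in the current directory (--transfer and --sourcefile are the usual cloud-client options for pushing an image into your personal image directory on the cloud):

[nimbus@wang135 nimbus-cloud-client-011]$ ./bin/cloud-client.sh --transfer --sourcefile rhel5-vm1-agent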

 

 

Verification

Operations on the cloud client side

Execute as the nimbus user:

[nimbus@wang135 samples]$ pwd

/home/nimbus/nimbus-cloud-client-011/samples

[nimbus@wang135 samples]$ vi rhel5-vm1-cluster.xml

 

<cluster>

  <workspace>
    <name>Master</name>
    <image>rhel5-vm1-agent</image>
    <quantity>1</quantity>
    <nic>public</nic>
    <ctx>
      <provides>
        <identity />
      </provides>
      <requires>
        <identity />
        <data name="..."><![CDATA[
"/O=Grid/OU=GlobusTest/OU=simpleCA-wang136.hrwang.com/OU=hrwang.com/CN=Hongrui Wang" nimbus
        ]]></data>
      </requires>
    </ctx>
  </workspace>

  <workspace>
    <name>Slave</name>
    <image>rhel5-vm1-agent</image>
    <quantity>1</quantity>
    <nic>public</nic>
    <ctx>
      <provides>
        <identity />
      </provides>
      <requires>
        <identity />
      </requires>
    </ctx>
  </workspace>

</cluster>

 

[nimbus@wang135 nimbus-cloud-client-011]$ ./bin/cloud-client.sh --run --hours 1 --cluster samples/rhel5-vm1-cluster.xml

SSH public keyfile contained tilde:

  - '~/.ssh/id_rsa.pub' --> '/home/nimbus/.ssh/id_rsa.pub'

 

SSH known_hosts contained tilde:

  - '~/.ssh/known_hosts' --> '/home/nimbus/.ssh/known_hosts'

 

Requesting cluster.

  - Master: image 'rhel5-vm1-agent', 1 instance

  - Slave: image 'rhel5-vm1-agent', 1 instance

 

Context Broker:

   

 

Created new context with broker.

 

Workspace Factory Service:

   

 

Creating workspace "Master"... done.

  - 192.168.1.2 [ client2 ]

 

Creating workspace "Slave"... done.

  - 192.168.1.3 [ client3 ]

 

Launching cluster-001... done.

 

Waiting for launch updates.

  - cluster-001: all members are Running

  - wrote reports to '/home/nimbus/nimbus-cloud-client-011/history/cluster-001/reports-vm'

 

Waiting for context broker updates.

 

Note: what I expected was that, through the ctx sections, the /etc/hosts files of the two launched VMs would be synchronized. But my XML file is apparently too bare and does not specify the corresponding roles, so the launch got stuck at this point. The images the project provides are pre-built for One Click use; they should contain the corresponding scripts under /opt/nimbus/ctx-scripts/, and the official cluster XML files also declare actual roles.
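
If the launch hangs at this point, the cluster can at least be cleaned up from the client side; a minimal sketch, assuming the handle printed above (--terminate and --handle are the usual cloud-client options for destroying a launch):

[nimbus@wang135 nimbus-cloud-client-011]$ ./bin/cloud-client.sh --terminate --handle cluster-001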

 

 

 

 

Summary

Some notes on the mechanisms

1) How does nimbus allocate VMM resources?

First, note that the resource management mode used here is the resource pool, not the more complex pilot mode. Suppose we have added two VMM nodes (cm18 and cm20) to the resource pool, each with 2048M of memory available to nimbus, and both with plenty of disk space. If I then create 8 VMs of 256M each in one go, how will these VMs be distributed? Let's look at it through the following example:

 

$GLOBUS_LOCATION/bin/workspace \
    --poll-delay 200 \
    --numnodes 8 \
    --deploy \
    --file devgroup.epr \
    --groupfile devgroup_master.epr \
    --metadata /home/globus/devenv.xml \
    -s  \
    --deploy-duration 120 --deploy-mem 256 --deploy-state Running

 

When the command has finished, you will find that 7 VMs were deployed on the VMM node cm18 and 1 VM on the VMM node cm20:

 

cm18:/home/opt/workspace/secureimages # ls
wrksp-43  wrksp-45  wrksp-46  wrksp-47  wrksp-48  wrksp-49  wrksp-50

cm20:/opt/workspace/secureimages # ls
wrksp-44

 

Why is that? Heh, see the explanation below:

Anyhow, the simple logic is like so:
 
1. Look for VMMs in the list that have no memory deductions from any VMs
2. Does the VMM have enough memory to fulfil the request?
3. If so, does the VMM support the networks required by the request?
4. If so, pick it.  Deduct required memory.
5. If list is traversed without an answer, repeat without the "no memory
deductions" requirement, i.e., start to put more than one VM on a VMM.
 
If you're dealing with any coscheduled request, this process is repeated N
times for the request (under lock, etc.).  

 

To explain:

1. When a request to create VMs arrives, first search the VMM list for nodes that have had no memory deducted for any VM yet.

2. When such a node is found, check whether this VMM has enough resources to fulfil the request.

3. If it has enough resources, check whether this VMM supports the networks required by the request.

4. If it also supports the required networks, use this VMM to create the VM and deduct the required memory.

5. If the whole VMM list has been traversed and no node without deductions remains, drop the "no memory deducted" condition and search again, i.e. look for a suitable VMM starting over from step 2.

 

This explains the behaviour in the example above: wrksp-43 is first created on cm18, wrksp-44 then goes to cm20 (the last VMM with no memory deducted), and all the remaining VMs land back on cm18, the first node in the list that still has enough memory (7 x 256M = 1792M, which fits within its 2048M). The little script below replays the same logic.
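
A minimal sketch (my own illustration, not Nimbus code; needs bash 4+ for associative arrays) of the two-pass selection for this example:

#!/usr/bin/env bash
# Two VMMs with 2048M free each, eight 256M requests (wrksp-43 .. wrksp-50).
declare -A free=( [cm18]=2048 [cm20]=2048 )
declare -A used=( [cm18]=0    [cm20]=0    )
order=(cm18 cm20)                 # traversal order of the VMM list
mem=256                           # memory requested per VM

for i in $(seq 43 50); do
  picked=""
  # pass 1: prefer VMMs that have had no memory deducted yet
  for vmm in "${order[@]}"; do
    if [ "${used[$vmm]}" -eq 0 ] && [ "${free[$vmm]}" -ge "$mem" ]; then
      picked=$vmm; break
    fi
  done
  # pass 2: drop the "no deductions yet" requirement
  if [ -z "$picked" ]; then
    for vmm in "${order[@]}"; do
      if [ "${free[$vmm]}" -ge "$mem" ]; then picked=$vmm; break; fi
    done
  fi
  free[$picked]=$(( free[$picked] - mem ))
  used[$picked]=$(( used[$picked] + mem ))
  echo "wrksp-$i -> $picked"
done

Running it prints wrksp-43 -> cm18, wrksp-44 -> cm20, and wrksp-45 through wrksp-50 -> cm18, matching the layout seen on the two nodes.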

 

 

2) How to handle VMs that were shut down by hand

For a VM created and started through the cloud client, when its requested running time expires, the image file under /opt/workspace/secureimages/ on the VMM node is deleted. But if, before it expires, you log into the VM and shut it down with shutdown, the image file remains on the VMM node. How do you start it up again in that case?

 

In this situation, on the cloud client side we can do the following:

  ./lib/workspace.sh --start -e history/vm-001/vw-epr.xml

 

For details, see Tim's reply below:

> 2. How to process shutdown vms?
> I startup vm by cloud client command. Then, log on that vm and shut down
> it. The image still stay in /opt/workspace/securityimages on VMM. I can't
> handle that vm by cloud client too. How shoud I restart it again.

I am not sure I understand this question.  You have run the "shutdown" program
inside the VM but you also have the cloud client "handle" (such as "vm-001")?

In that case, there is no cloud client support for this but you should be able
to drop down to the "real" client, like so:

   ./lib/workspace.sh --start -e history/vm-001/vw-epr.xml

That program (see "lib/workspace.sh -h") is tucked away in the cloud client
distribution for situations like this.

 

 

3) Some understanding of the Cloud Scheduler

With the Globus Toolkit and Nimbus we can bring an existing HPC cluster (such as a Torque cluster) into an IaaS platform, i.e. into a cloud architecture. This is in fact one of Nimbus's main application scenarios, since many universities abroad run high-performance computing clusters, also called parallel computing clusters. So how should cloud resource scheduling be understood? Look at the following diagram:

 

 

The figure above only shows the basic position, function and interactions of the cloud scheduler. The system works as follows: a user submits jobs to a job scheduler (whose role is to queue the jobs and allocate resources to process them). The cloud scheduler can query the job scheduler for current job information (for example the number of jobs in the queue and the kinds of resources they need). The cloud scheduler can also access, via MDS, the maintained and continuously updated state of the cloud (including which VMs are running on which physical nodes).

Based on the information collected above (from the job scheduler and from MDS), the cloud scheduler can decide to create specific VMs to run the jobs in the queue. In other words, the cloud scheduler creates and destroys VMs in order to provide an environment suited to executing the jobs.

 

 

Further explanation:

The cloud scheduler obtains cluster and node information from a configuration file (such as the VMM pool configuration file used by Nimbus) and builds an internal structure from it. Dynamic cluster information will be obtained from the cloud MDS (this part is to be completed in the future).

       1. Read the node information contained in the static configuration file.

       2. Store this information in a dynamic data structure.

       3. Start VMs on arbitrarily chosen resources via the nimbus workspace control commands (see the sketch after this list).

       4. Update the internal data structure to reflect the resources used by the VMs.

       5. Destroy the created VMs via the nimbus workspace control commands.

       6. Update the internal data structure to reflect the resources freed after the VMs are destroyed.
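
As a rough idea of steps 3 and 5, such a scheduler would drive the same workspace client used earlier in this post; a minimal sketch, with a hypothetical EPR file name, the service URL (-s ...) left out just like in the earlier example, and assuming the client's --destroy action for tear-down:

$GLOBUS_LOCATION/bin/workspace --deploy \
    --file node01.epr \
    --metadata /home/globus/devenv.xml \
    --deploy-duration 120 --deploy-mem 256 --deploy-state Running

$GLOBUS_LOCATION/bin/workspace --destroy -e node01.epr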

 

 

Note: the original high-performance cluster consists of one head node and several worker nodes, and the Job Scheduler runs inside this cluster.

 

Some shortcomings

High availability is not supported yet

For details see Tim's reply below:

> how you manage failures in physical nodes, how you make recovery virtual
> machines when the node where  they are running crashes. It is possible
> recover this virtual machines?

Nimbus cannot handle these kind of failures currently, like with EC2 it's up to
the user to architect a failure scenario.  The Nimbus administrator can
probably recover the image file used as the instance's disk in an emergency..

 

VM images cannot be Windows

> I am new to Nimbus and wonder if Windows VM is support by Nimbus since this
> is critical to the cloud we are building. I tried to do some search but
> couldn't get any relevant information. Could you please comment on the status
> of Nimbus with Windows virtual machine? Thank you very much.

No it cannot support that.  The VMM controlling software is modular, it could
be fitted to control any VM platform, really.  But the current tools developed
for Nimbus on the VMM are Linux-centric so I think this product would be less
likely to be developed than something else because it is Windows based.

 

The image's root partition must be mounted as sda1

Currently, nimbus only supports image files that are ext3 filesystem images. The entire system has a single partition, the root partition, and it can only be mounted as sda1. Such an ext3 filesystem image is created like this:

dd if=/dev/zero of=/opt/cloud.img bs=1M count=5000

mkfs.ext3  /opt/cloud.img
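
To put a guest system onto the new image, it can then be loop-mounted and populated; this is standard loop-mount usage, with the paths purely as examples:

mount -o loop /opt/cloud.img /mnt
# ... install or copy the guest filesystem under /mnt ...
umount /mnt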

 

Only one NIC gets a DHCP address when a VM boots

During VM startup, DHCP assigns an IP address to only one NIC; if you need multiple NICs, you have to log into the VM and configure the others by hand.
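
For example, a second interface could be brought up manually after logging in; the address below is purely illustrative:

[root@localhost ~]# ifconfig eth1 192.168.2.10 netmask 255.255.255.0 up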
