Amazon S3 provides a good distributed storage solution for accessing the distributed computing power of Amazon EC2. Using Ruby scripts like s3sync and s3cmd at the command line, you can move data to and from EC2 instances in your computing cloud.

Introducing Amazon S3 and Amazon EC2

The distributed computing power of the Amazon Elastic Compute Cloud (Beta) (Amazon EC2™) isn't going to do you much good unless you can get data to and from each of the Amazon EC2 instances in your computing cloud. Amazon's Simple Storage Service (Amazon S3) provides a good distributed storage solution for doing this. Plus, Amazon does not charge for Amazon EC2 instances to read and write data from Amazon S3 buckets.

That's all well and good, but how do we get Amazon EC2 instances to read and write files to Amazon S3? I tried several different approaches while writing this article, and the easiest turned out to be a set of Ruby command-line scripts called s3sync. In this article, I'll show you how to set up an Amazon EC2 image using one of Amazon's Fedora Core 4 images, then show you how to install and use the s3sync code to access files from Amazon S3.

Before you begin, make sure you have the Amazon EC2 command-line utilities installed. Instructions on how to do this are available on the Amazon Web Services site. Amazon also has a complete getting-started guide for its EC2 web service, which is what I used to help me as I was writing this article. It clearly describes how to install the tools, generate a key set, and build Amazon EC2 instances.
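
If you are setting the tools up for the first time, the setup amounts to a few environment variables plus an SSH keypair. Here is a minimal sketch, assuming the tools were unpacked to ~/ec2-api-tools and your X.509 key and certificate were saved under ~/.ec2; all of these paths and file names are placeholders, so adjust them to your own setup:

# Environment for the Amazon EC2 command-line tools (example paths only).
export EC2_HOME=~/ec2-api-tools
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=~/.ec2/pk-XXXX.pem   # your X.509 private key file
export EC2_CERT=~/.ec2/cert-XXXX.pem        # your X.509 certificate file

# Create the SSH keypair used later by ec2-run-instances; save the private
# key it prints to ~/.ec2/id_rsa-gsg-keypair and chmod it to 600.
ec2-add-keypair gsg-keypair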

Next, you have to determine the operating system image you are going to run on the Amazon EC2 instances. For the purposes of this article we are going to use one of the images provided by Amazon. Let's have a look at what those are by using the ec2-describe-images command-line utility:

% ec2-describe-images -o amazon
IMAGE ami-20b65349 ec2-public-images/fedora-core4-base.manifest.xml amazon available public
IMAGE ami-22b6534b ec2-public-images/fedora-core4-mysql.manifest.xml amazon available public
IMAGE ami-23b6534a ec2-public-images/fedora-core4-apache.manifest.xml amazon available public
IMAGE ami-25b6534c ec2-public-images/fedora-core4-apache-mysql.manifest.xml amazon available public
IMAGE ami-26b6534f ec2-public-images/developer-image.manifest.xml amazon available public
IMAGE ami-2bb65342 ec2-public-images/getting-started.manifest.xml amazon available public
IMAGE ami-bd9d78d4 ec2-public-images/demo-paid-AMI.manifest.xml amazon available public

I'll choose the fedora-core4-apache-mysql operating system image, because that's the kind of thing I would get from a hosting company, and it's sure to be full of useful utilities. I'll run an instance of that image using the following commands at the command line:

% ec2-run-instances ami-25b6534c -k gsg-keypair
RESERVATION r-e349af8a 961421114855 default
INSTANCE i-59c02230 ami-25b6534c pending gsg-keypair 0

After the image has booted, the Amazon EC2 command-line utility will give me a hostname. I'll check the name by using the ec2-describe-instances command:

% ec2-describe-instances
RESERVATION r-e349af8a 961421114855 default
INSTANCE i-59c02230 ami-25b6534c ec2-72-44-57-99.z-1.compute-1.amazonaws.com domU-12-31-36-00-3D-83.z-1.compute-1.internal running gsg-keypair 0

Now I have a machine running Fedora Core 4 with a lot of handy stuff installed on it. The next step is to log into the Amazon EC2 instance I just created, using the hostname provided by ec2-describe-instances.

% ssh -i ~/.ec2/id_rsa-gsg-keypair root@ec2-72-44-57-99.z-1.compute-1.amazonaws.com     
The authenticity of host 'ec2-72-44-57-99.z-1.compute-1.amazonaws.com (72.44.57.99)' can't be established.
RSA key fingerprint is f1:4e:d1:14:87:f0:57:71:89:6e:ed:b5:1c:14:84:b5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-72-44-57-99.z-1.compute-1.amazonaws.com,72.44.57.99' (RSA) to the list of known hosts.

         __|  __|_  )  Rev: 2
         _|  (     /
        ___|\___|___|

 Welcome to an EC2 Public Image
       :-)

    Apache2+MySQL4

 __  c  __   /etc/ec2/release-notes.txt

[root@domU-12-31-36-00-3D-83 ~]#

Now the instance is running, you're logged in, and you're ready to install and use the example scripts.

Installing S3Sync

S3sync is the Ruby package I will use to add, update, remove, and list files on the Amazon S3 servers. To do that I will first need to ensure that Ruby is installed, then get the s3sync package and set it up.

To check the Ruby version, I use the following command line:

[root@domU-12-31-36-00-3D-83 ~]# ruby -v
ruby 1.8.4 (2005-12-24) [i386-linux]
[root@domU-12-31-36-00-3D-83 ~]#

This tells me the instance is running a recent version of Ruby, 1.8.4, which is new enough for the s3sync scripts to run. This should do nicely.
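
Incidentally, if the image you choose doesn't already include Ruby, it can usually be added with the distribution's package manager. On a Fedora Core image that would look something like the following; this is just a sketch, since the Amazon image used here already ships with Ruby:

[root@domU-12-31-36-00-3D-83 ~]# yum install -y ruby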

There are two ways that I can get the s3sync code. The first is to go to the s3sync web site and download it to my local computer. I would then copy it to the Amazon EC2 instance. To do that I would use this command:

% scp -i ~/.ec2/id_rsa-gsg-keypair s3sync.tar.gz root@ec2-72-44-57-99.z-1.compute-1.amazonaws.com:/root

The s3sync.tar.gz file would then be located in my home directory on the Amazon EC2 machine.

I can also do this directly from the Amazon EC2 instance using the following commands:

[root@domU-12-31-36-00-3D-83 ~]# wget 
--18:31:18--
=> `s3sync.tar.gz'
Resolving s3.amazonaws.com... 72.21.206.171
Connecting to s3.amazonaws.com|72.21.206.171|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26,667 (26K) []

100%[============================================================================>] 26,667 --.--K/s

18:31:19 (3.21 MB/s) - `s3sync.tar.gz' saved [26667/26667]

[root@domU-12-31-36-00-3D-83 ~]#

Either way, once s3sync.tar.gz is on the Amazon EC2 instance, the next thing to do is unpack it.

[root@domU-12-31-36-00-3D-83 ~]# tar -xzvf s3sync.tar.gz 
s3sync/
s3sync/HTTPStreaming.rb
s3sync/README.txt
s3sync/README_s3cmd.txt
s3sync/S3.rb
s3sync/s3cmd.rb
s3sync/s3config.rb
s3sync/s3config.yml.example
s3sync/S3encoder.rb
s3sync/s3sync.rb
s3sync/s3try.rb
s3sync/S3_s3sync_mod.rb
s3sync/thread_generator.rb
[root@domU-12-31-36-00-3D-83 ~]#

Now all the s3sync files are in a subdirectory, and I can start moving files to and from my Amazon S3 bucket. But before I do that, I have to set two environment variables:

[root@domU-12-31-36-00-3D-83 s3sync]# AWS_ACCESS_KEY_ID=xxxx
[root@domU-12-31-36-00-3D-83 s3sync]# export AWS_ACCESS_KEY_ID
[root@domU-12-31-36-00-3D-83 s3sync]# AWS_SECRET_ACCESS_KEY=xxxx
[root@domU-12-31-36-00-3D-83 s3sync]# export AWS_SECRET_ACCESS_KEY

Change the xxxx in lines 1 and 3 to your own Access Key ID and Secret Access Key (available from your Amazon Web Services account).
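
If you'd rather not export the keys in every shell session, the s3sync package also ships an s3config.yml.example file that s3config.rb can read instead. The following is only a sketch of that alternative; the ~/.s3conf location and the aws_access_key_id / aws_secret_access_key key names are taken from the bundled example files, so check them against your copy:

[root@domU-12-31-36-00-3D-83 s3sync]# mkdir -p ~/.s3conf
[root@domU-12-31-36-00-3D-83 s3sync]# cp s3config.yml.example ~/.s3conf/s3config.yml
[root@domU-12-31-36-00-3D-83 s3sync]# vi ~/.s3conf/s3config.yml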

If everything is working properly, you should be able to use the s3cmd.rb script to list all available buckets:

[root@domU-12-31-36-00-3D-83 s3sync]# ./s3cmd.rb listbuckets
jherr_video
[root@domU-12-31-36-00-3D-83 s3sync]#

To test this I'm going to create a test bucket. If you aren't familiar with Amazon S3 buckets, a bucket is similar to a disk drive. You can have as many buckets as you like, each with a unique name and each containing its own set of directories and files.

I'll create a bucket for this article using the following command:

# ./s3cmd.rb createbucket art072407
#

Then, I check to see whether it worked by using the listbuckets command again:

# ./s3cmd.rb listbuckets           
art072407
jherr_video
#

Now I can list the contents of the bucket using the list command.

# ./s3cmd.rb list art072407
--------------------
#

The output tells me there is nothing in the bucket. So let's put something in it. Just to test it, I'll put the Readme.txt file that comes with the s3sync code into the bucket.

# ./s3cmd.rb put art072407:Readme.txt Readme.txt 
#

The put command copies the file to the Amazon S3 bucket. The first parameter after put is the combined bucket and key name: the bucket name comes before the colon, and the key name after it. In Amazon S3 terms, files are "keys" because Amazon S3 can store any piece of data under a key. Normally, though, your key will be the same as your file name. The last parameter is the name of the local file to copy.
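
As an aside, the key does not have to match the local file name, and it can include a path-like prefix. A put such as the following, with hypothetical names and not actually run as part of this walkthrough, would store the local Readme.txt under the key docs/readme-copy.txt in the same bucket:

# ./s3cmd.rb put art072407:docs/readme-copy.txt Readme.txt
#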

I can then use the list command to see that the file is now in the bucket:

# ./s3cmd.rb list art072407
--------------------
Readme.txt
#

One great thing about Amazon S3 is that all uploaded files are available as URLs from a web browser (or any application that can read a URL). The format of the URL is as follows:

http://s3.amazonaws.com/<bucket name>/<key name>

In the case of this example, the URL is:

http://s3.amazonaws.com/art072407/Readme.txt

But if I go to the URL at this point, I'll get a message telling me that access to the resource is denied, because by default uploaded data is not publicly accessible. To make it publicly accessible, we have to add the x-amz-acl:public-read header to the put command:

# ./s3cmd.rb put art072407:Readme.txt Readme.txt x-amz-acl:public-read
#

Now, if I go back to that URL in my web browser, Amazon S3 will happily show me the Readme.txt file.

To remove the file from the bucket, I run the delete command:

# ./s3cmd.rb delete art072407:Readme.txt
#

Or, to delete everything in the bucket, I run the deleteall command:

# ./s3cmd.rb deleteall art072407
#

As noted above, you can use a URL to get to the data if the Amazon S3 key (the file) is designated as public. To get public data, you can use the following command:

# wget http://s3.amazonaws.com/art072407/Readme.txt
...

But what if the data is private? For that I use the handy get command that comes with s3cmd.rb.

# ./s3cmd.rb get art072407:Readme.txt Out.txt
#

This command takes the Readme.txt file from the Amazon S3 bucket and copies it to the local file Out.txt.

S3Sync

So far I've worked only with reading and writing a single file from the Amazon S3 bucket. What about entire directories of files, with nested subdirectories, and so on? The s3sync package has a solution for that as well: the s3sync.rb command synchronizes whole directory structures with Amazon S3 buckets.

To begin I'll create a new directory called /root/data and copy the contents of the s3sync code to it, just as an example:

# mkdir /root/data
# cp /root/s3sync/* /root/data
#

Now, I'll clear out the Amazon S3 bucket and copy the directory to it using s3sync:

# ./s3cmd.rb deleteall art072407
# ./s3sync.rb -r /root/data/ art072407:/
#

When I list the article bucket now, I can see all the original files:

# ./s3cmd.rb list art072407
--------------------
HTTPStreaming.rb
Readme.txt
Readme_s3cmd.txt
S3.rb
S3_s3sync_mod.rb
S3encoder.rb
s3cmd.rb
...
#

Next, I can remove all of the files from the /root/data directory and re-sync them using s3sync. First, to remove them, I use the following command:

# rm /root/data/*
#

Now, to re-sync from the Amazon S3 bucket I run:

# ./s3sync.rb -r art072407: /root/data
# ls -la /root/data/
total 120
drwxr-xr-x 2 root root 4096 Jul 24 11:59 .
drwxr-x--- 5 root root 4096 Jul 24 11:48 ..
-rwxr-xr-x 1 root root 3427 Jul 24 11:59 HTTPStreaming.rb
-rwxr-xr-x 1 root root 12775 Jul 24 11:59 Readme.txt
-rwxr-xr-x 1 root root 4525 Jul 24 11:59 Readme_s3cmd.txt
...
#

Now I can get and put whole directories of data using Amazon S3 from my Amazon EC2 instance.

To finish, I'm going to delete the contents of the bucket, and then delete the bucket itself:

# ./s3cmd.rb deleteall art072407   
# ./s3cmd.rb deletebucket art072407

To finish working with this example completely, I'm going to delete the Amazon EC2 instance that I was testing:

% ec2-terminate-instances i-59c02230
%

And there you have it: Amazon Simple Storage Service (Amazon S3) access, direct from one of the standard Fedora Core 4 Amazon images with just some simple Ruby scripts and a few environment variables.

Conclusion

Amazon S3 provides a powerful mechanism for moving data between Amazon EC2 instances, and for moving data to and from Amazon EC2 instances for distributed processing. Because all of the languages supported by the Amazon Fedora Core 4 images can access the command line, it's easy to invoke these commands from within page code or batch processors to get and put data from the Amazon S3 buckets.
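
As a sketch of what that might look like, here is a hypothetical batch script that mirrors the /root/data directory to the bucket used in this article; the paths, bucket name, and the idea of running it from cron are assumptions for illustration, built only from the commands shown above:

#!/bin/sh
# backup-to-s3.sh -- hypothetical batch job that mirrors /root/data to an S3 bucket
export AWS_ACCESS_KEY_ID=xxxx          # replace with your Access Key ID
export AWS_SECRET_ACCESS_KEY=xxxx      # replace with your Secret Access Key
cd /root/s3sync || exit 1
./s3sync.rb -r /root/data/ art072407:/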

Jack Herrington is the author of several books and more than 50 articles on technical topics, many of which use PHP. Jack is a PHP and AJAX columnist for IBM developerWorks, and the editor of the AJAX Forum on the IBM developerWorks web site.
