June 17, 2014 – Added instructions to use block storage instead of instance storage
It is one thing to talk about technology. It is another thing to get it to work. Whereas my last blog post talked about the value of running Hadoop on a cloud, this one covers my experience implementing it. I used a Nebula appliance to deploy an OpenStack cloud and Hortonworks Apache Ambari to set up a Hadoop cluster.
- Nebula Appliance – Provides a turnkey solution for deploying OpenStack clouds.
- Apache Ambari – Provides an installation, management and monitoring solution for Hadoop.
The following steps should be done before the install. They are easy to do in Nebula’s graphical interface, and could also be done using the OpenStack command line or API.
- Create a new ssh key for the Hadoop cluster, e.g. ambari.pem. For help with ssh keys and password-less ssh, consult the “Set Up Password-less SSH” section in the Hortonworks install guide.
- Create a security group for the cluster. Hadoop uses a wide range of ports. To make the install simpler, I opened up ping and all TCP ports.
- You may also want to create a separate project just for Hadoop, but I did not do that.
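For readers who prefer the command line, the key and security group steps above could be sketched with the legacy nova client roughly as follows. The key name `ambari` and the group name `hadoop` are placeholders chosen for illustration, and the script only prints the commands so they can be reviewed first. The wide-open rules mirror the simplification described above and are probably not something to copy into a shared environment.

```shell
#!/bin/sh
# Sketch: print the nova commands for the pre-install steps.
# KEY_NAME and GROUP_NAME are illustrative placeholders, not values from the install.
KEY_NAME=${KEY_NAME:-ambari}
GROUP_NAME=${GROUP_NAME:-hadoop}

preinstall_cmds() {
    # Create a key pair; nova prints the private key, so save it to a file.
    echo "nova keypair-add $KEY_NAME > $KEY_NAME.pem"
    echo "chmod 600 $KEY_NAME.pem"
    # Create the security group, then allow ping (ICMP) and all TCP ports.
    echo "nova secgroup-create $GROUP_NAME 'Hadoop cluster'"
    echo "nova secgroup-add-rule $GROUP_NAME icmp -1 -1 0.0.0.0/0"
    echo "nova secgroup-add-rule $GROUP_NAME tcp 1 65535 0.0.0.0/0"
}

preinstall_cmds
```

Once the printed commands look right, piping the output into `sh` runs them against the cloud.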
Deploy a Management Node and Multiple Worker Nodes
I used a CentOS cloud image for all of the servers. The following steps discuss how to use Nebula to deploy four instances: one of the medium flavor for the master instance and three of the large flavor for the worker instances.
STEP 1 – Deploy the Master Instance and Configure a Base CentOS Image
– Deploy one instance of the CentOS image of the medium flavor
– ssh to the instance
ssh -i ambari.pem email@example.com
– Disable SELinux
# edit /etc/sysconfig/selinux and make sure the following line is set
SELINUX=disabled
– Disable iptables
chkconfig iptables off
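The SELinux edit can also be scripted. Below is a minimal sketch, assuming the goal is a `SELINUX=disabled` line in the config file; the function name is made up for illustration, and the demonstration runs on a throwaway copy rather than the real file.

```shell
#!/bin/sh
# Sketch: set SELINUX=disabled in an SELinux config file.
# The function takes the file path so it can be tried on a copy first.
disable_selinux_in() {
    conf=$1
    tmp=$(mktemp) || return 1
    # Rewrite any existing SELINUX= line; other lines pass through unchanged.
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$conf" > "$tmp" || return 1
    # Append the setting if the file had no SELINUX= line at all.
    grep -q '^SELINUX=' "$tmp" || echo 'SELINUX=disabled' >> "$tmp"
    # Write back in place so a symlinked config file stays a symlink.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Demonstration on a throwaway copy rather than the real file.
demo=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$demo"
disable_selinux_in "$demo"
cat "$demo"
```

On the instance itself the call would be `disable_selinux_in /etc/sysconfig/selinux`, run as root. The file change takes effect at the next reboot; `setenforce 0` switches the running system to permissive mode until then. Similarly, `chkconfig iptables off` only keeps the firewall from starting at boot, so `service iptables stop` is needed to stop it right away.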
* Deploying Instances with Block Volume Storage versus Ephemeral Storage
The easiest way to deploy an instance is to use the Nebula graphical interface. Those who use the OpenStack command line interface might choose persistent block volume storage over ephemeral instance storage. One reason is persistence. Another is performance: I sometimes see better performance with block storage than with ephemeral storage.
Here are instructions on how to create a block volume and deploy an instance from it. This would need to be done for the master and all worker instances.
a) Create a block volume
cinder create --image-id <image-id> --display-name <volume name> <size in GB>
note: type ‘nova image-list’ to get a list of images and image IDs
b) Deploy an instance from a block volume
nova boot --flavor <flavor> --key-name <key> --security_groups <security group> --block_device_mapping vda=<volume id>:::0 <instance name>
note: type ‘cinder list’ to list the volumes and volume IDs
STEP 2 – Save the Image and Use it to Deploy Worker Instances
All of the previous steps need to be done for each worker instance, so snapshot the current instance and reuse the resulting image.
– Use Nebula to snapshot the instance and call it “ambari-base-image”.
– Deploy* multiple instances of “ambari-base-image” of the large flavor. I deployed 3 instances to be worker nodes.
Recommendation – A best practice is to deploy Hadoop instances on separate physical nodes. Using the OpenStack CLI, one way to do this is by deploying instances using nova boot and the “different_host” scheduler hint. Consult the OpenStack documentation on scheduling filters for more information.
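As a rough illustration of the scheduler hint, the sketch below builds a nova boot command with one `different_host` hint per instance that is already running. The function name, instance names and UUID placeholders are hypothetical, and the hint only has an effect if the DifferentHostFilter is enabled in the cloud's scheduler, which is an assumption here.

```shell
#!/bin/sh
# Sketch: build a nova boot command that asks the scheduler to avoid the
# hosts of instances already deployed. All argument values are placeholders.
boot_cmd_on_different_host() {
    name=$1; shift
    cmd="nova boot --image ambari-base-image --flavor <flavor> --key-name <key> --security_groups <security group>"
    # One different_host hint per instance UUID passed in.
    for uuid in "$@"; do
        cmd="$cmd --hint different_host=$uuid"
    done
    echo "$cmd $name"
}

boot_cmd_on_different_host worker-2 "<uuid of master>" "<uuid of worker-1>"
```

The UUIDs of existing instances can be read from `nova list`. The command is printed rather than executed so the placeholders can be filled in first.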
STEP 3 – Set Up DNS
All instances must be configured for DNS and reverse DNS. There are different ways to do this. The following approach edits the hosts file on every instance in the cluster to contain the address of each instance, and sets the fully qualified domain name for each instance.
– Edit the hosts file, /etc/hosts, and add a line for each instance in the cluster, giving its IP address followed by its fully qualified domain name.
– Use the hostname command to set the hostname of each instance in the cluster, e.g.
hostname fully.qualified.domain.name
– Edit the network configuration file, /etc/sysconfig/network, and set the following for each instance in the cluster:
HOSTNAME=fully.qualified.domain.name
– Restart networking
service network restart
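Since the same hosts entries go on every instance, it can help to generate them once. Here is a small sketch that turns "IP FQDN" pairs into hosts file lines with a short alias; the addresses and domain names are made-up examples, not the ones from this cluster.

```shell
#!/bin/sh
# Sketch: turn "IP FQDN" pairs into /etc/hosts lines with a short alias.
# The addresses and names below are made-up examples.
hosts_lines() {
    while read -r ip fqdn; do
        # Skip blank lines.
        [ -n "$ip" ] || continue
        # The short name is the FQDN up to the first dot.
        echo "$ip $fqdn ${fqdn%%.*}"
    done
}

hosts_lines <<'EOF'
10.0.0.11 master.hadoop.example.com
10.0.0.12 worker1.hadoop.example.com
10.0.0.13 worker2.hadoop.example.com
10.0.0.14 worker3.hadoop.example.com
EOF
```

The output can be appended to /etc/hosts on each instance. After the hostname is set and networking is restarted, `hostname -f` should print the fully qualified domain name.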
This part of the install covers installing Ambari and setting up the Ambari server. Execute these steps from the master instance.
– Set up the repos. Consult the Hortonworks Ambari install guide for this information.
– Install Ambari
yum install ambari-server
– Set up the Ambari Server. I picked the default settings.
ambari-server setup
– Start the Ambari Server
ambari-server start
– Log in to the installer, e.g. at http://your.ambari.server:8080 using the default username/password admin/admin.
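The server can take a little while to come up after it is started, so a small retry helper may be useful before opening the browser. This is a generic sketch, not part of Ambari; the function name is made up, and the curl probe in the comment assumes the web UI answers on port 8080 at the placeholder host name used above.

```shell
#!/bin/sh
# Sketch: retry a probe command until it succeeds or the attempts run out.
wait_until() {
    tries=$1; delay=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        # Run the probe; success ends the wait.
        "$@" && return 0
        tries=$((tries - 1))
        # Pause between attempts, but not after the last one.
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Trivial demonstration with a probe that always succeeds.
wait_until 3 0 true && echo "probe succeeded"

# Example (commented out, needs a running server): poll the Ambari web UI
# every 5 seconds, up to 30 times.
# wait_until 30 5 curl -sf -o /dev/null http://your.ambari.server:8080
```

Taking the probe as a command keeps the helper independent of curl, so `ambari-server status` or any other check could be substituted.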
Follow the GUI install. I made the following selections on the early screens.
Cluster name: ClusterOne
Stack: HDP 2.0
For install options, enter all of the target hosts and the ssh key.
After Ambari confirms the hosts, it asks which services to install. I chose to install all available services. Ambari describes each of them.
I chose to assign all of the master components except ZooKeeper to the master instance.
For slaves and clients, I chose the following:
For customizing services, I set the required information like passwords but left the other options as default.
Then I began the deploy. Once it completes, the dashboard is displayed.
Now test out the Hadoop cluster!