Install Hadoop on an OpenStack Cloud

June 17, 2014 – Added instructions to use block storage instead of instance storage

It is one thing to talk about technology; it is another to get it to work. Whereas my last blog post talked about the value of running Hadoop on a cloud, this one talks about my experience implementing it. I used a Nebula appliance to deploy an OpenStack cloud and Hortonworks' Apache Ambari to set up a Hadoop cluster.

Definitions

  • Nebula Appliance – Provides a turnkey solution for deploying OpenStack clouds.
  • Apache Ambari – Provides an installation, management and monitoring solution for Hadoop.

Prepare OpenStack

The following steps should be done prior to the install. They can be done easily via Nebula’s graphical interface, or with the OpenStack command line or API.

  • Create a new ssh key for the Hadoop cluster, e.g. ambari.pem.  For help with ssh keys and password-less ssh, consult the “Set Up Password-less SSH” section in the Hortonworks install guide.
  • Create a security group for the cluster. Hadoop uses a wide range of ports, so to make the install simpler I opened up ping and all TCP ports (example CLI commands follow this list).
  • You may also want to create a separate project just for Hadoop, but I did not do that.
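If you prefer the OpenStack CLI for the key and security group, something like the following should work (a sketch using the legacy nova client; the key name “ambari” and group name “hadoop” are just my examples):

   nova keypair-add ambari > ambari.pem
   chmod 600 ambari.pem
   nova secgroup-create hadoop "Hadoop cluster"
   nova secgroup-add-rule hadoop icmp -1 -1 0.0.0.0/0   # allow ping
   nova secgroup-add-rule hadoop tcp 1 65535 0.0.0.0/0  # open all TCP ports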

Deploy a Management and Multiple Worker Nodes

I used a CentOS cloud image for the install and all of the servers. The following steps discuss how to use Nebula to deploy four instances: one of the medium flavor for the master instance and three of the large flavor for the worker instances.

STEP 1 – Deploy the Master Instance and Configure a Base CentOS Image

– Deploy one instance of the CentOS image of the medium flavor

– ssh to the instance
ssh -i ambari.pem centos@10.130.52.105

– Disable SE Linux
setenforce 0

# edit /etc/sysconfig/selinux and make sure the following lines are set
   SELINUX=disabled
   SELINUXTYPE=targeted
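
If you prefer to script that edit, here is a sketch (on CentOS, /etc/sysconfig/selinux is normally a symlink to /etc/selinux/config, so edit the real file):

   sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
   sed -i 's/^SELINUXTYPE=.*/SELINUXTYPE=targeted/' /etc/selinux/config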

– Disable IPTables
   chkconfig iptables off
   /etc/init.d/iptables stop

* Deploying Instances with Block Volume Storage versus Ephemeral Storage
The easiest way to deploy an instance is to use the Nebula graphical interface. Those who use the OpenStack command line interface might instead choose to boot from persistent block volume storage rather than ephemeral instance storage, either so the data outlives the instance or for performance; I sometimes see better performance with block storage than with ephemeral storage.

Here are instructions on how to create a block volume and deploy an instance from it. This would need to be done for the master and all worker instances.

a) Create a block volume
cinder create --image-id <image-id> --display-name <volume name> <size in GB>
note: type 'nova image-list' to get a list of images and image IDs

b) Deploy an instance from a block volume
   nova boot --flavor <flavor> --key-name <key> --security_groups <security group> --block_device_mapping vda=<volume id>:::0 <instance name>
   note: type 'cinder list' to list the volumes and volume IDs
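
For example, with hypothetical values (a 40 GB volume built from the CentOS image, booted as the medium-flavor master; the flavor, key, and security group names are my examples):

   cinder create --image-id <centos-image-id> --display-name horton-master-vol 40
   cinder list        # wait for the volume to show as "available" and note its ID
   nova boot --flavor m1.medium --key-name ambari --security_groups hadoop \
       --block_device_mapping vda=<volume-id>:::0 horton-master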

STEP 2 – Save the Image and Use it to Deploy Worker Instances

All of the previous steps need to be done for each worker instance, so snapshot the current instance and reuse the image.

– Use Nebula to snapshot the instance and call it “ambari-base-image”.

– Deploy multiple instances of “ambari-base-image” of the large flavor (or boot them from block volumes as described in the note above). I deployed 3 instances to be worker nodes.


Nebula GUI Screenshot showing the Master and Worker instances

Recommendation – A best practice is to deploy Hadoop instances on separate physical nodes. Using the OpenStack CLI, one way to do this is to deploy instances with nova boot and the “different_host” scheduler hint, as sketched below. Consult the OpenStack documentation on scheduling filters for more information.
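
A rough sketch of what that might look like (the flavor, key, and group names are my examples, and the cloud operator must have the DifferentHostFilter enabled for the hint to take effect):

   nova boot --flavor m1.large --image ambari-base-image --key-name ambari \
       --security_groups hadoop --hint different_host=<uuid-of-existing-worker> \
       horton-worker-2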

STEP 3 – Setup DNS

All instances must be able to resolve one another by hostname (forward and reverse DNS). There are different ways to do this. The following approach edits the hosts file on every instance in the cluster to contain the address of each instance, and sets the fully qualified domain name of each instance.

– Edit the hosts file, /etc/hosts, and add a line for each instance in the cluster, e.g. add something like:
   10.130.52.105 horton-master.novalocal
   10.130.52.93 horton-worker-1.novalocal
   10.130.52.97 horton-worker-2.novalocal
   10.130.52.12 horton-worker-3.novalocal

– Use the hostname command to set the hostname of each instance in the cluster, e.g.
   hostname fully.qualified.domain.name

– Edit the network configuration file, /etc/sysconfig/network, and set the following for each instance in the cluster:
HOSTNAME=fully.qualified.domain.name

– Restart networking
service network restart
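
A quick sanity check on each node (using the hypothetical addresses above):

   hostname -f                         # should print the fully qualified domain name
   ping -c 1 horton-master.novalocal   # every node should resolve every other node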

Deploy Ambari

This part of the install installs Ambari and sets up the Ambari server. Execute these steps from the master instance.

Repo Info from Hortonworks Documentation

– Set up the repos; consult the Hortonworks Ambari install guide for the repo information.
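
On CentOS this typically comes down to dropping the Ambari repo file into yum’s repo directory (a sketch; take the actual URL from the Hortonworks guide):

   wget -nv <ambari-repo-url-from-hortonworks-guide> -O /etc/yum.repos.d/ambari.repo
   yum repolist   # confirm the Ambari repo is listed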

– Install Ambari
yum install ambari-server

– Setup the Ambari Server. I picked the default settings.
ambari-server setup

– Start the Ambari server
   ambari-server start
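
Optionally, confirm it is running before moving on:
   ambari-server status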

– Log in to the installer, e.g. at http://your.ambari.server:8080, using the default username/password admin/admin.

Ambari Login

Follow the GUI install. I made the following selections on the early screens.

Cluster name: ClusterOne

Stack: HDP 2.0

For install options, enter all of the target hosts and the ssh key, e.g.


Ambari Install Options
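
For reference, the target host list matches the FQDNs added to /etc/hosts earlier, and the ssh key is the private key (ambari.pem) created at the start:

   horton-master.novalocal
   horton-worker-1.novalocal
   horton-worker-2.novalocal
   horton-worker-3.novalocal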

After Ambari confirms the hosts, it asks which services to install. I chose to install all available services; Ambari provides a description of each.


Hadoop Services

I chose to assign all of the master components except ZooKeeper to the master instance.


Assign master components to hosts

For slaves and clients, I chose the following:


Assign Slaves and Clients

For customizing services, I set the required information, like passwords, but left the other options at their defaults.

Then I began the deploy. Once it completes, the dashboard is ready.


Ambari Dashboard

Now test out the Hadoop cluster!


Test with MapReduce Application
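
One quick smoke test is to run one of the bundled MapReduce examples from a node with the clients installed (a sketch; the jar path is typical for HDP 2.x but may differ on your install):

   su - hdfs
   hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100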
