Hadoop and the OpenStack cloud are two of the hottest topics in technology. Wouldn't it be great to pair them up? Such a union was presented in "Hadoop on OpenStack" by Hortonworks' Himanshu Bari. Beyond the buzz factor, let us explore the value of marrying Hadoop with OpenStack.
Countless enterprises are pursuing cloud projects, and perhaps the most important question they need to answer is which applications will run in the cloud. Is Hadoop, as Shaun Connolly posted last summer on the Hortonworks blog, "the Perfect App for OpenStack"? Hortonworks, by the way, raised $100 million in venture funding in March 2014. There is precedent for Hadoop-as-a-service cloud solutions, as demonstrated by Amazon Web Services' Elastic MapReduce (AWS EMR) offering, which is used by companies like Yelp. There is also an OpenStack project called "Sahara" that will, when finished, enable enterprises to more easily build Hadoop-as-a-service offerings.
What is Hadoop?
Hadoop is new and many enterprises are still learning about it. To understand it, consider this colored-candy analogy. Say one wants to learn something from a bowl of multicolored candy: for example, what is the ratio of red to green pieces? With a small amount of candy, it is easy for one person to sort the pieces and do the analysis. What if there are hundreds of pieces? Wouldn't it be better to divide the candy into multiple bowls and have helpers sort each bowl? That is the Hadoop approach. Hadoop divides the data and spreads the work among different workers. It breaks up data analysis so that different "workers," or compute nodes, can process smaller pieces in multiple parallel steps. These workers could be deployed and managed by OpenStack Infrastructure-as-a-Service (IaaS).
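The candy analogy maps directly onto the map/reduce pattern that Hadoop implements at scale. Here is a minimal, single-machine sketch in Python (the candy data and the four-way split are made up for illustration): each "bowl" gets its own map step that tallies colors, and a reduce step merges the partial tallies into the final answer.

```python
from collections import Counter
from functools import reduce

def map_count(bowl):
    """Map step: one worker tallies the colors in its own bowl."""
    return Counter(bowl)

def reduce_counts(a, b):
    """Reduce step: merge two partial tallies into one."""
    return a + b

# Hypothetical data set: hundreds of candies, divided among 4 "workers".
candies = ["red", "green", "red", "blue", "green", "red"] * 100
bowls = [candies[i::4] for i in range(4)]

partial = map(map_count, bowls)            # workers count in parallel
totals = reduce(reduce_counts, partial)    # combine partial results

ratio = totals["red"] / totals["green"]
print(totals["red"], totals["green"], ratio)  # 300 red, 200 green, ratio 1.5
```

Hadoop does the same thing, except the bowls are blocks of a distributed file system and the workers are processes spread across a cluster, so the approach keeps working when the "candy" runs to terabytes.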
Hadoop essentially implements a large distributed pipeline that is extremely fast and efficient at answering questions about big data sets. It is very useful for data scientists and those solving ad hoc problems. It is not, however, a replacement for the business intelligence tools that users rely on for queries and reporting on data sets.
What is the value of Cloud for Hadoop?
Cloud frameworks like OpenStack provide the most value to Hadoop when one needs to manage multiple Hadoop clusters. There are many reasons for having multiple clusters. Different organizations, such as finance, marketing and compliance, might run their own. Different versions may exist for each stage of an application lifecycle, e.g. development, test, QA and production. Cloud IaaS makes it possible to manage all of these Hadoop environments.
The value is that the cloud significantly simplifies the operational aspects of managing Hadoop. Furthermore, Hadoop can also benefit from the general value propositions of cloud: improved agility, better resource utilization and simpler maintenance. One might argue that it is better to optimize the performance of a specific Hadoop cluster using bare-metal physical servers. While that is true, it is important to consider the long-term usage of Hadoop, the number of clusters and how they will be managed. There will likely also be ways to improve performance in the future using technologies like lightweight cloud operating systems or Docker containers.
What is the value of Hadoop for Cloud?
Why is it that enterprises often talk about making IT more agile and efficient, like Facebook and Google, yet are slow to transform? In my experience talking to enterprises and doing consultative engagements like cloud maturity assessments, I have observed that one of the biggest obstacles to adopting cloud methods is legacy applications. Brownfield cloud projects tied to legacy applications are often bogged down by inflexible and heavyweight processes. An advantage of Hadoop is that it is new.
Hadoop can represent a true greenfield use case that offers a faster path to success. Spin up Hadoop as a cloud service and quickly gain the value of analyzing data at unprecedented scale. Furthermore, Hadoop and cloud frameworks like OpenStack are designed to reduce infrastructure costs. Unlike some legacy applications that are designed for vertical scale, requiring bigger and more expensive servers to grow, Hadoop and OpenStack are designed for horizontal scale, leveraging commodity hardware to grow. See my previous blog post on cloud application architecture for a discussion of horizontal scaling.
Use Cases for Hadoop on Cloud
The basic use case for Hadoop on cloud is self-provisioning: spin up a Hadoop cluster on demand instead of waiting days or longer to start a new project. Hadoop-as-a-service allows one to start processing data without worrying about setting up and tuning clusters. A public cloud example of Hadoop-as-a-service is Amazon's Elastic MapReduce offering. Enterprises can build their own private versions too. Technologies like those in the OpenStack "Sahara" project will allow spinning up, resizing and spinning down entire Hadoop clusters. Such capabilities will enable an elastic environment, like a time share, where one can run interactive workloads during the day and batch jobs at night. Furthermore, the pairing of Hadoop and cloud will result in multitenant solutions, where different groups use different versions of Hadoop.
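To make the self-provisioning idea concrete, Sahara describes a cluster declaratively through templates: a cluster template names the Hadoop plugin and version and lists node groups with their counts and processes, and the service then provisions the matching virtual machines on OpenStack. The fragment below is an illustrative sketch only; the exact field names and plugin versions depend on the Sahara release in use, and the names here ("analytics-cluster", the node group layout) are hypothetical.

```json
{
  "name": "analytics-cluster",
  "plugin_name": "vanilla",
  "hadoop_version": "2.3.0",
  "node_groups": [
    {
      "name": "master",
      "count": 1,
      "node_processes": ["namenode", "resourcemanager"]
    },
    {
      "name": "workers",
      "count": 4,
      "node_processes": ["datanode", "nodemanager"]
    }
  ]
}
```

The appeal of this approach is that resizing the cluster for the day/night time-share scenario becomes a matter of changing a count rather than racking servers.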