Setting up your Hadoop cluster in the cloud

By Deba Chatterjee | June 16, 2017 | January 6th, 2018 | Big Data, Blog, Hadoop

In my previous Hadoop posts, I discussed whether data locality is really critical and looked at Hadoop's operational challenges. In this post, we answer five simple questions to help you decide whether to set up your Hadoop cluster on-premises or in the cloud.

We have been receiving a pretty good response to the joint webinar we did with Hortonworks (Hadoop) a couple of weeks back. If you missed the live stream, don't sweat it. We have the recording available for you.

That said, a question that came up after the webinar from a potential customer has really intrigued me. He joined us live, sat through the presentation, and said he really likes the advantages the Robin platform delivers. He wants to use Hadoop for his real-time analytics platform, but he cannot make up his mind whether to set up his Hadoop clusters in the cloud or on bare-metal servers on-premises.

His conundrum: “Should we create our Hadoop cluster on-premises or in the public cloud?”

We exchanged a few emails and then I got on the phone with him. We discussed the following five questions to help him decide. They are quite generic and apply to any data-heavy application you are looking to implement. If you are in a similar dilemma, ask yourself these questions; the answers should help you make your decision.

1. Where is your data?

A lot depends on the answer to this question. If the purpose of your cluster is to perform real-time analytics on web logs or other streaming data that already resides in the cloud, then creating the cluster in the cloud makes more sense. As the volume, velocity, and variety of data increase, it makes more sense to analyze the data closer to its source. Faster analysis leads to faster decisions.

Having said that, if most of your data resides on-premises, it is perfectly fine to build your cluster on-premises.
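To put a rough number on why it helps to analyze data close to where it lives, here is a minimal sketch that estimates how long it would take to move a dataset over a network link. The function name, the decimal-terabyte convention, and the `efficiency` factor are assumptions made for illustration; real transfer times depend on your link, protocol, and tooling.

```python
def transfer_hours(data_tb, link_gbps, efficiency=0.8):
    """Rough time, in hours, to move data_tb terabytes over a link.

    Assumptions (illustrative only): 1 TB = 10**12 bytes, and only a
    fraction `efficiency` of the nominal link speed is actually achieved.
    """
    if data_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    bits_to_move = data_tb * 1e12 * 8              # terabytes -> bits
    effective_bits_per_sec = link_gbps * 1e9 * efficiency
    return bits_to_move / effective_bits_per_sec / 3600


if __name__ == "__main__":
    # Hypothetical example: 50 TB of logs over a 1 Gbps link.
    print(f"{transfer_hours(50, 1):.1f} hours")
```

If the estimate runs into days for the data you need to analyze, that is a strong hint to build the cluster wherever the data already is.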

2. Is this a pilot project?

We often come across customers who are just kicking the tires with Hadoop. The end goal is not completely known, and the customer wants to run a pilot project with a small subset of data. In these cases, waiting for your IT ops team to provision infrastructure before you can even start the project only adds delay. If you want to cut your time to market and get going quickly, the cloud is definitely the better choice.

3. Do you know how many nodes will be needed after 6 months?

Capacity planning (read: right-sizing) is often a big challenge. How do you anticipate growth? You may have a good ballpark for storage, but more often than not our customers end up overestimating their CPU (compute) requirements. As a result, they are stuck with underutilized hardware. The cloud alleviates this problem because it allows you to scale out based on your actual growth, so you allocate only the resources you need and size your cluster accordingly.
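As a back-of-the-envelope illustration of the storage side of this estimate, the sketch below projects the number of data nodes needed after a given number of months. The function name, the compound monthly growth model, and the headroom value are assumptions for illustration; the default of three replicas matches the HDFS default replication factor. It says nothing about CPU sizing, which is the harder part to predict.

```python
import math


def nodes_needed(raw_tb, monthly_growth, months, usable_tb_per_node,
                 replication=3, headroom=0.25):
    """Estimate data nodes required after `months` of growth.

    Assumptions (illustrative only):
      - data grows by a fixed fraction `monthly_growth` each month
      - every block is stored `replication` times (HDFS default is 3)
      - `headroom` is the fraction of disk kept free for temp/shuffle data
    """
    if usable_tb_per_node <= 0 or not 0 <= headroom < 1:
        raise ValueError("invalid input")
    projected_tb = raw_tb * (1 + monthly_growth) ** months
    stored_tb = projected_tb * replication          # replicas on disk
    required_tb = stored_tb / (1 - headroom)        # leave free space
    return math.ceil(required_tb / usable_tb_per_node)


if __name__ == "__main__":
    # Hypothetical example: 10 TB today, 10% monthly growth, 10 TB per node.
    print(nodes_needed(10, 0.10, 6, 10))
```

Running the same estimate for several growth rates shows how wide the range can be, which is exactly the uncertainty that scaling out on demand lets you sidestep.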

4. Are you on a tight budget?

The answer to this question is yes for almost every customer I have come across; nearly everyone operates on a shoestring budget these days. If you want to start with a small footprint and a small budget, the cloud is the place to start. You can start small and allocate only the resources you need to kick-start your project. The cloud helps you stay frugal!

5. Do you want to simplify your application management?

The cloud gives you the flexibility to provision your cluster with different configurations depending on the data you are trying to analyze. While setting up a Hadoop cluster on a set of bare-metal boxes can be really challenging, various services available in the cloud make this much easier.

We are bringing our ROBIN Hyper-Converged Kubernetes Platform to the public cloud. Very soon you will be able to set up your Hadoop cluster with just a few clicks of the mouse on your favorite public cloud platform.

Coming Soon

ROBIN Hyper-Converged Kubernetes Platform Community Edition*

RCP Community Edition (CE) is ideal for small DevOps teams looking to get started with RCP and experiment with container-based apps.

  • Designed to run any Linux application, especially stateful applications like big data and databases
  • Spin up clusters within minutes
  • Includes QoS, Scaling, Cloning, Snapshots, Time Travel
  • “Free for Life”; auto-deployment on AWS with up to 5 nodes (any size)

*CE includes pre-packaged Cassandra, MongoDB, Elasticsearch and Big Data bundles – Hadoop (Hortonworks, Cloudera)

While you wait for the RCP Community Edition, here is a sneak peek at how easy it is to set up and manage an HDP cluster on AWS.

HDP on ROBIN Hyper-Converged Kubernetes Platform Setup on AWS


Author Deba Chatterjee, Director Products