Taming Cassandra with Containers: Multi-tenancy, Guaranteed Performance, Rapid Copies and more

Container-based virtualization and microservice architecture have taken the world by storm. Applications with a microservice architecture consist of a set of narrowly focused, independently deployable services, which are expected to fail. The advantages are increased agility and resilience. Agility, because individual services can be updated and redeployed in isolation. Resilience, because microservices are distributed and can be deployed across different platforms and infrastructures, which forces developers to think about resilience from the ground up instead of as an afterthought. These are the defining principles for large web-scale and distributed applications, and web companies like Netflix, Twitter, Amazon, and Google have benefited significantly from this paradigm.

Add containers to the mix

Containers are fast to deploy, allow bundling of all dependencies required for the application (break out of dependency hell), and are portable, which means you can truly write your application once and deploy it anywhere. Microservice architecture and containers, together, make applications that are faster to build and easier to maintain while having overall higher quality.

ROBIN Hyper-Converged Kubernetes Platform to manage Cassandra management challenges

Image borrowed from Martin Fowler’s excellent blog: http://martinfowler.com/articles/microservices.html

A major change forced by microservice architecture is decentralization of data. This means, unlike monolithic applications which prefer a single logical database for persistent data, microservices prefer letting each service manage its own database, either different instances of the same database technology or entirely different database systems.

Unfortunately, databases are complex beasts: they depend heavily on storage, have customized solutions for HA, DR, and scaling, and, if not tuned correctly, directly impact application performance. Consequently, the container ecosystem has largely ignored the heart of most applications, storage, and this limits the benefits of container-based microservices because stateful, data-heavy services such as databases cannot be containerized.

The majority of container ecosystem vendors have focused mostly on stateless applications. Why? Stateless applications are easy to deploy and manage. For example, they can respond to events by adding or removing instances of a service without the application having to be significantly changed or reconfigured. For stateful applications, most container ecosystem vendors have focused on orchestration, which only solves the problems of deployment and scale, while existing storage vendors have tried to retrofit their current solutions for containers via volume plug-ins to orchestration solutions. Unfortunately, this is not sufficient.

ROBIN Hyper-Converged Kubernetes Platform

Robin is container-based, application-centric server and storage virtualization software that turns commodity hardware into a high-performance, elastic, and agile application/database consolidation platform. In particular, Robin is built for data applications such as databases and big data clusters: it provides all the benefits of hypervisor-based virtualization but with bare-metal performance (up to 40 percent better than VMs) and application-level IO resource management capabilities such as minimum IOPS guarantees and maximum IOPS caps. Robin also dramatically simplifies data lifecycle management with features such as one-click database snapshots, clones, and time travel.

To dive deeper into this, let’s take the example of Cassandra, a modern NoSQL database, and look at the scope of management challenges that need to be addressed.

Cassandra Management Challenges

While poor schema design and query performance remain the most prevalent problems, they are largely application- and use-case-specific, and require an experienced database administrator to resolve. In fact, I would say most Cassandra admins, or any DBA for that matter, enjoy this task and pride themselves on being good at it.

The management tasks which database admins would rather avoid and have automated are:

  1. Low utilization and lack of consolidation
  2. Complex cluster lifecycle management
  3. Manual & cumbersome data management
  4. Costly scaling

Let’s look at these one by one.

1 – Low utilization and lack of consolidation
Cassandra clusters are typically created per use case or SLA (read intensive, write intensive). In fact, the common practice is to give each team its own cluster. This would be an acceptable practice if clusters weren’t deployed on dedicated physical servers. To avoid performance and noisy-neighbor issues, most enterprises stay away from virtual machines. This, unfortunately, means that the underlying hardware has to be sized for peak workloads, and because load profiles vary, large amounts of capacity sit spare and hardware sits idle.

All this leads to poor utilization of infrastructure and very low consolidation ratios. This is a big issue for enterprises both on-premises and in the cloud.

Underutilized servers == Wasted money.
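To put rough numbers on this, here is a small Python sketch comparing per-team clusters sized for their own peaks against shared hardware sized for the combined peak. The load profiles and team names are entirely made up for illustration, and each profile is expressed as a fraction of one server's capacity.

```python
import math

# Hypothetical load per time slot, as a fraction of one server's capacity.
# The three teams peak at different times. All numbers are invented.
loads = {
    "read_heavy":  [0.2, 0.2, 0.9, 0.3],
    "write_heavy": [0.8, 0.3, 0.2, 0.2],
    "analytics":   [0.1, 0.9, 0.1, 0.2],
}


def dedicated_servers(loads):
    """One cluster per team, each sized for that team's own peak."""
    return sum(math.ceil(max(profile)) for profile in loads.values())


def consolidated_servers(loads):
    """Shared hardware sized for the peak of the combined load."""
    combined = [sum(slot) for slot in zip(*loads.values())]
    return math.ceil(max(combined))


def utilization(loads, servers):
    """Average fraction of the provisioned capacity that is actually used."""
    total_work = sum(sum(profile) for profile in loads.values())
    slots = len(next(iter(loads.values())))
    return total_work / (servers * slots)


if __name__ == "__main__":
    for label, count in [("dedicated", dedicated_servers(loads)),
                         ("consolidated", consolidated_servers(loads))]:
        print(f"{label}: {count} servers, "
              f"{utilization(loads, count):.0%} average utilization")
```

With these invented profiles, dedicated sizing needs three servers while consolidated sizing needs two, because the peaks do not coincide. The saving only materializes if the workloads can share hardware without hurting each other, which is the isolation problem discussed next.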

The Robin Advantage – Robin uses containers to provide 1-click, rapid, self-service deployment of Cassandra and DataStax clusters. It also complements containers with its Quality of Service feature to provide complete performance isolation. This means only Robin allows multiple applications to run on the same infrastructure without impacting each other, thus increasing the average hardware utilization and delivering significantly larger consolidation ratios. Typically, customers see a reduction of over 40% in hardware by adopting the ROBIN Hyper-Converged Kubernetes Platform.

ESG Lab Review: Container-Based Virtualization for Databases and Data-Centric Applications

2 – Complex Cluster Lifecycle Management
Given the need for physical infrastructure (compute, network, and storage), provisioning Cassandra clusters on-premises can be time-consuming and tedious. The hardest part of this activity is estimating the read and write performance that the designed configuration will deliver, which often involves extensive schema design and performance testing on individual nodes.

Besides initial deployment, enterprises also have to cater to node failures. Failures are the norm and have to be planned for from the get-go. Node failures can be temporary or permanent and can result from various causes – hardware faults, power failure, natural disaster, operator errors, etc. While Cassandra is designed to withstand node failures, they still have to be resolved by adding replacement nodes, and they place additional load on the remaining nodes for data rebalancing – after the failure and again after the new nodes are added.

The Robin Advantage – Robin’s orchestration capabilities make deployment and right-sizing of large and complex clusters a breeze. Administrators can now change the CPU, memory, and read and write IOPS assigned to individual clusters dynamically and in real time. Robin also provides automatic failover for all nodes of the cluster, thus eliminating the notion of node failures.

3 – Manual Data Management
Unlike traditional databases such as Oracle, Cassandra does not come with utilities that automatically back up the database. Cassandra offers backup in the form of snapshots and incremental copies, but they are quite limited in features. The most notable limitations of snapshots are:

  • Require hard links to store point-in-time copies
  • Use the same disk as data files (compaction makes this worse)
  • Are per node
  • Do not include a schema backup
  • Do not support off-site relocation
  • Have to be removed manually
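The hard-link and same-disk points can be illustrated with a small Python sketch. It does not touch Cassandra at all: it creates a stand-in data file in a temporary directory (the file name and the snapshot tag "demo" are hypothetical), hard-links it into a snapshots subdirectory, and then deletes the original the way compaction eventually would.

```python
import os
import tempfile


def snapshot_demo():
    """Return (links after snapshot, links after 'compaction', bytes still held)."""
    with tempfile.TemporaryDirectory() as table_dir:
        # Stand-in for a live data file; the name is made up for this sketch.
        live_file = os.path.join(table_dir, "example-1-Data.db")
        with open(live_file, "wb") as f:
            f.write(b"x" * 4096)

        # A snapshot is a hard link: a second name for the same blocks,
        # on the same filesystem, with no data copied.
        snap_dir = os.path.join(table_dir, "snapshots", "demo")
        os.makedirs(snap_dir)
        snap_file = os.path.join(snap_dir, os.path.basename(live_file))
        os.link(live_file, snap_file)
        links_after_snapshot = os.stat(live_file).st_nlink

        # Simulate compaction discarding the live file. The snapshot now
        # holds the only reference, so the space stays allocated on this disk.
        os.remove(live_file)
        links_after_compaction = os.stat(snap_file).st_nlink
        bytes_still_held = os.path.getsize(snap_file)

        return links_after_snapshot, links_after_compaction, bytes_still_held


if __name__ == "__main__":
    print(snapshot_demo())
```

The snapshot is free at the moment it is taken, but once the live file is gone the snapshot pins the full size of the data on the same disk until someone removes it by hand. This is why compaction makes the disk-usage problem worse over time.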

Similarly, data recovery is fairly involved. Data recovery can be necessary for two reasons:

  1. To recover the database from failures. For example, in the case of data corruption or loss of data due to an incorrect "truncate table".
  2. To create a copy of the database for other uses. For example, to create a clone for dev/test environments to test schema changes.

Typical steps to recover a node from data failures

To optimize the space used for backups, most enterprises retain the last two or three backups on the server and move the rest to a remote location. This means that, depending on which point in time you need, you may be able to restore locally on the server or may have to move files over from a remote source.
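The retention practice described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not a backup tool: the snapshot names, the dates, and the helper functions `plan_retention` and `restore_source` are all hypothetical.

```python
from datetime import date


def plan_retention(snapshots, keep_local=3):
    """Split (name, taken_on) pairs into names kept on the node and names moved off-site."""
    newest_first = sorted(snapshots, key=lambda snap: snap[1], reverse=True)
    local = [name for name, _ in newest_first[:keep_local]]
    remote = [name for name, _ in newest_first[keep_local:]]
    return local, remote


def restore_source(wanted, local, remote):
    """Say where a restore of the named snapshot has to read its files from."""
    if wanted in local:
        return "local"
    if wanted in remote:
        return "remote"
    raise KeyError(f"no such snapshot: {wanted}")


if __name__ == "__main__":
    # Invented snapshot history for one node.
    history = [
        ("snap-mon", date(2024, 1, 1)),
        ("snap-tue", date(2024, 1, 2)),
        ("snap-wed", date(2024, 1, 3)),
        ("snap-thu", date(2024, 1, 4)),
        ("snap-fri", date(2024, 1, 5)),
    ]
    local, remote = plan_retention(history, keep_local=3)
    print("kept on node:", local)
    print("moved off-site:", remote)
    print("restore snap-tue from:", restore_source("snap-tue", local, remote))
```

Because Cassandra snapshots are per node, this bookkeeping has to be repeated, and kept consistent, on every node in the cluster.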

While DataStax Enterprise does provide the ability to schedule backups via OpsCenter, this still necessitates careful planning and execution.

The Robin Advantage – Robin offers cluster-wide, automated backup, restore, and cloning of Apache Cassandra and DataStax Enterprise. Robin snapshots are instantaneous and storage efficient, and thus greatly simplify storage planning and operations and significantly reduce the cost of maintaining large clusters in production and dev/test environments.

Cassandra: Snapshot, clone and time travel – Watch demo.

4 – Costly Scaling
With Cassandra’s ability to scale linearly, most administrators are quite accustomed to adding nodes (scaling out) to expand the size of clusters. With each node, you gain additional processing power and data capacity. But while node addition is required to cater to a steady increase in database usage, how does one handle transient spikes?

Let’s look at a scenario. Typically once a year, most retail enterprises go through a planning frenzy for the post-Thanksgiving holiday shopping deluge. Unfortunately, after that event, the majority of the infrastructure sits idle, or administrators have to break up the cluster and repurpose it for other uses. Wouldn’t it be interesting if there were a way to simply add and remove resources dynamically and scale the cluster up or down based on transient load demands?

The Robin Advantage – Robin provides multiple scaling options. For scale out, it implicitly provisions both the application software and the infrastructure required to add new nodes to the cluster. For transient spikes in workload due to expected or unexpected changes in application usage, Robin adds the new paradigm of scaling clusters up and down. This means you can change the CPU, memory, and IOPS allocated to individual clusters dynamically and in real time.
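To make the idea of minimum IOPS guarantees and maximum IOPS caps concrete, here is a toy allocator in Python. It is not Robin's implementation, which this article does not describe. It is a generic sketch under stated assumptions: a fixed pool of IOPS, integer quantities, and made-up cluster names and numbers. Each cluster first receives its guaranteed minimum (or its demand, if lower), and the spare capacity is then shared evenly among clusters that still want more, never exceeding a cluster's cap.

```python
def allocate_iops(capacity, clusters):
    """clusters maps name -> (min_iops, max_iops, demand); returns name -> granted IOPS."""
    # A cluster never receives more than it asks for or more than its cap.
    ceiling = {name: min(max_iops, demand)
               for name, (_, max_iops, demand) in clusters.items()}

    # Step 1: honor the minimum guarantees.
    grant = {name: min(min_iops, ceiling[name])
             for name, (min_iops, _, _) in clusters.items()}
    spare = capacity - sum(grant.values())
    if spare < 0:
        raise ValueError("minimum guarantees exceed total capacity")

    # Step 2: share what is left evenly among clusters still below their ceiling.
    hungry = [name for name in clusters if grant[name] < ceiling[name]]
    while spare > 0 and hungry:
        share = max(spare // len(hungry), 1)
        for name in hungry:
            extra = min(share, ceiling[name] - grant[name], spare)
            grant[name] += extra
            spare -= extra
        hungry = [name for name in hungry if grant[name] < ceiling[name]]
    return grant


if __name__ == "__main__":
    # Invented numbers: (minimum guarantee, maximum cap, current demand).
    clusters = {
        "prod":      (4000, 8000, 9000),
        "test":      (1000, 3000, 500),
        "analytics": (2000, 10000, 6000),
    }
    print("normal day:", allocate_iops(10000, clusters))

    # Scale "prod" up for a transient spike by raising its guarantee only.
    clusters["prod"] = (7000, 8000, 9000)
    print("during spike:", allocate_iops(10000, clusters))
```

Scaling a cluster up or down in this model is just a change to its tuple. No nodes are added or removed, which is what makes the approach attractive for short-lived spikes.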

Cassandra: Quality of Service – Watch demo.

Many enterprises have experimented with different virtualization technologies such as VMware and Docker containers, and with orchestrators such as Mesos and Kubernetes, but they soon discover that these tools, along with their basic storage support, only solve the problem of deployment and are unable to address challenges with database systems in terms of failover, data and performance management, and the ability to take care of transient workloads. Only a platform such as the ROBIN Hyper-Converged Kubernetes Platform, which is designed for distributed database and big data applications, can help address Cassandra management challenges and deliver higher productivity while lowering the cost of infrastructure and operations.


Author Adeesh Fulay, Director Products
