Discover the versatility of OpenEBS

Key points to remember

  • Running stateful workloads on Kubernetes used to be difficult, but the technology has matured. Today, up to 90% of enterprises believe Kubernetes is ready for production data workloads
  • OpenEBS provides storage for stateful applications running on Kubernetes, including dynamic local persistent volumes and volumes replicated by various “data engines”
  • Local PV data engines provide excellent performance, but at the risk of data loss due to node failure
  • For replicated engines, three options are available: Jiva, cStor, and Mayastor. Each engine supports different use cases and needs
  • OpenEBS can handle a wide range of applications, from casual testing and experimentation to high-performance production workloads

When I teach Kubernetes trainings, there is a chapter that invariably comes at the end of the course, never earlier: the chapter on stateful sets and persistent storage; in other words, running stateful workloads on Kubernetes. While running stateful workloads on Kubernetes used to be genuinely difficult, up to 90% of enterprises now believe K8s is ready for production data workloads. The final lab in this chapter is to run a PostgreSQL benchmark (which continuously writes to disk) in a pod, then break the node running that pod and demonstrate the various mechanisms involved in failover (such as evictions triggered when the node’s lease expires). Historically, I’ve used Portworx for this demo. Recently, I decided to try OpenEBS.

In this article, I’ll give you my first impressions of OpenEBS: how it works, how to get started with it, and what I like about it.

OpenEBS provides storage for stateful applications running on Kubernetes, including dynamic local persistent volumes (like the Rancher local path provisioner) and volumes replicated by various “data engines”. Think of Prometheus, which can run on a Raspberry Pi to monitor the temperature of the beer or sourdough cultures in your basement, but can also scale to monitor hundreds of thousands of servers: in the same way, OpenEBS can be used for simple projects and quick demos, but also for large clusters with sophisticated storage needs.

OpenEBS supports many different “data engines”, which can be a bit overwhelming at first. But these data engines are precisely what makes OpenEBS so versatile. There are “local PV” engines that typically require little or no configuration, provide good performance, but exist on a single node and become unavailable if that node fails. And there are replicated engines that provide resilience against node failures. Some of these replicated engines are very easy to configure, but the ones that offer the best performance and functionality will require a bit more work.

Let’s start with a quick review of all these data engines. The following is not a replacement for the excellent OpenEBS documentation; it is simply my own way of explaining these concepts.

Local PV Data Engines

Persistent volumes using one of the “local PV” engines are not replicated across multiple nodes: OpenEBS uses the node’s local storage. Several variations of the local PV engine are available. It can use local directories (exposed as HostPath volumes), existing block devices (disks, partitions, or others), raw files, ZFS filesystems (enabling advanced features such as snapshots and clones), or Linux LVM volumes (in which case OpenEBS works similarly to TopoLVM).

The obvious disadvantage of local PV data engines is that a node failure makes the volumes on that node unavailable; and if the node is lost, so is the data that was on it. In exchange, these engines offer excellent performance: since there is no overhead in the data path, read/write throughput is the same as if we were using the storage directly, without containers. Another advantage is that the Host Path Local PV works immediately after installing OpenEBS, without requiring any additional configuration, similar to the Rancher Local Path Provisioner. Extremely handy when I need a storage class “right away” for a quick test!
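To illustrate how little ceremony that involves, here is a minimal sketch of a claim against that engine. I’m assuming the storage class is named openebs-hostpath, which is what a default install created on my cluster; adjust the name and size to match your setup.

```yaml
# Minimal sketch: claim a volume from the Host Path Local PV engine.
# Assumption: the default install created a storage class named
# "openebs-hostpath"; check the storage classes on your own cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: quick-test
spec:
  storageClassName: openebs-hostpath
  accessModes:
    - ReadWriteOnce   # local volumes live on a single node
  resources:
    requests:
      storage: 5Gi
```

Any pod referencing this claim gets a directory carved out of the local storage of whichever node ends up running it, and the volume stays pinned to that node.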

Replicated Engines

OpenEBS also offers several replicated engines: Jiva, cStor and Mayastor. I’ll be honest, I was quite confused at first: why do we need not one, not two, but three replicated engines? Let’s find out!

Jiva engine

The Jiva engine is the simplest. Its main advantage is that it does not require any additional configuration: like the Host Path Local PV engine, Jiva works as soon as OpenEBS is installed, and it replicates data across nodes. With the default settings, each time we provision a Jiva volume, three storage pods are created, with a scheduling constraint ensuring they land on different nodes. This way, a single node failure cannot take out more than one replica of a given volume. The Jiva engine is simple to use, but it lacks the advanced features of the other engines (such as snapshots, clones, or adding capacity on the fly), and the OpenEBS docs mention that Jiva is suitable when “capacity needs are low” (say, below 50 GB). In other words, it’s fantastic for testing, labs, or demos, but maybe not for that giant production database.
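As a concrete illustration, here is roughly what a Jiva storage class looks like with the pre-CSI provisioner I tested, with the replica count spelled out explicitly. Treat it as a sketch: newer OpenEBS releases ship a CSI driver for Jiva with a slightly different configuration, so check the docs for your version.

```yaml
# Sketch: a Jiva storage class with 3 replicas (the default),
# using the legacy (non-CSI) Jiva provisioner.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: jiva-3-replicas
  annotations:
    openebs.io/cas-type: jiva
    cas.openebs.io/config: |
      - name: ReplicaCount
        value: "3"
provisioner: openebs.io/provisioner-iscsi
---
# A claim against it looks like any other PVC.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jiva-vol
spec:
  storageClassName: jiva-3-replicas
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```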

cStor engine

Next on the list is the cStor engine. This one brings us the additional features mentioned above (snapshots, clones, and adding capacity on the fly), but it takes a bit more work to set up. Namely, you need to involve NDM, the Node Disk Manager component of OpenEBS, and tell it which available block devices you want to use. This means you should have free partitions (or even entire disks) to dedicate to cStor.
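To give an idea of what that configuration involves, here is a sketch based on the CSI flavour of cStor as described in the OpenEBS docs: a pool built from block devices discovered by NDM, plus a storage class pointing at that pool. The node name and block device name below are placeholders; use the ones reported by kubectl get blockdevices -n openebs on your cluster.

```yaml
# Sketch: a cStor pool built from a block device discovered by NDM,
# and a storage class using that pool.
# "worker-1" and "blockdevice-xxxx" are placeholders.
# Repeat the pool entry for each node: three replicas need pools
# on at least three nodes.
apiVersion: cstor.openebs.io/v1
kind: CStorPoolCluster
metadata:
  name: cstor-pool
  namespace: openebs
spec:
  pools:
    - nodeSelector:
        kubernetes.io/hostname: worker-1
      dataRaidGroups:
        - blockDevices:
            - blockDeviceName: blockdevice-xxxx
      poolConfig:
        dataRaidGroupType: stripe
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cstor-3-replicas
provisioner: cstor.csi.openebs.io
allowVolumeExpansion: true
parameters:
  cas-type: cstor
  cstorPoolCluster: cstor-pool
  replicaCount: "3"
```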

If you don’t have an additional disk or partition available, you may be able to use loop devices instead. However, since loop devices incur a significant performance overhead, in that case you might as well use the Jiva provisioner: it will achieve similar results and is much easier to configure.

Mayastor engine

Finally, there is the Mayastor engine. It is designed to work closely with NVMe (non-volatile memory express) drives and protocols (it can still use non-NVMe drives though). I was wondering why this was such a big deal, so I dug a bit.

In older storage systems, you could only send one command at a time: read this block, or write this block. Then you had to wait for that command to complete before you could submit another one. Later, it became possible to submit multiple commands and let the disk reorder them to execute them faster; for example, to reduce the number of head seeks using an elevator algorithm. In the late 90s, the ATA-4 standard introduced TCQ (Tagged Command Queuing) into the ATA specification. It was later greatly improved upon by NCQ (Native Command Queuing) with SATA drives. SCSI drives had longer command queues, which is also one reason why they were more expensive and more likely to be found in high-end servers and storage systems.

Over time, queuing systems have evolved a lot. Early standards allowed a few dozen commands in a single queue; with NVMe, we are talking about thousands of commands in thousands of queues. This makes multicore systems more efficient, because queues can be tied to specific cores, which reduces contention. We can also have priorities between queues, ensuring fair disk access; ideal for virtualized workloads, so that one virtual machine does not starve the others. Importantly, NVMe also optimizes the CPU usage related to disk access, as it is designed to require fewer round trips between the operating system and the disk controller. There are certainly plenty of other features in NVMe, but this queuing machinery alone makes a big difference; and I understand why Mayastor would be relevant for people who want to build storage systems with the best possible performance.
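Back to Mayastor: on the Kubernetes side, consuming it still boils down to a storage class, with the replication factor and transport protocol (nvmf, i.e. NVMe over Fabrics) set as parameters. The provisioner name and parameter keys below reflect the Mayastor documentation at the time I looked; double-check them against the version you deploy.

```yaml
# Sketch: a Mayastor storage class with 3 replicas over NVMe-oF.
# Provisioner name and parameter keys taken from the Mayastor docs;
# verify against your Mayastor version.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-nvmf
provisioner: io.openebs.csi-mayastor
parameters:
  repl: "3"
  protocol: nvmf
```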

If you need help determining which engine best suits your needs, you’re not alone; and the OpenEBS documentation has a great page on this.

Container Attached Storage

Another interesting thing in OpenEBS is the concept of CAS, or Container Attached Storage. The wording made me raise an eyebrow at first. Is this a marketing gimmick? Not really.

When using the Jiva replicated engine, I noticed that for each Jiva volume, I got 4 pods and a service:

  • a “controller” pod (with “-ctrl-” in its name)
  • three “data replica” pods (with “-rep-” in their names)
  • a service exposing (on different ports) an iSCSI target, a Prometheus metrics endpoint, and an API server

This is interesting because it mimics what you get when you deploy a SAN: multiple disks (the data replica pods) and a controller (to interface between a storage protocol like iSCSI and the disks themselves). These components are materialized by containers and pods, and the storage actually lives inside the containers, so the term “container attached storage” makes a lot of sense. (Note that the storage doesn’t necessarily use the container’s copy-on-write storage; in my setup, it used a hostPath volume by default; this is configurable, however.)

I mentioned iSCSI above. I found it reassuring that OpenEBS relies on iSCSI (for both Jiva and cStor volumes), as it is a solid, battle-tested protocol widely used in the storage industry. This means that OpenEBS does not require a custom kernel module or anything like that. I believe it does require some iSCSI user-space tools to be installed on the nodes, though. I say “I believe” because on my Ubuntu test nodes, built from a very plain cloud image, I didn’t need to install or configure anything extra.

After this quick overview of OpenEBS, the most important question is: does it meet my needs? I found that its wide range of options meant it could handle just about anything I threw at it. For training, development environments, and even modest staging setups, when I need a turnkey dynamic persistent volume provisioner, the Local PV engines work great. If I want to withstand node failures, I can leverage the Jiva engine. And finally, if I want both high availability and performance, all I have to do is invest a bit of time and effort in setting up the cStor engine (or Mayastor if I have fancy NVMe drives and want to push their performance to the maximum). Being both a trainer and a consultant, I appreciate being able to use the same toolkit in the classroom and for my clients’ production workloads.