I was introduced to Apache Druid a year and a half ago. During this time I've focused on operationalizing Apache Druid on Kubernetes (K8s). I started with Helm Charts to spin up Druid clusters in this complex distributed Druid + K8s system, but I realized Helm Charts alone were not enough.
I’ve written Golang-based operators (custom controllers in Kubernetes) for different use cases and contributed to various open-source operators, so I was familiar with extending Kubernetes using Custom Resource Definitions (CRDs). I was thrilled to discover the Druid Operator, which had just been open sourced in the Druid community in late 2019. The project was less than a month old when I started contributing to it.
But first...
Why a Kubernetes Operator for Druid?
Kubernetes is an orchestration engine that runs and manages containerized applications. Each application has its own way to autoscale, handle ingress and egress, manage updates, and perform other operational tasks. Kubernetes provides generic building blocks that apply to all applications, but it has no built-in knowledge of any one application's internals.
An Apache Druid cluster consists of multiple node types: Coordinator, Overlord, Router, Broker, Historical, and MiddleManager. While all of these can be installed easily using simple Helm charts, the ongoing maintenance of these nodes becomes a nightmare. Managing multiple node types is complex, because you need to manually monitor and manage rolling upgrades for each component in a specific order.
The Druid Operator understands Druid's internal concepts and manages the cluster for better uptime, high availability, seamless rolling upgrades, and easier management.
The operator does the hard work of maintaining your application's state and scaling the Druid nodes continuously, without causing any downtime in the cluster.
An operator understands the application and opens endless opportunities to build application logic into Kubernetes. It matters that Kubernetes knows it is not just spinning up a pod, but a Druid pod that meets all of Druid's operational needs.
Installing Druid Operator
The Druid Operator supports both Helm chart and manifest-based installation. The Helm chart installation also creates the appropriate Role-Based Access Control (RBAC) resources and the Deployment for the operator.
The command below installs the Druid Operator.
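A minimal sketch of the Helm-based install; the chart path, release name, and namespace below are illustrative and may differ for your checkout of the druid-operator repository:

```sh
# Clone the operator repository and install its Helm chart.
# (Release name and namespace are examples; adjust to your environment.)
git clone https://github.com/druid-io/druid-operator.git
cd druid-operator
helm install druid-operator ./chart --namespace druid-operator --create-namespace
```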
The Custom Resource Definition (CRD) is where the Group, Version, and Kind are defined; a Custom Resource (CR) is an instance created from that CRD. The details of the Druid CRD are as follows:
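At the time of writing, the Druid custom resource is identified by the group, version, and kind below, so every Druid CR starts with a header like this (the name and namespace are placeholders):

```yaml
apiVersion: druid.apache.org/v1alpha1   # group: druid.apache.org, version: v1alpha1
kind: Druid                             # kind: Druid
metadata:
  name: my-druid-cluster
  namespace: druid
```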
The operator also needs a ServiceAccount and RBAC resources at installation time in order to access the Kubernetes API; the Helm chart installs all of these. If you want to run the operator cluster-scoped, the Roles and RoleBindings are changed to ClusterRoles and ClusterRoleBindings.
For further information on the Druid Operator, refer to this doc: https://github.com/druid-io/druid-operator/blob/master/docs/getting_started.md
Creating your first Druid Cluster with Operator
The operator supports both StatefulSets and Deployments for the six basic node types:
- Routers
- Coordinators
- Overlords
- Historicals
- MiddleManagers
- Brokers
Historicals can be further divided into multiple tiers, such as hot and cold. You can also run the Coordinator and Overlord in the same container/pod for simplicity.
Design the CR Spec for Apache Druid
The CR has a cluster spec and nodeSpecs. The cluster spec holds fields common to all Druid nodes, whereas each nodeSpec holds fields specific to one node, distinguished by the Druid nodeType.
The nodes field is a map[string]nodeSpec, so multiple nodeSpecs can be created, each under a unique key, and each nodeSpec carries one of the Druid nodeTypes listed above.
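A structural sketch of that shape (the keys under nodes are arbitrary names you choose; nodeType must be one of the Druid node types):

```yaml
spec:
  # cluster spec: fields common to all Druid nodes (image, startScript, ...)
  nodes:                 # map[string]nodeSpec
    brokers:             # map key: any unique name
      nodeType: broker
      # fields specific to this node group
    historicals:
      nodeType: historical
      # ...
```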
Druid Operator includes a sample CR spec here which can be used as a quickstart for a tiny Druid cluster.
Run the commands below to install the sample tiny cluster.
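Assuming you have cloned the operator repository and want the cluster in a druid namespace, the sample specs can be applied directly (file names as of this writing; check the examples/ directory in the repo if they have moved):

```sh
kubectl create namespace druid
# A single-node ZooKeeper used by the quickstart cluster:
kubectl apply -f examples/tiny-cluster-zk.yaml -n druid
# The tiny Druid cluster CR itself:
kubectl apply -f examples/tiny-cluster.yaml -n druid
```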
Common runtime properties, shared by all Druid nodes, are defined in the cluster spec, whereas per-node runtime properties are defined in each nodeSpec.
You can optionally specify kind (Deployment or StatefulSet) in the nodeSpec. The operator defaults to creating StatefulSets. If you need a Deployment instead of a StatefulSet, just add kind: Deployment to the nodeSpec, as shown in the sketch after this list.
- Coordinators, Brokers, and Routers are stateless in nature, so Deployments make more sense and avoid the complexity of managing StatefulSets.
- Historicals and MiddleManagers are stateful in nature, and StatefulSets are recommended for these node types, where the data on disk needs to be recovered.
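For example, a Broker nodeSpec running as a Deployment might look like this (a sketch; the other required nodeSpec fields are omitted):

```yaml
spec:
  nodes:
    brokers:
      kind: Deployment      # defaults to StatefulSet when omitted
      nodeType: broker
      replicas: 2
```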
The operator also validates the spec, and certain fields are required for the CR to be reconciled by the operator. The following keys must exist:
- common.runtime.properties - common Druid runtime properties
- commonConfigMountPath - mount path for the common properties ConfigMap
- startScript - command to start the container
- image - image for the CR; the image can be set per node or be common to all nodes
- runtime.properties - runtime properties specific to each node
- nodeConfigMountPath - mount path for each node's runtime properties
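A skeleton showing just these required keys (values are placeholders; see the sample tiny-cluster spec for complete, working values):

```yaml
apiVersion: druid.apache.org/v1alpha1
kind: Druid
metadata:
  name: minimal-cluster
spec:
  image: apache/druid:0.19.0           # can also be set per node
  startScript: /druid.sh
  commonConfigMountPath: /opt/druid/conf/druid/cluster/_common
  common.runtime.properties: |
    druid.zk.service.host=zookeeper
  nodes:
    brokers:
      nodeType: broker
      nodeConfigMountPath: /opt/druid/conf/druid/cluster/query/broker
      runtime.properties: |
        druid.service=druid/broker
```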
For further information refer to the Druid Operator Druid CR Spec.
Druid Operator Features
Rolling Upgrades
The Druid Operator can install or upgrade Druid nodes in parallel or in a rolling-deploy fashion, following the order defined in the Druid docs.
When rollingDeploy is set to true, the operator upgrades the nodes in order. The operator hard-fails when a node reaches an error state, which prevents the entire cluster from going down.
You can see the progress of the upgrades through the events on the Druid CR by running the following command:
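For example (the CR name and namespace below are from the tiny-cluster quickstart; substitute your own):

```sh
# Events are recorded on the Druid CR itself:
kubectl describe druid tiny-cluster -n druid

# Or watch the raw event stream for the namespace:
kubectl get events -n druid --watch
```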
These events can be consumed by various monitoring tools, and you can create alerts on them in case of failures.
Selective Node Upgrade
Changes made to the CR trigger the operator's reconciler. The Druid Operator is smart enough not to restart all Druid components on every rolling deploy: if a change is made to the common spec, it updates all node types, but if a change applies to only one node type, the operator updates only that node type.
Automated Cleanup of PVC using Finalizers
Finalizers are a Kubernetes concept: when an object is deleted, a deletionTimestamp is set on it, the finalizer's pre-delete hooks run, and only then is the object actually removed.
Typically, on deletion of a StatefulSet, its PersistentVolumeClaims (PVCs) are not cleaned up; that is a manual task. The Druid Operator implements a finalizer for the deletion of StatefulSets by default.
As a result, when the Druid CR is deleted, the PVCs created by the underlying StatefulSets are cleaned up automatically.
You can optionally disable this feature by setting the boolean field disablePVCDeletionFinalizer in the spec.
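For example (field name as exposed on the Druid CR spec):

```yaml
spec:
  # Opt out of the default PVC-deletion finalizer:
  disablePVCDeletionFinalizer: true
```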
Manage Orphaned PVCs
Typically, as you operate Druid, you scale MiddleManagers up and down based on the set of ingestion tasks that need to run. Consider scaling MiddleManagers from 1 to 10 and back again: when the MiddleManager StatefulSet is scaled back to 1, nine PVCs are left hanging, tying up unused resources and increasing cost.
The Druid Operator can transparently delete PVCs that have been orphaned by any StatefulSet, regardless of nodeType.
This feature is disabled by default; you can enable it by setting deleteOrphanPvc to true.
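For example:

```yaml
spec:
  # Allow the operator to delete PVCs that are no longer referenced
  # by any StatefulSet replica after a scale-down:
  deleteOrphanPvc: true
```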
Self Healing StatefulSets
When a StatefulSet with the OrderedReady pod management policy crashes, it can get into a broken state that requires manual intervention.
This issue can hamper the operator's rolling upgrades.
To self-heal, the Druid Operator detects the crash, deletes the affected pod to recover it, and applies the new config/pod changes.
Seamless scaling of StatefulSet Volumes
To scale a StatefulSet's volumeClaimTemplates, users normally have to perform various manual steps to resize the PVCs, since StatefulSets do not allow changes to the volume claim templates; only fields such as the replicas, the pod template, and the update strategy can be modified.
Behind the scenes, the operator detects the state change in the Custom Resource (CR) and, when one is found, performs a non-cascading deletion of the StatefulSets and reconciles their re-creation. All of this happens completely transparently, without any manual intervention.
Zero downtime, no deletion of PVCs, and an uninterrupted user experience.
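For example, growing the Historical data volume is just an edit to the CR (a sketch; the claim name and storage class are placeholders, and the storage class must support volume expansion):

```yaml
spec:
  nodes:
    historicals:
      nodeType: historical
      volumeClaimTemplates:
        - metadata:
            name: data-volume
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: ssd
            resources:
              requests:
                storage: 60Gi        # e.g. increased from 30Gi
```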
High Availability for Druid Operator
Operators/controllers can run in High Availability (HA) mode, as per the operator pattern. Only a single controller is active at a time, watching and reconciling events from the API server.
The other operator pods remain on standby as non-leaders. Behind the scenes, the pods compete for a lock and the leader acquires a lease.
The Druid Operator uses the leader-election functionality provided by the Kubebuilder scaffolding to acquire the lock and provide HA.
Kubectl Druid Plugin
You can also use the kubectl Druid plugin to simplify managing your Druid deployments even further. It extends kubectl so that it understands the Druid CR and simplifies operations on it.
The plugin provides a set of commands for operating on the Druid CR.
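The exact subcommands vary by plugin version, so the quickest way to see what your install provides is the plugin's own help output (assuming it is installed as kubectl-druid on your PATH, per the usual kubectl plugin convention):

```sh
# Confirm the plugin is discovered and list its commands:
kubectl plugin list
kubectl druid --help
```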
The next blog will go in depth on the technical details of the operator with respect to the Kubernetes ecosystem. We’ll also share how you can get involved in the community and contribute directly to the Druid Operator.