With the tremendous growth of Kubernetes across industries, demand keeps rising to make Kubernetes more adaptable to each organization's working environment. Technically speaking, this is not too challenging, since Kubernetes is highly extensible and configurable. Community contributors have also been working for a while to encapsulate these tailored requirements into a common pattern, so that Kubernetes can keep its fast-growing pace while heading in the right direction.
One of the most challenging topics in the cloud-native world has been cloud-native data management. I was very energized by the KubeCon Europe 2021 event that happened last week, and took away a lot from the Cloud Native Data Management (CNDM) Days. Aside from storage, backup and many other important areas, I am personally most interested in the development of cloud-native data analytics capabilities. As for how to make this fit into the big picture of the cloud-native vision, the answer is hands down Operators. That's why I was so motivated, and couldn't wait to get my hands dirty with the recent release of the Spark on K8S Operator provided by Google Cloud Platform (GCP).
What is an Operator, and its value to the cloud-native world
Before getting to the review, it is useful to have a quick look at the definition of an Operator and the Operator Framework. Personally I like Red Hat's definition, which describes an Operator as a translation of 'human operational knowledge' into cloud-native application management, reducing manual, repetitive tasks.
An Operator fully automates the packaging, deployment and management of a Kubernetes-native application. It is a custom controller that interacts with the Kubernetes API server and works with Custom Resource Definitions (CRDs), extensions of the Kubernetes API that can be updated independently without interrupting Kubernetes itself. The Operator pattern makes Kubernetes more sustainable and lively, and users don't need to master any tooling beyond kubectl and YAML definitions.
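To make the CRD idea concrete, here is a simplified sketch of registering a custom resource type with the API server; it is loosely modeled on the Spark Operator's own CRD, trimmed down for illustration and not the canonical manifest:

```shell
# Register a trimmed-down CRD; the schema here is illustrative only.
cat <<'EOF' | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: sparkapplications.sparkoperator.k8s.io
spec:
  group: sparkoperator.k8s.io
  names:
    kind: SparkApplication
    plural: sparkapplications
    singular: sparkapplication
    shortNames: ["sparkapp"]
  scope: Namespaced
  versions:
    - name: v1beta2
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
EOF

# The new resource type is now addressable with plain kubectl:
kubectl get sparkapplications
```

Once the CRD is in place, the Operator's controller is what watches these resources and acts on them.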
The Operator Framework further accelerates the growth of Operators with several value-adding components:
- Operator SDK: enables developers to build, test and package Operators.
- Operator Lifecycle Manager (OLM): manages the Operator lifecycle, from installation through upgrades and beyond.
- Operator Metering: reports on Operator usage.
Apart from that, there is also OperatorHub, a web console where developers and cluster administrators can discover Operators and follow instructions to onboard them into their environments. It is a public-facing website, and it is also integrated into the OpenShift and Azure Red Hat OpenShift (ARO) web consoles, where you gain quick access to those Operators.
An Operator maturity model has also been clearly defined, based on the level of sophistication of the management logic encapsulated within an Operator, as follows:
Kubernetes is a highly vibrant and fast-growing community. Contributions around Operators will put Kubernetes on a more sustainable path by lowering the technical-knowledge bar through automation, and will make it easier to integrate with DevOps practices. Operators also open up a future with more variety and more adaptability to different working environments, which will accelerate growth in both enterprise adoption and community contribution.
Setting up the playground with Kind in WSL 2
I evaluated the Operator with what I think is the simplest setup: I already had Docker and WSL 2 installed on my Windows 10 workstation, and you can follow the guide here to do the same. Since WSL 2 ships a full Linux kernel built by Microsoft, it runs Linux containers natively, and it can even use the real GPU when you need a GPU-accelerated machine learning experience. The rest is making sure the Docker Desktop WSL 2 backend points to the Linux distro that you designate, after which you'll see something similar to the following:
Then, with Kind, you can spin up a single-node local Kubernetes cluster in a matter of seconds. It is worth mentioning that the Dockershim deprecation doesn't impact Kind, because it already supports the containerd runtime for worker nodes. My setup was as simple as it gets: one node with a relatively recent Kubernetes version, and we're good to go.
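Creating that one-node cluster boils down to a single kind command; the cluster name and node image tag below are my illustrative choices, so pick an image matching the Kubernetes version you actually want:

```shell
# Create a single-node local cluster; the node image tag is an
# example -- check the kind release notes for supported tags.
kind create cluster --name spark-playground \
  --image kindest/node:v1.20.2

# Verify the node is Ready and note the containerd runtime it reports
kubectl get nodes -o wide
```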
Installing Spark on K8S Operator
With that set up, we just need to install the GCP Spark Operator. One option is a Helm chart, with the namespace set at install time via the --set flag; note, however, that some advanced features, implemented through a mutating admission webhook, need to be enabled manually. The other option is to install from the OperatorHub website. That route requires the Operator Lifecycle Manager (OLM) to be installed first, and in return you get an Operator catalog, so OLM knows where to download the Operator from. You can then use a simple kubectl create -f command to install the Spark Operator:
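As a sketch of the Helm route (the release name and namespace are my own choices, and the chart coordinates follow the project's README at the time of writing, so double-check them):

```shell
# Add the chart repo published by the spark-on-k8s-operator project
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update

# Install into a dedicated namespace; the mutating admission webhook
# is off by default and is enabled here explicitly.
helm install my-spark-operator spark-operator/spark-operator \
  --namespace spark-operator --create-namespace \
  --set sparkJobNamespace=default \
  --set webhook.enable=true
```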
By doing that, you can also watch the ClusterServiceVersion (CSV), which contains the application metadata, spin up in the namespace where you're installing the Operator, using kubectl get csv -n <namespace> -w.
Once the installation completes, you can see the Operator pod up and running in that namespace:
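Condensed, the OLM route looks roughly like this; the OLM release version and the OperatorHub manifest name are assumptions from the time of writing, so verify them against operatorhub.io:

```shell
# Install OLM itself (version pinned here purely for illustration)
curl -sL https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.17.0/install.sh | bash -s v0.17.0

# Create the Subscription published on OperatorHub for the GCP Spark Operator
kubectl create -f https://operatorhub.io/install/spark-gcp.yaml

# Watch the ClusterServiceVersion until its phase reports Succeeded,
# then check the operator pod itself
kubectl get csv -n operators -w
kubectl get pods -n operators
```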
Since OperatorHub is integrated with the web console for OpenShift/ARO, you can install the Operator there with a few clicks. You can choose to install the Operator into all namespaces (the default installation mode) or into a specific namespace on the cluster. After a successful installation, go to the installed Operators and choose the Spark Operator; the provided API then lets us create a Spark Application or Scheduled Spark Application instance through the UI, either by filling in the form (a mouse-cursor experience) or by editing the YAML provided in the UI, and then simply submitting it. OperatorHub is very practical for users who prefer a UI or want to run a quick test.
Comparison: submitting a Spark job with the native Kubernetes experience
Before testing the Operator, I thought it would be useful to have a point of comparison by running Spark jobs on Kubernetes directly. This has improved significantly since Spark 2.3, which added Kubernetes as a native scheduler backend, with an even better Kubernetes experience arriving with Spark 3.0 last June. I picked a Scala-based sample application (the SparkPi app) from the Spark samples GitHub repository, compiled it to a jar with the sbt assembly plugin and placed it on Azure Blob Storage, then took a compiled version of the Spark source code from GitHub and pushed it to my Azure Container Registry (ACR). Jobs are submitted through the spark-submit module, and the command looks very much like the following:
```shell
./bin/spark-submit \
  --master k8s://http://127.0.0.1:8001 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=your-container-registry-name/your-spark-source:your-image-tag \
  https://yourblobstorage.blob.core.windows.net/yourblobcontainer/your-assembly-0.1.0-SNAPSHOT.jar
```
Notice that the command specifies the number of executor instances, which defaults to 2; this job-scheduling setting is related to dynamic resource allocation. In the case of Spark clusters deployed with the standalone scheduler, Mesos or YARN, dynamic allocation also manages the autoscaling of instances in the Spark cluster. Once submitted, you can see the Spark driver pod spin up and run, then initiate the Spark executor pods; when the Spark job has finished running, the driver pod shows a Completed status.
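For context, on Kubernetes with Spark 3.x a form of dynamic allocation is possible without an external shuffle service, using shuffle tracking. A hedged sketch of the extra flags one might append to the spark-submit command above (the executor bounds are illustrative values, not recommendations):

```shell
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=5 \
```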
From here, I was very conscious of the time I spent on repetitive containerization-related tasks, up to pushing the compiled Spark to the container registry, before actually getting to the Spark job submission.
Spark on Kubernetes Operator experience review
The Spark Operator cuts out most of that time spent on Spark- and container-related plumbing. I used the same Spark application as in my native Spark job submission run-through, and once the Spark Operator was installed, I got to spend more quality time straight on the Spark job definition.
Based on the Spark on K8S Operator design specification, the Operator uses Kubernetes as a native scheduler backend for Spark application lifecycle management, with a mutating admission webhook handling the customization of the Spark driver and executor pods.
The Spark Operator provided by Google Cloud Platform is based on Spark 2.4 (API version v1beta1) and supports both one-time and cron-scheduled Spark applications. As of today, the examples in the GitHub repo have been updated to Spark 3.1.1 and the latest CRD version, v1beta2. You simply deploy those custom resources with kubectl apply -f, specifying the namespace either in the YAML file or with the -n flag; the Spark job is then submitted, spinning up the driver pod and then the executor pods, both of which can be specified in the YAML definition.
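A minimal v1beta2 SparkApplication for the same SparkPi app might look as follows; the image, jar URL and resource sizes are placeholders from my setup, not canonical values:

```shell
# Submit a SparkApplication resource; the operator turns it into
# a driver pod plus the requested executor pods.
cat <<'EOF' | kubectl apply -n default -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: your-container-registry-name/your-spark-source:your-image-tag
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: https://yourblobstorage.blob.core.windows.net/yourblobcontainer/your-assembly-0.1.0-SNAPSHOT.jar
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 3
    memory: "512m"
EOF
```

Compare this with the spark-submit command earlier: the executor count, main class, image and jar location all map one-to-one, but here they live in a declarative resource the cluster can re-reconcile.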
Like all the other Operators on OperatorHub, it provides declarative management via YAML files; Custom Resource Definition (CRD) examples can be found here. After you deploy the resources, you can also use kubectl get sparkapplications to check all the Spark applications that have been submitted.
Another useful command checks the status of a Spark application: kubectl describe sparkapplication yoursparkapp
The coolest part of the experience was that I could use all the native kubectl commands to monitor the status or logs of the pods, and it could be even more powerful with a monitoring stack such as Prometheus or Grafana Labs' Loki; the GCP Spark Operator provides a YAML example for Prometheus, and presumably more is yet to come.
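The monitoring loop I ended up with boils down to ordinary kubectl; the pod name below assumes the operator's default <app-name>-driver naming convention for the spark-pi example:

```shell
# Application-level status, including state transitions and events
kubectl describe sparkapplication spark-pi

# The driver logs stream the familiar Spark output
kubectl logs -f spark-pi-driver

# While the job runs, the Spark UI is reachable by port-forwarding
# the driver pod's default UI port
kubectl port-forward spark-pi-driver 4040:4040
```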
By exploring the native Kubernetes Spark job-submit experience and the Spark on K8S Operator, I got a glimpse of how these human practices and Kubernetes technology can help Data Engineers and Data Scientists in their daily lives, and how much they can boost productivity and help the community thrive once fully integrated with DevOps practices, or even further with DataOps and the Team Data Science Process (TDSP).
My next article will be more about Spark 3.0 on Kubernetes, with a focus on its scheduling capabilities in distributed computing environments. Stay tuned!