You probably don't need JupyterHub on Kubernetes.
There are currently two major open-source approaches to deploying Jupyter notebooks on Kubernetes:
Make notebooks a core feature on Kubernetes
This is usually done with a Custom Resource Definition (CRD), so that Kubernetes treats a Notebook the way it treats a Pod or a Secret. These custom resources (CRs) are backed by an Operator that knows the notebook management logic and takes charge of your notebooks based on the configuration you provide.
While this approach is well integrated with the Kubernetes ecosystem, it also adds complexity and a significant maintenance burden, even for those familiar with Kubernetes: you have to maintain the CRD and learn how to interact with the Operator.
The most familiar and best-maintained of these is the one Kubeflow provides. The problem is that you need to deploy many other components (if not the entire stack) to get access to the Operator. And even if we could deploy it on its own, I still think we don't really need an extra Operator looking after our notebooks: Kubernetes can take care of them on its own.
Run JupyterHub on Kubernetes
JupyterHub has been around for a long time and was serving notebooks even before Kubernetes gained momentum. As more people started running applications inside Kubernetes, JupyterHub was compelled to run, as is, inside Kubernetes as well.
This approach avoids reinventing the wheel by using purpose-built adapters such as Kubespawner and those in Zero-to-jupyterhub-k8s. As a result, people who are used to managing notebooks outside of Kubernetes won't feel lost or disoriented.
The drawback of this approach is that it relies on "glue code" to connect JupyterHub with Kubernetes, which feels hacky and introduces feature redundancy:
- JupyterHub relies on its own Kubespawner to spawn the Kubernetes resources (Pod, PVC, etc.) that represent a notebook. But why add another custom Kubernetes client when we already use Helm?
- JupyterHub has its own auth layer, but why not use the Kubernetes authn/authz features for user management?
- JupyterHub comes with its own node-http-proxy for reverse proxying, but wouldn't it be better to use the well-established Ingress NGINX Controller?
- JupyterHub uses its own jupyterhub-idle-culler to identify and shut down idle or long-running notebooks. But might it not be more efficient to use the built-in Kubernetes Horizontal Pod Autoscaler?
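To make the reverse-proxy point concrete, here is a minimal sketch of the per-notebook routing an Ingress controller could consume in place of a dedicated proxy. The function name, the URL path layout, and the host are hypothetical and chosen only for illustration; the `networking.k8s.io/v1` Ingress shape and Jupyter's default port 8888 are standard.

```python
import json


def notebook_ingress(name, namespace, host, service_port=8888, ingress_class="nginx"):
    """Return an Ingress manifest (as a dict) routing one URL prefix to one notebook.

    Hypothetical helper: it assumes a Service named after the notebook exists
    in the same namespace and listens on `service_port`.
    """
    path = f"/notebook/{namespace}/{name}"  # illustrative path layout
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "ingressClassName": ingress_class,
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": path,
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": name,
                                        "port": {"number": service_port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = notebook_ingress("alice-nb", "notebooks", "jupyter.example.com")
    # JSON is valid YAML, so this output could be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Each notebook gets one small declarative object, and the controller that already serves the rest of the cluster picks it up; no notebook-specific routing table has to be kept in sync.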
Enter notebook-on-kube
notebook-on-kube is a simple Python application based on FastAPI that:
- Relies on Helm to manage notebooks, so you can customize every part of your notebook via a YAML file.
- Uses your Kubernetes OpenID Connect token to manage notebooks on your behalf, which reuses Kubernetes RBAC and is more transparent.
- Deploys an Ingress NGINX Controller instance and configures it for each notebook via Ingress resources.
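As a rough illustration of the first two points, the sketch below builds a Helm command line that installs or upgrades one notebook release while authenticating as the user. This is not notebook-on-kube's actual code: the function name, chart path, and values file are hypothetical, and passing the token through Helm's global `--kube-token` flag is an assumption about how such a tool might work.

```python
from __future__ import annotations

import shlex


def build_helm_command(
    release: str,
    chart: str,
    namespace: str,
    values_file: str,
    user_token: str,
) -> list[str]:
    """Build the argv that installs or upgrades one notebook release.

    Hypothetical helper for illustration only. Returning a list (rather than
    a shell string) lets it be passed to subprocess.run() without quoting issues.
    """
    return [
        "helm",
        "upgrade",
        "--install",      # create the release if it does not exist yet
        release,
        chart,
        "--namespace",
        namespace,
        "--values",       # per-notebook customization lives in a YAML file
        values_file,
        "--kube-token",   # act as the user, so Kubernetes RBAC decides what is allowed
        user_token,
    ]


if __name__ == "__main__":
    cmd = build_helm_command(
        release="alice-nb",
        chart="./notebook-chart",       # hypothetical chart location
        namespace="notebooks",
        values_file="alice-nb.yaml",    # hypothetical values file
        user_token="<oidc-id-token>",   # placeholder, never log a real token
    )
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Because the request reaches the API server with the user's own token, permissions are governed by the same RBAC rules as any `kubectl` call that user could make, which is the transparency the list above refers to.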
notebook-on-kube leverages the existing features and tools that are designed to run applications on Kubernetes, providing a third, middle-ground approach that is easy to maintain and well-integrated for managing notebooks on Kubernetes. Give it a try!
The photo below illustrates the hardware equivalent of "glue code". Whether Jupyter represents the flash drive or the system unit in the photo is open to debate. However, it is widely acknowledged that it would be significantly more convenient if we could directly plug in the flash drive.
The approach that I have demonstrated here can be extended to any other legacy software. With Kubernetes becoming the new Linux, let's make our applications Kubernetes-friendly, particularly when the process is straightforward.