If you've ever worked on a machine learning project in an industry setting, you're likely aware of the significant technical debt that often plagues such projects. Over the years, the machine learning community has pushed the boundaries of the state of the art, experimenting with new architectures and learning approaches. With the rapid advancement of compute technology, researchers can now create models that would have been impossible to produce just five years ago. As a result, the capabilities of machine learning have reached a new level of technological readiness, prompting practitioners in other fields to experiment with these models.
However, as machine learning becomes more complex and mature, it becomes increasingly challenging to deploy a trained model in a production environment. While data scientists are skilled at developing valuable ML models that enhance the value stream of clients, they are not necessarily equipped to handle IT operations and deploy models efficiently and quickly. Preparing a model for production therefore calls for a specialized team of engineers to build a platform that lets data scientists focus on their core ML tasks and provide clients with more efficient service. In this article, I will explore machine learning operations (MLOps) and how it can empower both AI teams and their clients by addressing the growing challenge of deploying AI models in production.
MLOps is an approach derived from DevOps, in which engineers handle the operations tasks surrounding machine learning models rather than conventional software. In MLOps, we define CI/CD pipelines, create data automations in the cloud just as we do in DevOps, provision infrastructure for storing data and running containers, and implement additional toolchains, all in order to increase the velocity at which we train and deploy machine learning models for production.
Essentially, MLOps should empower data scientists to be self-sufficient, so that they can do what they do best, autonomously and in the way they see fit.
A good MLOps platform hides everything else behind a mature technology stack, allowing data scientists to focus on their ML experiments.
Over the past years of pushing the boundaries of machine learning engineering, industry and academia have encountered several challenges and bottlenecks in training and deploying machine learning models, owing to their ever-increasing complexity and compute requirements. Four main areas of concern can be identified when preparing an AI model for the market:
- Cognitive overload
- Technical debt
- Changing requirements
- Changing data
Let’s briefly take a look at those issues:
In terms of people, the main issue is that data scientists are highly trained and competent across the machine learning development life cycle, from gathering data to saving the trained model and reporting their findings. Yet they often spend most of their time managing compute clusters, provisioning cloud infrastructure, writing additional tools and applications to manage data, creating DevOps pipelines for the repository, declaring APIs for the model, and constantly retraining and evaluating models due to rapidly changing data and client requirements. As a result, most AI models never make it into production because of the technical debt amassing around data scientists. Taking care of these endeavors demands a large set of additional skills, and unless data science teams want to spend the next two decades doing nothing but upskilling and experiencing cognitive overload while facing deadlines and pressure, an organizational solution is needed to alleviate these issues, protect productivity, and push project velocity in AI projects.
Data scientists are not supposed to worry about these additional endeavors; they should focus on their machine learning experiments, which in itself is more than a full-time job. Often enough, models turn out to be useless, not because the data scientist made a mistake, but due to the highly experimental nature of machine learning. A data scientist can never know how useful a particular neural network architecture will be until the model has been trained, evaluated, and tested. This effectively means that they need to spend a lot of time tracking experiments, evaluating performance, and comparing different model types on the data. These tasks need the full attention and time of a data scientist.
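To make the experiment-tracking burden concrete, here is a minimal sketch of how a single training run could be logged with MLflow, one of the tools we discuss below. The dataset, model, and parameter values are illustrative placeholders rather than anything from our projects:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy data standing in for a real project dataset
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    # Log what we need in order to compare this run against other
    # architectures later: parameters, metrics, and the model artifact.
    mlflow.log_params(params)
    mlflow.log_metric("f1", f1_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```

With tracking handled this way, comparing model types becomes a matter of browsing runs in the MLflow UI instead of maintaining spreadsheets by hand.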
These issues take us to the next challenge area: technical debt. Technical debt arises when development teams prefer speedy delivery over a more careful approach, which can later cause a lot of refactoring work. If technical debt is not taken care of in time, it generates "interest", making it ever harder to pay off the more it stacks up. Data scientists do not have the time to be responsible for both the underlying technology stack for machine learning and the actual machine learning projects at the same time, and neither should they be. So who should develop the underlying technology stack that makes AI projects scalable? The answer is the platform engineers. The challenge is to create a technology stack, a so-called internal developer platform (IDP), that helps data scientists develop their models more rapidly and align them with the changing needs of clients and their data, without having to worry about integration and deployment pipelines, task automations, performance monitoring, dashboard development, and infrastructure provisioning. These tasks should be delegated to a team of platform engineers who focus solely on developing and maintaining such a platform, with the data scientists and machine learning engineers treated as their customers, in other words the consumers of the IDP, who use the platform and its MLOps components to run all of their machine learning projects.
Here I mentioned MLOps as part of a larger IDP. Isn't MLOps enough? That depends, but if models are to be integrated into systems or software, there are projects that primarily address systems development and merely use the models trained by the data scientists as a feature. In that case it makes sense to give engineers a platform where MLOps, DevOps, and any other development endeavors come together in one compatible, integrated development platform, abstracting away everything not directly related to development.
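To illustrate what treating data scientists as customers of a platform can mean in code, here is a purely hypothetical sketch of the kind of interface an IDP might expose: the data scientist registers a training function, and everything operational happens behind the decorator. None of these names correspond to a real framework:

```python
# Hypothetical interface an IDP might expose to data scientists.
# Container builds, scheduling, deployment, and monitoring would all
# be handled behind this decorator by the platform team.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PlatformJob:
    name: str
    fn: Callable
    schedule: str  # e.g. a cron expression handled by the platform

_REGISTRY: list[PlatformJob] = []

def training_job(name: str, schedule: str):
    """Register a training function; the platform runs and monitors it."""
    def wrap(fn: Callable) -> Callable:
        _REGISTRY.append(PlatformJob(name, fn, schedule))
        return fn
    return wrap

@training_job(name="sentiment-model", schedule="0 3 * * 1")  # weekly retrain
def train():
    ...  # pure ML code: load data, fit the model, log metrics
```

The point of such an abstraction is that the data scientist's file contains only ML code, while the operational lifecycle of the job belongs to the platform.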
On the client/customer side, requirements and needs may change quickly. Especially in an agile workflow, clients may ask for an important change, and it is often difficult to deliver that change in time due to the large backlog of tickets that must be worked on for a particular release. Such changes in requirements or user stories may mean that developers have to change their technology stack at the infrastructure or platform level and/or retrain an ML model or modify a service. When the technology stack is immature, these changes can require too much additional time and compound with other challenges. A good MLOps platform empowers data scientists and ML engineers to make changes more quickly without having to worry about operational tasks such as integration and deployment, protecting their work-life balance through the streamlined release cycles that MLOps can offer.
If the data on the client or provider side changes due to changes in the real world, we call this data drift. Data drift in turn causes model drift, meaning that the performance and accuracy of the model decrease over time (there are other causes of model drift as well), which is reflected in gradually decreasing metrics such as precision, recall, and F1 score. To prevent these issues and guarantee that models keep performing well, a data scientist needs to be able to apply changes to the model quickly, so that the client's value stream does not suffer from technical debt on the provider side. An immature or missing MLOps technology stack will cause bottlenecks in any related value stream, letting the pace of model delivery drift ever further from the pace at which requirements change on the client side. This issue goes hand in hand with changing client requirements, and an MLOps platform lets us tackle both.
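As a sketch of what drift monitoring might look like in practice, the snippet below compares a model's metrics on freshly labeled data against a baseline and flags the model for retraining once the F1 score drops too far. The baseline and tolerance values are assumptions for illustration:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative values: baseline measured at deployment time, plus how
# much F1 degradation we tolerate before triggering a retrain.
BASELINE_F1 = 0.91
TOLERANCE = 0.05

def check_model_drift(y_true, y_pred) -> bool:
    """Return True if the model should be flagged for retraining."""
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
    return f1 < BASELINE_F1 - TOLERANCE

# In a real platform, y_true would come from newly labeled production
# data (e.g. via Label Studio), this check would run on a schedule, and
# the metrics would feed Prometheus/Grafana dashboards.
```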
Over the past few months, we have been heavily investigating the emerging technology of MLOps, both its essential theory and the frameworks that have been developed around it. It is no secret that the MLOps community has seen a Cambrian explosion of new tooling, each tool taking care of a particular aspect of MLOps. There are now dozens of solutions focusing on specific aspects of MLOps, which raises further issues in terms of compatibility and people skills. During our research, we therefore focused our efforts on working towards a platform solution with the least amount of initial tooling and the highest level of compatibility, enriched by the best open-source frameworks.
Our research phase consisted of the following steps:
- Theoretical Research: MLOps Concepts
- Research of tools and frameworks, their pros and cons
- Tool & framework comparison (compatibility, availability, maturity, complexity)
After thoroughly investigating the theory of MLOps in a framework-agnostic fashion, we then studied some of the most popular tooling that can be integrated into a platform solution and made comparisons.
Here's a list of the tools we analyzed, in no particular order. Some of them do not directly pertain to MLOps but can be used to build a quality MLOps platform, and the list also includes some Python libraries:
- MLflow
- ZenML
- Kubeflow
- PyCaret
- Data Version Control
- Terraform
- Ansible
- Streamlit
- Kivy
- GraphQL
- Prometheus
- Grafana
- Elastic stack
- Docker
- Kubernetes
- Label Studio
- ONNX
- ArangoDB
- Apache Cassandra
- Apache TVM
- Apache Airflow
Of course, we couldn't afford a complete deep dive into all of those tools. Instead, we scoured reviews and the opinions of other engineers, conducted hands-on tests and experiments with small implementations, identified the key benefits and drawbacks, and were strict about considering only open-source technologies in order to protect people and their skills.
We then created a shorter list by eliminating anything that showed too many drawbacks or potential challenges in use.
Some frameworks, such as Kubeflow, are often combined with tools like Red Hat OpenShift, which is not open source, so we eventually removed Kubeflow from our list because it feels like it pushes users into purchasing OpenShift. We also removed PyCaret (a low-code solution built on top of famous Python-based ML frameworks such as PyTorch and scikit-learn) due to the number of bugs and its strict Python version requirements; instead, we decided to develop our own workflow library for the various ML tasks we need to train models for within our projects. We also removed ZenML, because implementing it seemed too tedious, time-consuming, and brittle. And although it was one of our first choices for the platform, we removed Apache Airflow as well, because it seems to require an unusual amount of code for implementing certain components, which would slow down development cycles for a future platform. Finally, we removed the Elastic stack, as it is no longer open source.
So now the list for our development pipeline looks like this (subject to further change):
- MLflow
- Data Version Control
- Terraform
- Ansible
- Streamlit
- GraphQL
- Prometheus
- Grafana
- Docker
- Kubernetes
- Label Studio
- ArangoDB
- ONNX
- Apache Cassandra
- Apache TVM
Unfortunately, explaining these tools one by one is beyond the scope of this blog, but feel free to look them up online. After several architectural revisions, we are now in the midst of developing a working MLOps platform from the remaining shortlist, including our own Python ML library built on top of popular libraries such as PyTorch, Hugging Face Transformers, and spaCy. With our current approach, integrating tools and frameworks has become much easier and more enjoyable: we keep the development of components modular and extensible, and we replace some tools with our own implementations (our own low-code workflow library instead of PyCaret), covering everything from infrastructure development with Terraform to operations, service development, and internal developer tools for configuring workflows and services such as our NLP pipeline.
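To give a flavor of what such a low-code workflow library enables, here is a hypothetical usage sketch; the class and parameter names are invented for illustration and do not reflect our internal API:

```python
# Hypothetical low-code workflow interface; names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class TextClassificationWorkflow:
    model_name: str = "distilbert-base-uncased"  # any Hugging Face checkpoint
    epochs: int = 3
    metrics: list = field(default_factory=lambda: ["precision", "recall", "f1"])

    def run(self, dataset_path: str) -> dict:
        """Load data, fine-tune the model, and return evaluation metrics.
        Internally this would wrap a transformers training loop and log
        each run to MLflow, so the data scientist never touches either."""
        raise NotImplementedError("sketch only")

workflow = TextClassificationWorkflow(epochs=5)
# results = workflow.run("data/tickets.csv")  # hypothetical dataset path
```

The design goal is that a whole training workflow collapses into one declarative object, with the operational details living inside the library rather than in every project.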
We are already experiencing the initial benefits of developing such a platform: running ML experiments for some of our tasks has become much more streamlined and enjoyable, boosting our productivity even though the platform is still under construction. MLOps is definitely the way to scale machine learning and streamline any related projects.
More content on MLOps and platform solutions will follow. Stay tuned!