What did I learn speaking to companies exhibiting at ODSC Boston this Spring? (for the Apr 23-24, 2024 conference).
Naturally, my observations center on my interest in data and AI infrastructure: data lakes, ingestion, machine learning datasets, training, and evaluation.
Metaflow
Ville Tuulos, Co-Founder and CEO of Outerbounds, gave me a demo of the Metaflow platform.
Metaflow is an AI workflow orchestrator for Python, open-sourced by Netflix. It is paired with an Enterprise version, but much can be achieved simply by hosting the open source version yourself: on a local workstation, on a local Kubernetes cluster, or in a cloud environment running Kubernetes.
- Metaflow flows are Python classes that inherit from the FlowSpec base class, and their step methods use Python decorators (@step, @batch, @card, @environment, @kubernetes…); flow parameters are declared as class-level Parameter attributes.
- The decorators define the flow steps, as well as the environment in which each step is executed.
- Steps can fan out in parallel, and parallel steps can be followed by a join step.
One of the powers of the system is that you can develop locally and, when happy with the flow of the steps, decorate a step so it runs in parallel on a containerized Kubernetes cluster, on pods with specific GPUs.
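Here is a minimal sketch of that pattern. The class name, the shard list, and the single-GPU request are made up for illustration, but the FlowSpec base class, the @step and @kubernetes decorators, the foreach fan-out, and the join step follow Metaflow's documented API.

```python
from metaflow import FlowSpec, step, kubernetes

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.shards = [0, 1, 2]                  # hypothetical data shards
        self.next(self.train, foreach="shards")  # fan out: one branch per shard

    @kubernetes(gpu=1)                           # run each branch on a GPU pod
    @step
    def train(self):
        self.result = f"trained on shard {self.input}"
        self.next(self.join)

    @step
    def join(self, inputs):                      # join step gathers the parallel branches
        self.results = [branch.result for branch in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(self.results)

if __name__ == "__main__":
    TrainFlow()
```

During local development you can leave out the @kubernetes decorator and run the same file on your workstation; Metaflow can also attach the decorator at run time via `--with kubernetes`.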
Metaflow is well integrated with AWS – no surprise, given that Netflix, the originator of Metaflow, is a heavy user of AWS.
Docker containers in Metaflow. Comparison with Databricks.
Metaflow maintains its own Docker containers, with pinned versions. According to Tuulos, this is very helpful for developers, simplifying their Python module version management.
“People say AI will solve everything,” quipped Tuulos, “but I have yet to see AI solving the Python module versions selection!”
I pointed out to Tuulos that Databricks (a platform I am pretty familiar with) also pins Python module versions in its runtimes, and that this is a great help for developers. But, I said, Databricks runtimes actually seem to use LXC containers! Developers can bring their own Docker container to Databricks, but it will seemingly run on top of LXC!
Tuulos noted that Apache Spark, on which Databricks is based, originated before Docker became prevalent, and that there is still, perhaps, pre-Docker infrastructure in the Databricks stack.
My experience, however, has been that it’s hard to build your own Docker container for Databricks. Containers can easily become very large when using PyTorch, and then they time out when loading onto the Databricks platform. At least, that was a problem in the Fall of 2023; it may have been fixed since.
Anyhow, Metaflow does not have these problems: you can either use a standard Docker container maintained by Metaflow, or build your own Docker container and run it on the Kubernetes cluster.
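If my reading of the decorator is right, pointing a step at your own image is then a one-line change to the train step from the sketch above (the registry path below is hypothetical):

```python
    # Swap the default image for your own; Metaflow pulls it for this step's pod.
    @kubernetes(image="registry.example.com/my-team/pytorch-train:2.1", gpu=1)
    @step
    def train(self):
        self.result = f"trained on shard {self.input}"
        self.next(self.join)
```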
What are Metaflow’s strengths?
I’d have to use the platform hands-on to really answer accurately, rather than just sit through a couple of demos.
But if I am looking for an AI workflow tool that works locally, and can also easily throw individual workflow steps at a Kubernetes cluster configured with high-end GPUs, then Metaflow would be a good choice.
Metaflow offers a VSCode extension that allows you to edit locally and execute remotely. This lets you easily use tools like GitHub Copilot.
For comparison, in Databricks I can also do local development before throwing workloads at a Databricks cluster. However, that is accomplished in a slightly more cumbersome way, by using the dbx CLI tool to synchronize my local Git sandbox with the Databricks cloud sandbox.
As regards ML training: for single-node multi-GPU, Databricks is, I think, just as easy to use as Metaflow, but for multi-node multi-GPU, Metaflow becomes easier.
The strength of Databricks, on the other hand, lies in its integrated data lake, integrated MLflow, and model store. You get a pretty complete integrated system for data and machine learning.
Ultimately, each tool has its peculiarities and benefits.
References about Metaflow
- Netflix Tech Blog: Open-Sourcing Metaflow, a Human-Centric Framework for Data Science (2019)
- A. Goblet: A Review of Netflix’s Metaflow (2019)
- Ricardo Raspini Motta: Kedro vs ZenML vs Metaflow: Which Pipeline Orchestration Tool Should You Choose? (2024)
- Deploying Infrastructure for Metaflow
- Ville Tuulos: Metaflow: The ML Infrastructure at Netflix
Lightning AI
Robert Levy, Staff Applications Engineer at Lightning AI, gave me a quick overview of the platform.
Lightning AI, basically, makes it simpler to use PyTorch – removing a lot of the repetitive code that is often found in machine learning training infrastructure.
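The open-source PyTorch Lightning API gives a feel for what gets removed: the training loop, device placement, and logging boilerplate move into the framework, and you mostly write the model and the loss. A minimal sketch, with a made-up toy model and hyperparameters:

```python
import torch
from torch import nn
import lightning as L

class LitClassifier(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Toy MNIST-sized classifier, purely for illustration.
        self.model = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.model(x.view(x.size(0), -1)), y)
        self.log("train_loss", loss)  # logging is handled by the framework
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# The Trainer owns the loop, device placement, and checkpointing:
# trainer = L.Trainer(max_epochs=3, accelerator="auto", devices="auto")
# trainer.fit(LitClassifier(), train_dataloaders=train_dl)  # train_dl is a hypothetical DataLoader
```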
It provides the Lightning AI Studio, an IDE for training models locally or on the cloud, using CPUs or GPUs.
I asked Robert Levy whether Lightning AI Studio provisions pods in Kubernetes clusters, the way we saw with Metaflow. It does not: instead of Kubernetes, Lightning AI Studio configures the EC2 instances directly (assuming it is installed on AWS).
Robert explained that the kube-state-metrics module would be needed to monitor a Kubernetes cluster, and it can be very CPU intensive. This was one of the reasons direct EC2 use was preferred.
Lightning AI provides distributed training, and automated checkpointing. The Lightning Optimizer includes various techniques and algorithms to automatically adjust hyperparameters such as learning rates, batch sizes, and other training parameters to improve model performance and convergence speed.
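For the distributed training and automated checkpointing part (not the hyperparameter tuning), this is roughly how it is expressed with the Lightning Trainer; the GPU count, node count, and metric name below are arbitrary example values:

```python
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint

# Keep only the best checkpoint, ranked by validation loss.
checkpoint_cb = ModelCheckpoint(monitor="val_loss", save_top_k=1, mode="min")

trainer = L.Trainer(
    accelerator="gpu",
    devices=4,            # 4 GPUs per node (example value)
    num_nodes=2,          # 2 nodes, i.e. multi-node distributed training (example value)
    strategy="ddp",       # PyTorch DistributedDataParallel under the hood
    callbacks=[checkpoint_cb],
    max_epochs=10,
)
# trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)
```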
The Lightning AI Apps provide ready-to-use example applications that can serve as development templates, which is a nice way to get familiar with the platform and to speed up new development.
XetHub
This is a Git extension that allows for dataset management under source control. In that sense, it is similar to DVC. But, given that it is fully integrated with Git, XetHub may be easier to use (I have yet to try it hands-on).
My opinion is that tools like DVC and XetHub are good for smaller datasets, where you need a quick way to save your data without developing excessive infrastructure.
Once the datasets become large, and need to be generated or processed programmatically using data pipelines, it is better to store them directly in the data lake, or in object storage services like S3, GCS, or Azure Blob Storage.
Source control is good when source code is checked in manually. Once pipelines generate data, source control is not an appropriate tool for data version control.
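To make the small-dataset workflow concrete: DVC, for example, exposes a Python API for reading a file pinned to a Git revision. The repository URL, file path, and tag below are hypothetical:

```python
import dvc.api

# Read a CSV tracked by DVC, as of the Git tag "v1.0" of a (hypothetical) repository.
data = dvc.api.read(
    "data/train.csv",
    repo="https://github.com/example-org/example-repo",
    rev="v1.0",
)
print(len(data))
```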
Dagster and Prefect – orchestration tools
Both Dagster and Prefect had booths at ODSC. I was pretty impressed with both.
They both implement declarative orchestration. What does that mean?
Instead of defining the workflow programmatically, with function calls invoked in certain states (imperative programming), these tools define the states and the possible state transitions.
I happen to be familiar with a tool named Terraform (and am, actually, a pretty heavy user of it!). Terraform does declarative programming for cloud infrastructure: you define, in configuration files, which S3 buckets, EC2 instances, RDS databases, network VPCs, subnets, ECS clusters, etc., you want in AWS (or in GCP or Azure, with the requisite infrastructure and service names).
This simplifies infrastructure programming tremendously. Indeed, for large infrastructure projects, declarative programming is the only way to be able to maintain the infrastructure.
It’s the same with Dagster and Prefect: they are declarative tools for workflows. To a great extent, Metaflow is also a declarative tool for AI workflows.
What kind of workflow tools are Dagster and Prefect, specifically?
Dagster is a data orchestration tool. It manages data ingestion, data processing, reprocessing from a checkpoint.
It is ideal for projects where data requires multiple processing steps, and where you want to avoid reprocessing data that has already been processed.
Dagster has a concept of materializing a node: if a node’s dependency has new data, the node itself can be re-materialized (i.e., it processes the latest data).
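In Dagster’s asset model this looks roughly like the sketch below. The asset names and the data are hypothetical, but the @asset decorator and the dependency-by-parameter-name pattern are Dagster’s documented API:

```python
from dagster import Definitions, asset

@asset
def raw_events():
    # Ingestion step, stubbed out with in-line data for illustration.
    return [{"user": "a", "value": 1}, {"user": "b", "value": -2}]

@asset
def cleaned_events(raw_events):
    # Depends on raw_events; re-materializes when the upstream asset has new data.
    return [e for e in raw_events if e["value"] > 0]

# Loaded by `dagster dev` to browse and materialize the assets in the UI.
defs = Definitions(assets=[raw_events, cleaned_events])
```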
Prefect, by contrast, is a function orchestration tool that intentionally does not touch the data itself. Prefect is therefore a good choice when the orchestration is meant to be data content-agnostic. With Prefect, none of the data is pushed to the Prefect orchestrating server; only the function calls are tracked.
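A minimal Prefect sketch of that pattern; the task bodies are made up, but @task and @flow are Prefect’s documented decorators. Only the state of each task run is reported to the Prefect server, while the return values stay in the process running the flow:

```python
from prefect import flow, task

@task
def extract():
    return [1, 2, 3]          # stand-in for a real extraction step

@task
def transform(values):
    return [v * 2 for v in values]

@flow
def etl():
    # Prefect tracks the task runs and their states; the data itself stays local.
    data = extract()
    return transform(data)

if __name__ == "__main__":
    etl()
```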
Low-Code AI Automation: KNIME and MarkovML
For the first time – or maybe I had not been paying sufficient attention? – I saw products for low-code AI automation. These products attempt to make data ingestion, dataset management, training, inference, model serving as simple as drag-and-drop.
KNIME is an open source platform for low-code AI automation; I spoke to their data scientist Roberto Daniele Cadili. The company is based in Germany, and boasts a large ecosystem of components.
MarkovML covers similar ground, but is a proprietary platform.
This is a category I plan to follow more closely, given that it may very well be the future of the industry. The question, of course, is how effective one can be when the tools can only approach the problem at a high level.
I asked Roberto Daniele Cadili of KNIME who their ideal customers are. The answer, basically: people who deal with data every day, but do not have the technical engineering skills to make use of the data at their disposal.
Conspicuously absent from ODSC Boston?
I did not see many tools for large language models or prompt engineering.
Some tools were available for fine-tuning large language models, but they did not figure prominently in the lineup.
Nor did I see tools for high-end, large-scale dataset management, or end-to-end tools specialized for robotics.
The large cloud vendors were not present – AWS, GCP, Azure, Oracle. Neither were data lake vendors like Databricks or Snowflake.
The tools that were present, at the exhibit, could best be described as middleware for AI infrastructure – and most had an open source version, with an Enterprise add-on.
It was a good sampling of products, but a smaller selection than what I’ve seen at biotech or healthcare conferences. This can be explained, I think, by a smaller user market: an engineering audience, rather than the larger audiences available in the biotech, insurance, or hospital markets.
Andrei Radulescu-Banu is Founder of Analytiq Hub, which develops data and AI architectures for healthcare, revenue cycle management, and robotics workflows.