The Netflix Machine Learning Infrastructure

When building machine learning models across different problem areas, data scientists usually have to deal with two questions:

  • How can I deploy my model and see results?
  • Will my model affect the rest of the system somehow?

Mainly, these two questions are about infrastructure and collaboration, which DevOps engineers are needed to solve efficiently. So we now have two roles here. A data scientist should focus on: 1.- Writing a model for a problem 2.- Deploying it and seeing the results

And a DevOps engineer should focus on: 1.- Running multiple models independently on a cluster infrastructure 2.- Building a data warehouse system to store/trigger the results

We’ll quickly introduce the Netflix machine learning infrastructure, in which Metaflow is used as an interface between the data science and DevOps worlds.

A Machine Learning Stack

  • Data Warehouse: Cluster Infrastructure and Distributed Storage
  • Compute Resources: Cluster Resources Management
  • Job Scheduler: Model Execution Management
  • Versioning: Model Versioning
  • Collaboration Tools: Model Repository
  • Model Deployment
  • ML Libraries

Data Scientist viewpoint

A data scientist will focus on writing models using any ML library and will deploy those models with some model deployment tool.

DevOps viewpoint

A DevOps engineer will manage multiple versioned models using a job scheduler running on compute resources composed of a cluster infrastructure and a data warehouse.

The Netflix Machine Learning Stack


Titus is a container management platform that natively works with AWS. It manages a large number of AWS EC2 instances for service workloads. From the Titus documentation:

“Titus is a framework on top of Apache Mesos, a cluster-management system that brokers available resources across a fleet of machines. Titus consists of a replicated, leader-elected scheduler called Titus Master, which handles the placement of containers onto a large pool of EC2 virtual machines called Titus Agents, which manage each container’s life cycle. Zookeeper manages leader election, and Cassandra persists the master’s data. The Titus architecture is shown below.”

Titus Architecture


Meson is a workflow management tool that provides job scheduling and task isolation via Titus (and therefore Apache Mesos):

Meson Architecture


Why do we need a new tool if we already have Git as a source repository? Well, nteract is more than a source repository. Data scientists also need a way to collaborate live when writing shared models or any kind of notebook. nteract is a desktop application that allows parallel editing among different users. Personally, I didn’t know this tool before, and I can say it’s really cool.


Metaflow allows data scientists to wrap the above layers of the architecture in a single place and tool. It can be used from Python or R.

Let’s walk through the following example, taken from here:

from metaflow import FlowSpec, step

class MyFlow(FlowSpec):

    @step
    def start(self):
        self.x = 0
        self.next(self.a, self.b)  # fan out into two branches

    @step
    def a(self):
        self.x += 2
        self.next(self.join)

    @step
    def b(self):
        self.x += 3
        self.next(self.join)

    @step
    def join(self, inputs):
        self.out = max(i.x for i in inputs)  # fan in: pick the max branch value
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    MyFlow()


Assuming the flow is saved as myflow.py, this is how to run it:

> python myflow.py run
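The run fans start out into the two branches a and b, then joins them. The arithmetic the run performs can be checked in plain Python (a sketch of the dataflow, not of Metaflow itself):

```python
# start: x = 0, then branch; each branch gets its own copy of x
x = 0
branch_a = x + 2  # step a
branch_b = x + 3  # step b

# join: take the maximum across the incoming branches
out = max(branch_a, branch_b)
print(out)  # 3
```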

Now, these are the features:

  • Resume with steps: Each Step has a persistent state

Fields like self.x are persisted after each method annotated with @step. Therefore, we can resume our flow from a given step:

> python myflow.py resume b

This is really helpful for troubleshooting purposes.
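The per-step persistence that makes resume possible can be mimicked in a few lines of plain Python: after each step, the flow's state is serialized so a later run can pick up from that snapshot. This is an illustrative sketch, not Metaflow's actual implementation:

```python
import pickle

# Toy illustration of per-step checkpointing: after each step runs,
# the flow state is pickled so a resume can start from that snapshot.
checkpoints = {}

def run_step(name, state, fn):
    new_state = fn(dict(state))
    checkpoints[name] = pickle.dumps(new_state)  # persist the state
    return new_state

state = run_step('start', {}, lambda s: {**s, 'x': 0})
state = run_step('a', state, lambda s: {**s, 'x': s['x'] + 2})

# "Resume" from the checkpoint taken after 'start', skipping re-execution
resumed = pickle.loads(checkpoints['start'])
print(resumed['x'])  # 0
```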

  • Scalability: Titus decorators can be used on each step

Each method annotated with @step is scheduled independently by the job scheduler. Moreover, we can define the Titus compute resources declaratively with the @titus decorator:

# ...

    @titus(cpu=16, gpu=1)
    @step
    def a(self):
        self.x += 2
        self.next(self.join)

    @step
    def b(self):
        self.x += 3
        self.next(self.join)

# ...
  • Data Load from S3 using Table

Well, all of this is about loading large amounts of data and analysing it. So, how can we load large data efficiently with Metaflow?

from metaflow import Table
# ...

    @titus(memory=20000, network=20000)
    @step
    def load(self):
        df = Table('my_s3_user', 'a_table')

# ...
  • Versioning with namespaces

from metaflow import Flow, namespace

# Access coworker 1's runs
namespace('user:coworker1')
run = Flow('MyFlow').latest_run

# Access coworker 2's runs
namespace('user:coworker2')
run = Flow('MyFlow').latest_run

# Access everyone's runs (the global namespace)
namespace(None)
run = Flow('MyFlow').latest_run
  • Deploy with a simple Meson create command

> python myflow.py meson create
  • See the metrics of models via notebooks and Metaflow hosting

Data scientists can expose job results via endpoints:

from metaflow import WebServiceSpec
from metaflow import endpoint

class MyWebService(WebServiceSpec):

    @endpoint
    def show_data(self, request_dict):
        # return the flow artifact 'x' as JSON
        result = self.artifacts.flow.x
        return {'x': result}

Calling the endpoint returns the artifact:

> curl http://host/show_data
{"x": 3}
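WebServiceSpec and Metaflow hosting are Netflix-internal, but the same idea can be sketched with Python's standard library: a tiny HTTP handler that serves a stored artifact as JSON. The artifact value and route below are illustrative placeholders:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ARTIFACTS = {'x': 3}  # stand-in for a flow's persisted artifacts

class ShowDataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/show_data':
            body = json.dumps({'x': ARTIFACTS['x']}).encode()
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # silence per-request logging

# Serve on an OS-assigned free port and query the endpoint once
server = HTTPServer(('127.0.0.1', 0), ShowDataHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/show_data'
response = json.loads(urllib.request.urlopen(url).read())
print(response)  # {'x': 3}
server.shutdown()
```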


This article is based on an InfoQ presentation by Ville Tuulos from Netflix.

[ Netflix, Machine Learning ]