This week, I watched a few talks from SciPy 2020 and continued my reading of Twenty-Six Words; this post mostly covers the talks.

[Talk] JAX: Accelerated Machine Learning Research

Speaker: Jake VanderPlas

This talk, from (virtual) SciPy 2020, motivates, demos, and briefly explains JAX, a Google library for high-performance machine learning research. Loosely speaking, JAX implements the NumPy API with JIT compilation, so the same code can run faster and on accelerators (GPUs or TPUs).

The demo was slick. I had no idea that JAX could be 100x faster than normal NumPy for basic matrix multiplications! But what I found most interesting was the explanation of the “tracing” mechanism that they use for compiling functions: JAX sends “abstract values” through the function to record what kind of object (the shape and dtype, I believe) each step will produce.
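
Here’s a minimal sketch of what that looks like in practice (my own toy example, not from the talk):

```python
import jax
import jax.numpy as jnp

@jax.jit
def matmul(a, b):
    # On the first call, JAX traces this function with abstract values:
    # printing `a` shows a tracer that carries only shape and dtype, no data.
    print("tracing:", a)
    return jnp.dot(a, b)

a = jnp.ones((1000, 1000))
matmul(a, a)  # first call: traces the function, compiles it with XLA, runs it
matmul(a, a)  # later calls with the same shapes reuse the compiled version
```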

This was a great overview of JAX! I plan to recommend it to my team, which uses the JAX-backed NumPyro, and to other friends as well.

[Talk] Ray: A System for Scalable Python and ML

Speaker: Robert Nishihara

This talk introduces Ray, a Python library for building distributed applications. The motivation is that existing distributed computing systems require tons of expertise to understand and use, and that the tools are becoming more specialized. Ray aims to be performant and general-purpose by providing higher-level abstractions than, say, Spark.

The core concepts appear to be wrapping functions and classes in the @ray.remote decorator to turn them into tasks and actors, respectively. Calling them returns futures rather than blocking execution, and Ray manages the overhead: setting up your remote processes (on, e.g., AWS or GCP), distributing the workloads, and providing a dashboard for debugging and monitoring.
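
A rough sketch of the pattern (my own toy example, not from the talk):

```python
import ray

ray.init()  # starts a local Ray runtime; on a cluster this would connect instead

@ray.remote
def square(x):
    # A task: runs asynchronously in a separate worker process.
    return x * x

@ray.remote
class Counter:
    # An actor: a stateful worker whose methods execute remotely.
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

# Calling .remote() returns futures immediately; ray.get blocks for the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
```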

Sounds cool! Async programming in Python has always been challenging, so I’m excited to see something new happening here. This was a good introduction to a library that I previously knew nothing about. Great stuff!

[Post] What data scientists need to know about DevOps

Author: Elle O’Brien

In this Towards Data Science post, Dr. O’Brien writes about the growing divide between data science and DevOps:

With the rapid evolution of machine learning (ML) in the last few years, it’s become trivially easy to begin ML experiments. Thanks to libraries like scikit-learn and Keras, you can make models with a few lines of code.

But it’s harder than ever to turn data science projects into meaningful applications, like a model that informs team decisions or becomes part of a product. The typical ML project involves so many distinct skill sets that it’s challenging, if not outright impossible, for any one person to master them all — so hard, the rare data scientist who can also develop quality software and play engineer is called a unicorn!
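
For what it’s worth, the “few lines of code” claim holds up; a minimal scikit-learn model really is this short (the dataset and model here are my own illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Load a toy dataset and fit a model in two statements.
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

print(model.predict(X[:3]))  # predictions for the first three rows
```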

O’Brien’s sentiment is one I’ve heard a lot, and it’s certainly been true in my experience at Nielsen: models are a small part of the actual engineering complexity involved. All of the infrastructure surrounding models (the data pipelines, deployment, serving, monitoring, … all the other things that I don’t even know exist) seems, to me, more complex than the models themselves.

The author introduces continuous integration (CI), then suggests using it to (1) automate everything you can, like data processing or model training, (2) get rapid feedback on new ideas by running things in a prod-like environment, and (3) reduce manual handoffs. She shows an example GitHub Actions workflow that retrains your model and posts visualizations or metrics in comments whenever a PR is updated.
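
A stripped-down sketch of that kind of workflow might look like the following; the script name, file names, and comment-posting action are my own guesses, not necessarily what the post uses:

```yaml
# .github/workflows/train.yml (hypothetical)
name: train-model
on: pull_request

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Retrain the model and write a metrics report
        run: |
          pip install -r requirements.txt
          python train.py --report metrics.md  # hypothetical training script
      - name: Post the report as a PR comment
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: metrics.md
```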