Hello! This week’s reading was a lot of Morning Paper articles, AI skepticism from a VC firm, and posts from /r/MachineLearning.
The Morning Paper articles
Author: Adrian Colyer
I read three articles from The Morning Paper this week:
- Meaningful Availability is about the Google G Suite team’s search for a meaningful availability metric: one that reflects what their users experience while also being useful to their engineers. The post describes the path from “proportion uptime” (my first instinct when thinking about availability) to a more informative metric.
- Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure presents Gandalf, the software deployment monitor that Microsoft built and that Azure has used for (at least) a year and a half. The write-up describes how Gandalf improved deployment times, and (my favorite) how teams built trust in it as it detected complex failures that experts missed. Gandalf ingests a large amount of data and passes it through fast (~5 minutes) and slow (~hours) paths to detect defects. The algorithms used are a combination of anomaly detection, correlation analysis, and a decision engine.
- Firecracker: lightweight virtualization for serverless applications is by far the most systems-heavy of these papers. It describes the virtual machine monitor behind AWS Lambda. The write-up explains some of the typical problems with containers, language VMs, and virtualization, then how Firecracker solves them.
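As a rough illustration of why “proportion uptime” (from the Meaningful Availability item above) can fall short, the sketch below compares it with a request-based success ratio. The function names and every number are hypothetical, and this is not the metric the G Suite team ended up with; it only shows how a time-based figure can look healthier than what users experienced.

```python
def proportion_uptime(minutes_up, minutes_total):
    """Time-based availability: fraction of minutes the service was up."""
    return minutes_up / minutes_total


def success_ratio(requests_ok, requests_total):
    """Request-based availability: fraction of user requests that succeeded."""
    return requests_ok / requests_total


# Hypothetical 30-day month with a single 60-minute outage.
minutes_total = 30 * 24 * 60
uptime = proportion_uptime(minutes_total - 60, minutes_total)

# Assumed traffic: 10 million requests in the month, 200,000 of which
# arrived (and failed) during the outage because it hit the busiest hour.
ratio = success_ratio(10_000_000 - 200_000, 10_000_000)

print(f"proportion uptime: {uptime:.4%}")
print(f"success ratio:     {ratio:.4%}")
```

With these made-up numbers the time-based figure is noticeably higher than the request-based one, because an outage at peak traffic affects far more users than its share of the clock suggests.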
Authors: Martin Casado and Matt Bornstein
This is an article from Andreessen Horowitz, a VC firm, expressing skepticism towards AI startups. The authors noticed that AI companies often do not have “the same economic construction” as more traditional software companies, and sometimes even look like services companies. Many AI companies have:
- lower gross margins, due to heavy cloud computing costs and human support
- scaling challenges, due to edge cases
- weaker defensive moats, due to the commoditization of AI and the fact that data by itself is often not a strong enough moat
Software companies build a product once and sell it many times; this low marginal cost creates high gross margins, recurring revenue, and linear scaling. Services companies require new people for each project, resulting in lower margins and challenges scaling. The authors argue that AI companies are increasingly combining elements of both; at the heart of these companies are trained models that can perform complex tasks. But the maintenance of these models can often feel like a services business.
Gross margins 1: cloud computing is a substantial (sometimes hidden) cost. Training a model can cost hundreds of thousands of dollars, and inference and storage can get expensive. Model complexity is growing rapidly, and computing costs are growing with it.
Gross margins 2: many applications rely on “human in the loop.” Datasets often have to be manually cleaned and labeled, which is expensive and time-consuming. This process has to continue throughout a model’s lifespan, too. In other cases, ML systems work alongside humans (social media moderation, medical inference with physicians, autonomous driving). It is challenging (in terms of both model performance and regulation) to create fully automated ML systems.
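The two gross-margin points come down to simple arithmetic: gross margin is revenue minus the cost of delivering the product, divided by revenue. The sketch below is only an illustration; the function name and all dollar figures are hypothetical and are not taken from the article.

```python
def gross_margin(revenue, cost_of_revenue):
    """Return gross margin as a fraction of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cost_of_revenue) / revenue


# Hypothetical company with $1M in revenue and $200k in baseline
# delivery costs (hosting, support).
revenue = 1_000_000
baseline_costs = 200_000
software_margin = gross_margin(revenue, baseline_costs)

# Same company with assumed extra AI costs: cloud compute for training
# and inference, plus human-in-the-loop labeling and review.
cloud_compute = 150_000
human_in_the_loop = 100_000
ai_margin = gross_margin(
    revenue, baseline_costs + cloud_compute + human_in_the_loop
)

print(f"software-style margin: {software_margin:.0%}")
print(f"AI-style margin:       {ai_margin:.0%}")
```

The point is that compute and labeling are recurring costs of delivery, so they reduce the margin on every dollar of revenue instead of being paid once up front.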
Scaling is hard, because models live “in the long tail.” Because ML models often operate on messy data with little to no structure, edge cases are everywhere. In the VC firm’s experience, “40 – 50% of intended functionality … can reside in the long tail of user intent.” Handling different data requirements is difficult; imagine two manufacturers who have cameras in different places on assembly lines, who are both interested in defect detection. Those require separate models. It’s also difficult to accurately predict how long data collection and model training will take.
Technical differentiation (a moat) is harder to achieve. New model architectures are mostly developed in the open; the norm in ML is to open-source reference implementations; and more generally, ML is being commoditized over time. This means that ML, by itself, may not be enough for defensibility.
The authors provide a lot of advice for founders. The two pieces I found most interesting are eliminating model complexity by creating a single model (or set of models), which is easier to roll out and maintain; and reducing data complexity by choosing problem domains carefully and focusing on high-scale, low-complexity tasks.
This is really salient as Nielsen tries to “embrace AI”. Building general purpose models is hard, and so instead we have a number of models that were created for one purpose (e.g., demographic assignment to TV viewers in a particular panel) which constantly have to be adapted to new use cases. This is largely caused by data complexity in new problems that we take on.
Granted, we’re not an AI startup—but these problems apply to companies that are trying to use more ML too. Edge cases are pervasive, and the complexity of our (and our clients’) data is, in my experience, the biggest challenge to getting anything done. Defensibility isn’t much of a problem for Nielsen, but the more gritty details of AI certainly are.
The Storytelling Style Guide from NZZ Visuals, with an accompanying blog post, documents the visual language that their organization uses. Their visual principles are “determined” (graphics have a clear purpose and message), “accentuated” (highlight what’s important), “precise” (using the best data and clear language), and “authentic” (reflecting their collective voice). Guidelines include introducing readers to data, guiding readers through them, including context, and prioritizing communication over decoration. Having read style guides for written text, it was interesting to see a data journalism team’s perspective on creating a unified language for graphics.
OpenAI’s Jeff Clune on deep learning’s Achilles’ heel and a faster path to AGI is just an article about an interview with Jeff Clune. He argues that there’s a faster path to AGI: meta-learning algorithms and architectures, which are the subject of a paper published Monday. I didn’t get much out of this article, sadly.
Transformers are Graph Neural Networks is a blog post by Chaitanya Joshi in the NTU Graph Deep Learning Lab. It establishes links between the Transformer architecture, used in GPT-2, BERT, and more, and Graph Neural Networks. Broadly speaking, Transformers are GNNs that use multi-headed attention as the “neighborhood aggregation function”: sentences are treated as neighborhoods, and attention decides which words are “important.”
By establishing links between the two architectures, the author can focus on the question of what they can learn from each other. One question is whether fully connected graphs are the best format to use, since they make learning long-term dependencies difficult. From NLP, we have ideas like sparse or adaptive attention, recurrence or compression, or locality sensitive hashing. From GNNs, we might have the idea of binary partitioning to sparsify graphs.
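To make the “attention as neighborhood aggregation” idea a bit more concrete, here is a minimal sketch in plain Python. It is heavily simplified and is not the code from Joshi's post: it uses a single attention head, skips the learned query/key/value projections (the raw word vectors stand in for all three), and the toy sentence and function names are made up for illustration. Each word is a node, and `neighbors` says which other nodes it attends to, so the same function covers both the fully connected case and a sparser graph.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_aggregate(features, neighbors):
    """One round of attention-weighted neighborhood aggregation.

    features:  list of word vectors (lists of floats), one per node
    neighbors: neighbors[i] lists the node indices that node i attends to
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    updated = []
    for i, query in enumerate(features):
        # Score each neighbor by scaled dot-product similarity.
        scores = [dot(query, features[j]) / scale for j in neighbors[i]]
        weights = softmax(scores)
        # New representation: attention-weighted sum of neighbor vectors.
        updated.append([
            sum(w * features[j][d] for w, j in zip(weights, neighbors[i]))
            for d in range(dim)
        ])
    return updated


# Toy "sentence" of three 2-dimensional word vectors (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
n = len(sentence)

# Transformer-style: every word attends to every word (fully connected).
fully_connected = [list(range(n)) for _ in range(n)]
# Sparser alternative: each word attends only to itself and adjacent words.
local_window = [[j for j in range(n) if abs(i - j) <= 1] for i in range(n)]

dense_out = attention_aggregate(sentence, fully_connected)
sparse_out = attention_aggregate(sentence, local_window)
```

Swapping `fully_connected` for `local_window` changes only the graph, not the aggregation rule, which is the sense in which sparse attention and graph sparsification are two views of the same design question.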
This is a good post, though rather out of my depth. I appreciate the tremendous amount of work which clearly went into this.
Attention and Augmented RNNs, by Chris Olah and Shan Carter, is a Distill publication about the attention mechanism in RNNs at a higher, more conceptual level. They break down four interesting research applications of attention: Neural Turing Machines, attentional interfaces, adaptive computation time, and neural programmers. One way of thinking about attention is that it “takes every direction at a fork, and then merges the paths back together”; in other words, it allows a model to spread out how much it cares about different parts of an input.
This article was a great introduction to what I previously understood very little of. I don’t ever see myself going into machine learning research, but reading about new advances in the field is valuable to me. I’m also a huge fan of Distill for pushing the boundaries of ML articles by devoting themselves to exceptional communication. It would have been impossible for me to get started on this topic had I tried reading the original papers.
Speeding Up Transformer Training and Inference by Increasing Model Size by Eric Wallace is a blog post from Berkeley AI Research. Wallace summarizes a new paper that suggests that, when training Transformer models, it may be better to increase model size but cut off training early. “We rethink the implicit assumption that models must be trained until convergence by demonstrating the opportunity to increase model size while sacrificing convergence.”
This happens because large models converge to lower test error with fewer gradient updates. Empirically, training deeper or wider versions of RoBERTa shows lower perplexity and higher BLEU (?) given the same wall clock training time. I find this interesting because there is so much we don’t know about neural networks, and the topology of the training landscape is one of the major unknowns. I’m excited to see where research goes in the future.
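For reference, perplexity is the exponential of the average negative log-likelihood a language model assigns to the correct tokens, so lower is better. BLEU, by contrast, is an n-gram overlap score used for machine translation, where higher is better; it is not computed here. The sketch below is a generic illustration with made-up probabilities and a hypothetical function name, not code or numbers from the paper.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model gave each correct token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities from two models on the same text.
confident_model = perplexity([0.5, 0.5, 0.5, 0.5])
unsure_model = perplexity([0.1, 0.1, 0.1, 0.1])

print(f"confident model perplexity: {confident_model:.2f}")
print(f"unsure model perplexity:    {unsure_model:.2f}")
```

A model that gives every correct token probability 1/k has perplexity k, which is why perplexity is often read as the number of equally likely choices the model is effectively picking between.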