Happy leap year! This week focuses on articles about data science: production code, pipeline building, and design thinking.

Data scientists, the only useful code is production code

Author: Thomas Huijskens

How I found this: linked in a separate blog post by QuantumBlack that I was reading as part of interview prep

Summary: the author posits that production code is any code that feeds a business process. This includes ad-hoc analyses that inform financial decisions, reports that are automated and sent out weekly, and (most commonly understood as “production code”) modeling pipelines. Code is a main product or byproduct of a data scientist’s work, and so its quality is paramount. The post goes on to include recommendations for writing good code, which I encourage anyone interested to read for themselves.

Thoughts: this is a more extreme version of what I wrote in Software Engineering is an Essential Part of Data Science, and I wholeheartedly agree. I’d never thought of ad-hoc analyses as production code, but extending that definition to anything that might need repeating seems prudent. Having a “software-first” mindset is critical to being an effective data scientist.

Kedro: A new tool for data science

Author: Jo Stichbury

Summary: this is a blog post introducing Kedro, a data science workflow tool that facilitates pipeline creation. It helps make code more reproducible, modular, and well-documented. It is not a workflow scheduler, and it can work alongside something like MLflow: Kedro provides the architecture and structure, and MLflow provides the tracking framework. The post continues with a basic usage tutorial.
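For a flavor of the structure Kedro imposes, here’s a minimal sketch of a pipeline (the functions and dataset names are my own invention, not from the post):

```python
from kedro.pipeline import Pipeline, node


def preprocess(raw_data):
    # Hypothetical cleaning step: drop rows with missing values.
    return raw_data.dropna()


def train_model(clean_data):
    # Hypothetical "training" step, standing in for a real model fit.
    return clean_data.mean()


# Each node declares its inputs and outputs by name; Kedro wires the
# nodes into a DAG and resolves those names against a data catalog.
data_science_pipeline = Pipeline(
    [
        node(preprocess, inputs="raw_data", outputs="clean_data"),
        node(train_model, inputs="clean_data", outputs="model"),
    ]
)
```

Paired with a DataCatalog and a runner like SequentialRunner, Kedro executes that DAG; something like MLflow would then sit alongside it to log parameters and metrics.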

Thoughts: I’m getting a little exhausted by all the libraries claiming to be the missing puzzle piece in the data science workflow. Tons of companies have built their own data science platforms (see: Uber’s Michelangelo, Facebook’s FBLearner Flow, Nielsen’s Intelligence Studio and Uni, and who knows how many others). But QuantumBlack claims to have used this on 60+ projects before open-sourcing it, so I have hope for its robustness.

Looking at the GitHub issues, I appreciate the uncommon strategy of proactively seeking user input with “how do you feel …” questions. The repo appears to be maintained, though the project hasn’t existed for very long, and the maintainers seem receptive to public feedback. That checks most of my open source hygiene boxes.

Weaving data stories and driving impact: a design-led approach

Author: Thomas Essl

Summary: this post talks about the virtues of design thinking in a data science context. The design team started with the questions “what is the true underlying problem” and “whose problem is it” before looking at any data, worried that looking at data too early might bias or limit their thinking. Starting there lets you use data as both a constraint and an enabler, driving research forward while helping you understand what might or might not be achievable.

When searching for an answer to a data-focused question, we are obligated to first make sure the question is the right one.

The author continues: uncover human stories in data by thinking of each row in terms of the person(s) it represents, and pay special attention to the data you don’t have (this in particular can help inform a broader strategy).

Thoughts: I love design thinking, and I’m really grateful to Northwestern Engineering for helping (read: requiring) me to learn it. Going from building models to designing solutions is, to me, an important step in a data scientist’s development, and it’s a step that I think I have begun taking in my current work. This post fits perfectly in the “human-centered data science” box within my interests.

It is therefore vital to build models with the user’s context and needs—their human story—to the front of mind. The work of an outcome-focused analytics team does not finish once a model is delivered—it must be proved to work, adopted by the user and optimised depending on how it is deployed.

I love this. Data science doesn’t start with a dataset; it starts with the people or events behind the dataset, and it must consider what data is there and what isn’t (and why). Likewise, it doesn’t end with a model; it ends with the people your work affects, and the impacts that it has on them. This is what I care about—how data science, machine learning, and complex technical systems affect real people.

How to be better election forecasters

Author: G. Elliott Morris

How I found this: my friend and former coworker, Judah

This piece appears to be a criticism of another analyst (or, perhaps, pundit) that morphs into general election forecasting advice. The biggest issue with novice election forecasters, Morris claims, is that they fail to think probabilistically. That doesn’t mean failing to report uncertainty or speaking only in point estimates (though I’ve seen that happen); rather, it means failing to account for the possibility of systematic error, treating 80+% as “very likely,” or talking about things that “will” or “won’t” happen.

Another important component is aggregating uncertainty: figuring out how we jump from state-level forecasts (whose outcomes are correlated) to national results. That’s “the million dollar question,” he writes, but typically we draw correlated random errors and call it a day. Accounting for known unknowns is also challenging—there are errors from misidentifying the size of the electorate or what voter turnout will be. Good forecasts account for these.
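To make “draw correlated random errors and call it a day” concrete, here’s a toy Monte Carlo sketch with three hypothetical states (every number is invented for illustration; this is the generic technique, not Morris’s actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented inputs: point forecasts of one party's vote share in three
# states, plus a covariance matrix encoding correlated forecast errors.
mean_share = np.array([0.52, 0.50, 0.48])
cov = np.array([
    [0.0009, 0.0006, 0.0006],
    [0.0006, 0.0009, 0.0006],
    [0.0006, 0.0006, 0.0009],
])
electoral_votes = np.array([10, 20, 30])  # 60 total, so 31 is a majority

# Correlated draws: when the forecast misses in one state, it tends to
# miss in the same direction in the others (the systematic error above).
sims = rng.multivariate_normal(mean_share, cov, size=100_000)
ev_won = (sims > 0.5).astype(int) @ electoral_votes
print(f"P(EV majority) = {(ev_won >= 31).mean():.2f}")
```

Zero out the off-diagonal terms and the simulation treats every state-level miss as independent, which makes the national forecast look artificially confident; the correlations are what let one polling miss ripple across states.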

This was a good piece by someone who appears to be just getting into election forecasting. Morris wrote about how the Democratic primary is in uncharted waters, and I largely agree. But I love that thrill, so it’s fun to watch.

Other, shorter articles (and media)

The ethics of artificial intelligence (podcast) is an episode of McKinsey’s podcast on how companies can deploy AI ethically. It is substantially more pragmatic than most AI ethics pieces I read, and I think that was its primary value to me. AI deployment is often a risk management game, the speakers posit, and these risks take many forms: to the creators (financial, reputational), to the people the models affect, to the public, etc. Likewise, “bias” has a lot of meanings, but in practice it usually means “disparate impact,” which is well-defined legally.

You think about everyone needs to understand a little bit about AI, and they have to understand, “How can we deploy this technology in a way that’s ethical, in a way that’s compliant with regulations?” That’s true for the entire organization. It might start at the top, but it needs to cascade through the rest of the organization.

This was quite good—I expected a lot of fluff, but instead found a nuanced discussion of complex, overlapping issues. Also, thanks, McKinsey, for providing a transcript of the podcast!