On edge cases; machine learning education; the mainstream media and real and fake news; and other, shorter work.

Edge case poisoning

Author: Hillel Wayne

This post walks through a simple exercise:

  • open a book of recipes,
  • try to make a data structure that can encode the recipe you landed on,
  • flip to another page,
  • adjust your data structure as needed.

Predictably, this ends poorly. Recipes can include items in terms of their mass (150g sugar), volume (2 cups flour), or quantity (4 eggs). They can include edible and non-edible ingredients. They can include other recipes as components, or even include themselves.
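To make the pain concrete, here is a minimal Python sketch of one possible model. It is not taken from the original post: the names `Unit`, `Ingredient`, `Line`, `Recipe`, and `leaf_ingredients` are all made up for illustration, and it only covers the cases listed above (mass, volume, or count; edible or not; recipes nested inside recipes, including themselves).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Union


class Unit(Enum):
    GRAM = "g"       # mass, e.g. 150g sugar
    CUP = "cup"      # volume, e.g. 2 cups flour
    COUNT = "count"  # plain quantity, e.g. 4 eggs


@dataclass
class Ingredient:
    name: str
    edible: bool = True  # hypothetical flag for non-edible items


@dataclass
class Line:
    amount: float
    unit: Unit
    item: Union[Ingredient, Recipe]  # a line can point at a whole recipe


@dataclass
class Recipe:
    name: str
    lines: list[Line] = field(default_factory=list)


def leaf_ingredients(recipe: Recipe, _active: frozenset[int] = frozenset()) -> list[str]:
    """Collect plain ingredient names, refusing to loop on a recipe that includes itself."""
    if id(recipe) in _active:
        raise ValueError(f"recipe {recipe.name!r} includes itself")
    active = _active | {id(recipe)}
    names: list[str] = []
    for line in recipe.lines:
        if isinstance(line.item, Recipe):
            names.extend(leaf_ingredients(line.item, active))
        else:
            names.append(line.item.name)
    return names


custard = Recipe("custard", [
    Line(4, Unit.COUNT, Ingredient("egg")),
    Line(150, Unit.GRAM, Ingredient("sugar")),
])
trifle = Recipe("trifle", [
    Line(1, Unit.COUNT, custard),
    Line(2, Unit.CUP, Ingredient("flour")),
])
print(leaf_ingredients(trifle))
```

Even this small model already needs a cycle check, and it still says nothing about unit conversion, optional ingredients, or "a pinch of salt". The next page of the book would presumably break it again.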

I call this edge case poisoning. It’s almost impossible to avoid because anything dealing with the real world is going to have tons and tons of edge cases. You think handling time zones is hard? Time zones are only notorious because they are a real-world domain that we all have to deal with.

That’s what real code looks like. Ouch! Real code is about dealing with all of the awful, pesky, pervasive edge cases that arise everywhere in everything. I can’t count the number of crazy data anomalies that I’ve had to account for in my code, either by handling them (when possible) or (when not) designing my code in a way that bypasses the problem.

I think a fun (for some definition of fun) exercise for a beginner learning about data structures would be to try to design data structures for progressively more complex real-world examples. Recipes, shipping plans, TV programs, music, … pick something and try to model it in code. That could be educational.

Transparent APIs by Anders Hovmöller was in the Reddit comments for the above post. It motivates and builds up a similar, hacky example of parsing an RSS feed, handling optional authentication, encoding errors, corrupted XML, and more.

The solution is … not totally clear to me. God, Python decorators still take me several read-throughs to understand. I think this is restyled partial application or dependency injection. Hard to say, though; I personally found the blog post quite challenging to follow.
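For reference, here is roughly what I mean by those two terms, as a small Python sketch. To be clear, this is not the approach from the Transparent APIs post; `fetch_feed`, `fake_get`, and `fetch_latin1` are hypothetical names, and the "transport" is a stand-in that returns canned bytes so the example runs offline.

```python
from functools import partial
from typing import Callable


def fetch_feed(url: str, get: Callable[[str], bytes], encoding: str = "utf-8") -> str:
    """Fetch a feed as text. The transport `get` is injected by the caller."""
    # errors="replace" swaps undecodable bytes for a placeholder instead of raising
    return get(url).decode(encoding, errors="replace")


def fake_get(url: str) -> bytes:
    # Stand-in transport (an assumption for this sketch); a real one would do HTTP
    return b"<rss><channel><title>demo</title></channel></rss>"


# Dependency injection: the caller decides which transport fetch_feed uses.
text = fetch_feed("https://example.invalid/feed", get=fake_get)

# Partial application: fix the transport and encoding once, reuse the result.
fetch_latin1 = partial(fetch_feed, get=fake_get, encoding="latin-1")
same_text = fetch_latin1("https://example.invalid/feed")
```

Both routes end up with a function where the awkward options (authentication, encoding, and so on) are set in one place rather than threaded through every call.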

Where machine learning education falls short, by Jupinder Parmar on The Gradient, discusses the gap between machine learning education and practice. One of the main differences is in ‘dataset management,’ where in education you are often given a dataset for some task, but in industry you have to collect it yourself. A related challenge is in knowing how to clean and prepare data for machine learning.

It is clear that students are not taught enough about how to properly manage the data they are working with. Not only do the classes provide students with cleaned up datasets that already have been neatly pre-processed, they don’t promote much exploration beyond visualizing a couple data points. This lack of hands-on learning with how to normalize and explore datasets is detrimental to a student’s practical ability to conduct ML.

There’s more, too. Spending the first few weeks on linear classifiers and backpropagation reduces the time available for more advanced subjects, and the author suggests these basics should be covered in prerequisite classes. This was good, though a bit surface-level; it would have been nice to see more discussion of how these gaps might be overcome.

This Twitter thread by David Rothschild discusses how important the mainstream media is to electoral framing. He uses the example of Hillary Clinton’s emails from 2016, which became the salient issue of her campaign for a large number of voters. But the issue was salient because the NY Times chose to cover it repeatedly:

@nytimes did not actually think Clinton had an IT Security problem that the American people needed to know in order to choose wisely in the 2016 election: expecting Clinton to win the election, they needed a story to show how “balanced” they were by bashing her.

The book Network Propaganda makes this point several times. In an inherently asymmetric media ecosystem marked by right-wing radicalization, there cannot be any notion of balance; to provide “balanced” coverage of candidates will inevitably involve treating one unfairly.

Mainstream media’s light handling of Trump’s way worse IT Security failures shows that the issue is not inherently important, but that the mainstream media made it important for Clinton while choosing not to make it important for Trump. This applies not just to the mainstream media but also to the academic research community, which is obsessed with fake news on social media but barely explores mainstream news, especially TV, which accounts for about 85% of news consumption.

All very good points that I completely agree with.

Other things I’m reading: