Spark + AI Summit is happening virtually next week, and I will be spending the second half of next week watching talks from there. I went through the schedule to make a preliminary list of the talks that sound interesting to me. And just like last year, I’ll be writing about them here.

My interests (what I’m filtering for)

I’m certain that all of these talks will provide value to someone out there, but I obviously can’t watch them all, nor do I want to! I’m looking out for (1) talks that might be interesting to me, (2) talks that might be useful to my organization in data science (broadly, researchers developing new methodologies), and (3) talks that might be useful to my company (large and very enterprise-y, but committed to Spark).

This means there are a few kinds of things I’m not interested in:

  • “how our company transitioned to Spark” talks, because I’m not in a position where that’s relevant to me, and we’re already heavily invested in Spark
  • A lot of technical deep-dives (e.g., about Spark internals), which go over my head
  • Spark + <other tool>, like Delta Lake, HorovodRunner, other Apache software, or Kubernetes, which are usually too specialized for me
  • Some adjacent fields (security, data engineering and warehousing, streaming, etc.), which I’m typically not interested in

Even though Spark Summit will be virtual, it’s still easy to get burnt out at conferences. This one has hundreds of talks over three days, and realistically I’m probably not going to watch more than four or five each day besides the keynotes.

What follows is my pared-down list of the full talks schedule. I’m definitely interested in other suggestions, though, so feel free to send them my way.

One final note: all times listed are in CDT, since I’m in Chicago. The talks website lists everything in PDT, and a lot of my colleagues are in New York, so convert accordingly.

Wednesday, June 24

The morning keynotes (10:30 AM - 12:30 PM CDT) are probably worth watching:

  • Realizing the vision of the data lakehouse by Ali Ghodsi (CEO and co-founder of Databricks)
  • Introducing Apache Spark 3.0: A retrospective of the last 10 years, and a look forward to the next 10 years to come by Matei Zaharia (creator of Spark) and Brooke Wenig (ML Practice Lead at Databricks)
  • How Starbucks is achieving its ‘Enterprise Data Mission’ to enable data and ML at scale and provide world-class customer experiences by Vishwanath Subramanian (director of data & analytics engineering at Starbucks)

Here are some interesting-looking talks from the first half of the day:

| Time (CDT) | Title and link | Speakers | Why I'm interested |
| --- | --- | --- | --- |
| 1:00 - 2:00 PM | Deep Dive into the New Features of Apache Spark 3.0 | Xiao Li and Wenchen Fan (Databricks) | I'm mostly interested in the new UDFs, since Spark UDFs have been a pain point for my teams for a while. |
| 1:00 - 1:30 PM | Portable Scalable Data Visualization Techniques for Apache Spark and Python Notebook-based Analytics | Douglas Moore (Databricks) | I'm really into data viz! Headless rendering sounds really interesting, as does generally scaling up your visualizations and integrating with Spark. |
| 1:00 - 1:30 PM | Power of Visualizing Embeddings | Pramod Singh (Bain & Co.) | My team is getting started with embeddings for one of our models; visualizing and understanding them sounds interesting. |
| 1:35 - 2:05 PM | Productionalizing Models through CI/CD Design with MLflow | Mary Grace Moesta & Peter Tamisin (Databricks) | I love MLflow and have gotten a lot of value out of it, plus I'm interested in learning more about CI/CD for model deployment. |
| 1:35 - 2:05 PM | Tuning ML Models: Scaling, Workflows, and Architecture | Joseph Bradley (Databricks) | I don't really care about hyperparameter tuning, but it's a common enough topic that I figured someone might. |
| 2:10 - 2:40 PM | How (Not) To Scale Deep Learning in 6 Easy Steps | Sean Owen (Databricks) | The "here are common pitfalls" talks are often interesting because they open my eyes to problems I'd never considered before. The talk description features the line "deep learning sometimes feels like sorcery," which, well, yeah. |

Then there are the afternoon keynotes:

  • Racism and policing: the path forward by Dr. Phillip Atiba Goff from the Center for Policing Equity
  • Science vs Covid, lessons from by Amy Heineike, an author and professor at Loyola
  • The Signal and the Noise: the Big Lessons from 20 years of Data Analysis by Nate Silver from FiveThirtyEight

None of these have descriptions (?), and I’m not really sure what the first two have to do with Spark or AI, but I’m interested in hearing about them regardless. There are more talks that follow these keynotes—

| Time (CDT) | Title and link | Speakers | Why I'm interested |
| --- | --- | --- | --- |
| 4:30 - 5:00 PM | A Production Quality Sketching Library for the Analysis of Big Data | Lee Rhodes (Verizon) | This is one of the "I've never heard of this but it sounds cool!" talks, which is about real-time stochastic streaming algorithms. These are needed when your data is too big for exact queries (e.g., count distinct), and "sketches" are the only option. |
| 4:30 - 5:00 PM | An Approach to Data Quality for Netflix Personalization Systems | Preetam Joshi & Vivek Kaushal (Netflix) | I love hearing about what Netflix has to deal with at their scale. This appears to be a talk about how they maintain data quality across hundreds of terabytes of data and several machine learning models that are frequently retrained. |
| 4:30 - 5:00 PM | Koalas: Making an Easy Transition from Pandas to Apache Spark | Takuya Ueshin (Databricks) | Koalas was released last year, and it seemed promising but incomplete. What's it like now? |
| 4:30 - 5:00 PM | Zipline – A Declarative Feature Engineering Framework | Nikhil Simha (Airbnb) | The "here is our custom ML platform" talks give me a glimpse into how other companies do ML, and what they view as their biggest problems. Airbnb has made a lot of cool stuff before, so this could be good. |
| 5:05 - 5:35 PM | Preventing Abuse Using Unsupervised Learning | James Verbus (LinkedIn) | Unsupervised learning is a pretty big gap of mine, and I've heard about isolation forests from one of my coworkers. |
| 5:40 - 6:10 PM | Pandas UDF and Python Type Hint in Apache Spark 3.0 | Hyukjin Kwon (Databricks) | The new UDFs are one of the most exciting parts about Spark 3.0, and this talk seems like it'll be a good technical overview of what's new. |

Wow, that was a lot; I am starting to feel burnt out just from writing it.

Thursday, June 25

The morning keynotes are from 11 AM to 12:30 PM CDT:

  • Simplifying Model Development and Management with MLflow by Matei Zaharia and Sue Ann Hong
  • How Credit Suisse is leveraging open source data and AI platforms to drive digital transformation, innovation and growth by Anurag Sehgal (meh)
  • Introducing the next generation Data Science Workspace by Clemens Mewald and Lauren Richie (this looks like a new product announcement from Databricks)
  • Responsible ML - bringing accountability to data science by Rohan Kumar and Sarah Bird from Microsoft (yes! thank you!)

And here are the talks I’m looking at—

| Time (CDT) | Title and link | Speakers | Why I'm interested |
| --- | --- | --- | --- |
| 1:00 - 1:30 PM | Artificial Lawyers. Will Your Next Attorney be a Machine? | Fernando Ortega Gallego & Eduardo Matallanas de Ávila (Plain Concepts) | This sounds like an interesting application that I don't know very much about. What's going on in AI + law? |
| 1:00 - 1:30 PM | Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflow | Michael Shtelma & Thunder Shiviah (Databricks) | Maybe I'm getting too many MLflow talks, but this sounds interesting; automation, continuous retraining, and deployment are blind spots of mine. |
| 1:00 - 1:30 PM | Everyday Probabilistic Data Structures for Humans | Yeshwanth Vijayakumar (Adobe) | I don't know anything about probabilistic data structures, but one of our teams is using Bloom filters for a certain application. This sounds like a nice high-level overview of these tools. |
| 1:30 - 2:00 PM | Scaling Quantitative Research on Sensitive Data | Slava Frid (Worldquant Predictive) | This sounds like an interesting challenge: building models and managing data while minimizing the number of eyes that see the data itself, and ensuring transparency and audits. |
| 1:30 - 2:30 PM | Women in Unified Data Analytics Panel Discussion | Ali Vanderveld (ShopRunner), Franziska Bell, Ph.D. (Toyota Research), Rama Assaf-Smith (Comcast) | This is a panel by three female leaders in analytics, and it for sure looked worth including. |
| 2:10 - 2:40 PM | On Improving Broadcast Joins in Apache Spark SQL | Jianneng Li (Workday) | I learned about broadcast joins last year and found them to be a great trick to have up my sleeve; this sounds like a good place to learn more. |
| 2:10 - 2:40 PM | The 2020 Census and Innovation in Surveys | Zack Schwartz (U.S. Census Bureau) | The intersection of traditionally-small-data surveys with the scale of the US Census creates interesting problems (according to the talk description). |
| 2:10 - 2:40 PM | Using Bayesian Generative Models with Apache Spark to Solve Entity Resolution Problems (DeDup, Merging, Uniqueness) at Scale | Charles Adetiloye & Timo Mechler (MavenCode) | Bayesian modeling! How does that work with Spark? |
| 2:10 - 2:40 PM | Using Machine Learning to Evolve Sports Entertainment | David Cunningham (DataFactory) & Young Bang (Atlas Research) | I guess Nielsen is in entertainment. The talk description poses interesting questions. |

The Thursday afternoon keynotes are from 3 to 4 PM CDT:

  • Deep Learning: It’s Not All About Recognizing Cats and Dogs by Kim Hazelwood, who is the head of engineering at Facebook AI Research, talking about how recommendation systems can be problematic.
  • Creating, Weaponizing, and Detecting Deep Fakes by Hany Farid, a Berkeley professor, which appears to be an overview of deepfakes and emerging techniques for detecting them.

I’m losing steam on explaining why I find each talk interesting; it’s hard to pin down why I want to watch one beyond “this sounds cool.” But here are the Thursday afternoon talks.

| Time (CDT) | Title and link | Speakers | Why I'm interested |
| --- | --- | --- | --- |
| 4:30 - 5:00 PM | Automated and Explainable Deep Learning for Clinical Language Understanding at Roche | David Talby (Pacific AI), Vishakha Sharma & Yogesh Pandit (Roche) | I saw "explainable" and got excited. It looks like embeddings, transfer learning, and post-hoc explanations. |
| 4:30 - 5:00 PM | Building Identity Graphs over Heterogeneous Data | Sudha Viswanathan & Saigopal Thota (Walmart Labs) | Identity graphs are an interesting problem, and relevant to a few folks at my company. |
| 5:05 - 5:35 PM | Geospatial Analytics at Scale: Analyzing Human Movement Patterns During a Pandemic | Joel McCune & Jim Young (Esri) | Human movement during COVID sounds like a really interesting question; I picked this talk simply because I want to learn about the subject. |
| 5:40 - 6:15 PM | Automatic Forecasting using Prophet, Databricks, Delta Lake and MLflow | Perry Stephenson (Atlassian) | This seems like a nice end-to-end talk, covering how they build forecasts, train models, and deploy them with a robust pipeline. (Ugh, though, the self-promotion in the description: "the demand for our legendary support continues to grow." I know that these are partially for branding, but come on.) |

Friday, June 26

One more day! The morning keynotes, from 11 AM to 12:30 PM CDT, include:

  • PyTorch: a modern machine learning research and production platform by Adam Paszke, which seems mostly like a pitch for PyTorch.
  • Rapid Response Research for COVID-19 and other challenges: Machine Learning and Data Science at Cal by Prof. Jennifer Chayes. This will describe the vision of Berkeley’s division of computing, data science, and society (CDSS), especially in the context of COVID. Sounds great!

| Time (CDT) | Title and link | Speakers | Why I'm interested |
| --- | --- | --- | --- |
| 12 - 12:30 PM | Democratizing PySpark for Mobile Game Publishing | Ben Weber (Zynga) | I went to Weber's talk at Spark Summit last year, and it was fantastic (see my notes!); this is on my list because he's back, and also because the idea of democratizing analytics sounds quite cool. |
| 12:35 - 1:05 PM | Automated Testing For Protecting Data Pipelines from Undocumented Assumptions | Eugene Mandel & Abe Gong (Superconductive) | This is a talk about testing data and data pipelines, instead of just code; they introduce a framework they built for capturing undocumented assumptions and validating new data against them. |
| 12:35 - 1:05 PM | Chromatic Sparse Learning | Vladimir Feinberg (Sisu Data) | This is another "what the hell is this? whoa!" talk. They demonstrate "a chromatic approach to sparse learning" which uses approximate graph coloring to reduce the cardinality of a large (here, 3.2M) feature space. |
| 1:10 - 1:40 PM | Generalized SEIR Model on Large Networks | Amir Kermany (Databricks) | This discusses SEIR models for simulating the spread of infectious diseases, and how these can be implemented in Spark. |
| 1:45 - 2:15 PM | Designing the Next Generation of Data Pipelines at Zillow with Apache Spark | Derek Gorthy & Nedra Albrecht (Zillow Group) | This goes over how Zillow data engineers identified pain points and tech debt, then turned around to build more robust Spark pipelines. |
| 1:45 - 2:15 PM | From Idea to Model: Productionizing Data Pipelines with Apache Airflow | Daniel Imberman (Astronomer, Airflow committer) | Many of our teams are starting to use an Airflow-based tool (though not me), and this seems relevant for learning more about that. The speaker is an Airflow maintainer. |
| 1:45 - 2:15 PM | Scoring at Scale: Generating Follow Recommendations for Over 690 Million LinkedIn Members | Emilie de Longueau & Abdulla Al-Qawasmeh (LinkedIn) | I chose this because it sounds like a fun problem: how does LinkedIn generate recommendations for nearly 700 million users to follow any of tens of millions of entities? They developed a new join algorithm that reduces shuffling, and they talk about its impact on downstream metrics. |

And that’s all! If there are any talks that you feel like I missed, please feel free to send them along. I’ll be back next week with notes from some of these.