Notes from the Wednesday morning Spark keynotes

2020-06-24 524 words spark,

Spark + AI Summit kicked off with keynotes by Ali Ghodsi (Databricks CEO), Matei Zaharia (creator of Spark) and Brooke Wenig (Databricks machine learning lead), Reynold Xin (Spark contributor), and Vishwanath Subramanian (Starbucks director of data). I wasn’t able to listen to all of it, but this post has a few highlights.

What’s new in Spark?

One seemingly big feature is Adaptive Query Execution: Spark can automatically decide how to optimize jobs and split queries up into subqueries. This is opt-in right now, and will soon become the default, making shufflings easier. They’re adding hints for join optimizations, too.

Python 2 is being deprecated (yay!) and they’re improving GPU support (right now it’s kind of hacky?). They’re also improving support for distributed machine learning (previously we had to use something like HorovodRunner, and it was super finnicky).

The redesigned UDFs look great, too; UDFs were pretty challenging to work with before now.

Other announcements: koalas and pyspark

Databricks announced koalas v1.0, noting that they’re near 80% API coverage of pandas. It’s really cool to see how far they’ve come in just a year after having announced koalas last year.

They also noted miscellaneous improvements to Pyspark, which they’re calling Project Zen (in line with the Zen of Python):

better error messages! the Pyspark stack traces that were hundreds of lines of Java, only to find a ValueError buried within, were probably my least favorite part of using Pyspark before.
better documentation (it looks pandas-inspired!)
some API changes that they didn’t expand on at all

This is great—it’s cool to see Databricks investing more in the usability of their platform.

The data lake myth

Ali spent a while discussing how the transition from data warehouses to data lakes often left a lot to be desired. In theory, you should have a data lake that has all of your data, right? In practice, when their solutions architects were working on these projects, they noticed 9 common challenges with data lakes:

hard to append data (adding newly arrived data causes incorrect reads)
hard to modify existing data (due to GDPR or CCPA)
jobs failed midway through, causing some data to be missing
real-time operations were challenging, due to inconsistencies between streaming and batch
costly to keep history
difficult to manage metadata
problems with too many files
poor performance
data quality issues (these are getting pretty vague …)

They described how Delta Lake (and recent improvements to it) can help these problems. By making every operation transactional, things will either fully fail or fully succeed, and this apparently solves the first five problems. One of the flashiest parts is that you can do a SELECT * FROM table AS OF TIMESTAMP [a year ago], which is wild!

More changes, still

Reynold went into more detail about Delta Engine, which I frankly neither understand nor expect to need to understand. Spark is faster?

Databricks announced their acquisition of Redash, which is an open-source dashboarding service for an entire company. This followed by a slick demo of Redash, which I had never seen before, but combines SQL queries with your normal database schemas with click-and-drag, dropdown-customizable visualizations. It seems nice!

Return home