#14. NullPointerException

Topics: Architecture, data quality, data thoughts, Apache Spark


Dealing with null in Spark — Matthew Powers @ MungingData Blog.

The Ultimate Question of Life, the Universe, and Everything: What is NULL?

Thanks to @xnegxneg for this topic.

level:medium topic:spark


Separation of Compute and Data: A Profound Shift in Data Architecture — Billy Bosworth @ Dremio blog.

Splendid little article. It covers the pros, but not the cons, of separating storage and compute.

level:medium topic:architecture


Introducing Amazon S3 shuffle in AWS Glue — AWS Big Data Blog.

Good explanations of how shuffle works, how it uses local disks, how to track it in the AWS Glue UI, what problems can arise, and how to avoid them.

level:medium topic:architecture topic:spark


The Future of the Data Engineer — Barr Moses @ Towards Data Science Blog.

Feels like a bit of a heart-to-heart. Do you, as a data engineer, feel like you have the worst seat at the table?

level:beginner topic:data-thoughts


Automating Large-Scale Data Quality Verification — Amazon Research.

In this whitepaper, the Amazon Research team presents a system for automating data quality verification at scale, discusses the design decisions behind it, describes the resulting architecture, and gives an experimental evaluation on various datasets.

level:advanced topic:data-quality type:whitepaper


Written on November 12, 2021