Topics: Architecture, data quality, data thoughts, Apache Spark
Dealing with null in Spark — Matthew Powers @ MungingData Blog.
The Ultimate Question of Life, the Universe, and Everything: What is NULL?
Thanks to @xnegxneg for this topic.
Separation of Compute and Data: A Profound Shift in Data Architecture — Billy Bosworth @ Dremio blog.
Splendid little article. It covers the pros, but not the cons, of separating storage and compute.
Introducing Amazon S3 shuffle in AWS Glue — AWS Big Data Blog.
Good explanations of how shuffle works, how it uses local disks, how to track it in the AWS Glue UI, what problems can arise, and how to avoid them.
The Future of the Data Engineer — Barr Moses @ Towards Data Science Blog.
Feels like a bit of a heart-to-heart. Do you, as a data engineer, feel you have the worst seat at the table?
Automating Large-Scale Data Quality Verification — Amazon Research.
In this whitepaper, the Amazon Research team presents a system for automating data quality verification at scale, discusses the design decisions behind it, describes the resulting architecture, and gives an experimental evaluation on various datasets.
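To give a feel for what "data quality verification" means in practice, here is a minimal sketch in plain Python. It is not the paper's system or API: the metric functions `completeness` and `uniqueness`, the `verify` helper, and the sample rows are all hypothetical names made up for illustration, assuming that checks are declared as a metric plus a threshold and evaluated against a dataset.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]
# A check is (name, metric over the rows, predicate on the metric value).
Check = Tuple[str, Callable[[List[Row]], float], Callable[[float], bool]]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 1.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def uniqueness(rows: List[Row], column: str) -> float:
    """Fraction of non-null values in `column` that occur exactly once."""
    values = [r.get(column) for r in rows if r.get(column) is not None]
    if not values:
        return 1.0
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


def verify(rows: List[Row], checks: List[Check]) -> Dict[str, bool]:
    """Evaluate every declared check and report pass/fail per check name."""
    return {name: predicate(metric(rows)) for name, metric, predicate in checks}


# Hypothetical sample data: one missing email, one duplicated id.
rows = [
    {"id": 1, "email": "a@example.org"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.org"},
    {"id": 4, "email": "d@example.org"},
]

checks: List[Check] = [
    ("email is at least 70% complete",
     lambda r: completeness(r, "email"), lambda m: m >= 0.7),
    ("id is unique",
     lambda r: uniqueness(r, "id"), lambda m: m == 1.0),
]

results = verify(rows, checks)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

Keeping the checks as data (a list of declarations) rather than hard-coded `if` statements is the design choice worth noticing: the same `verify` loop can run against any dataset, which is the sort of thing that makes automation at scale plausible.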