Articles tagged with spark

Simplify PySpark testing with DataFrame equality functions — Haejoon Lee, Allison Wang and Amanda Liu @ Databricks Engineering Blog

Finally, PySpark has built-in functions for testing DataFrame equality, starting from Spark 3.5. No more additional libraries for testing. Or, maybe…

Spark SQL Query Engine Deep Dive – Adaptive Query Execution: part I & part II – Linxiao Ma

It’s an absolutely brilliant blog. I adore it. These articles explain Spark Adaptive Query Execution in great depth.

Spark Connect Available in Apache Spark 3.4 — Databricks Blog

Just to keep up with new Spark features. Give it up for Spark Connect.

Apache Spark — Job monitoring — Hareesha Dandamudi

A short, easy-to-follow article about SparkListener. Quite useful for building monitoring and lineage tracking (as OpenLineage does). Use it as a starting point for your own research :)

Upgrading Data Warehouse Infrastructure at Airbnb — Ronnie Zhu @ The Airbnb Tech Blog

Airbnb decided to upgrade Spark and start using Iceberg. The article covers the motivation, case studies, and tuning experience.

Apache Spark Performance Boosting — Halil Ertan @ Towards Data Science.

We understand that it’s just a few Spark optimization hacks, but it’s a nice way to start.

Building data platform in PySpark. Part 1. Python and Scala interop — Sergey Ivanychev @ Joom Blog.

Why and how to use Scala in PySpark.

Dealing with null in Spark — Matthew Powers @ MungingData Blog.

The Ultimate Question of Life, the Universe, and Everything: What is NULL?

Thanks to @xnegxneg for this topic.
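
As a quick illustration of why NULL deserves a whole article, here is a minimal pure-Python sketch of SQL-style three-valued logic, with `None` standing in for NULL. It is not Spark code: the helper names (`sql_eq`, `sql_null_safe_eq`, `sql_not`, `sql_where`) are hypothetical and only mimic the behavior of `=`, the null-safe `<=>` operator, `NOT`, and `WHERE`.

```python
def sql_eq(a, b):
    """Mimic SQL `=`: any comparison involving NULL yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def sql_null_safe_eq(a, b):
    """Mimic null-safe equality (`<=>` in Spark SQL): never returns NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def sql_not(a):
    """Mimic SQL NOT: NOT NULL is still NULL."""
    return None if a is None else not a


def sql_where(rows, predicate):
    """Mimic WHERE: keep a row only when the predicate is exactly True."""
    return [row for row in rows if predicate(row) is True]


values = [1, None, 2, None]

# WHERE v = 1 keeps only the matching non-null row.
equal_to_one = sql_where(values, lambda v: sql_eq(v, 1))

# WHERE NOT (v = 1) silently skips the NULL rows as well.
not_equal_to_one = sql_where(values, lambda v: sql_not(sql_eq(v, 1)))

# WHERE v <=> NULL is the null-safe way to find the NULL rows.
null_rows = sql_where(values, lambda v: sql_null_safe_eq(v, None))

print(equal_to_one, not_equal_to_one, null_rows)
```

Note that the two ordinary filters together return only two of the four rows: the NULL rows match neither `v = 1` nor `NOT (v = 1)`.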

Introducing Amazon S3 shuffle in AWS Glue — AWS Big Data Blog.

Good explanations of how shuffle works, how it uses local disks, how to track it in the AWS Glue UI, what problems can arise, and how to avoid them.

Best practices for caching in Spark SQL — David Vrba @ Towards Data Science Blog.

A great, easy-to-follow article about caching, with under-the-hood explanations and examples. Hopefully using the cache will now be much easier!

On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies — Airbnb Engineering Blog.

Millions of small files in HDFS, or almost everything you want to know about Spark partitioning. At least for a start :)

Higher-Order Functions with Spark 3.1 — David Vrba @ Towards Data Science Blog.

New functions for manipulating arrays have been released. Check them out: maybe you can forget about UDFs.

Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming — Tathagata Das @ Databricks Blog.

If you use Spark Structured Streaming and are interested in handling late-arriving data, this article gives you a practical guide to which windowing strategy to use and how watermarks can help.
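
To make the watermark idea concrete, here is a minimal pure-Python simulation of tumbling event-time windows with a watermark. It is a simplified sketch, not Spark code: the function name and parameters are hypothetical, and the watermark here advances after every event, whereas Structured Streaming updates it between micro-batches.

```python
from collections import defaultdict


def count_with_watermark(event_times, window_size, allowed_lateness):
    """Count events per tumbling window, dropping events behind the watermark.

    event_times are event timestamps (e.g. seconds) listed in ARRIVAL order,
    so a smaller value after a larger one is a late-arriving event.
    """
    counts = defaultdict(int)   # window start -> number of accepted events
    dropped = []                # events whose window was already finalized
    max_seen = float("-inf")    # largest event time observed so far

    for t in event_times:
        # The watermark trails the newest event time by the allowed lateness.
        watermark = max_seen - allowed_lateness
        window_start = (t // window_size) * window_size
        window_end = window_start + window_size
        if window_end <= watermark:
            # The window closed before this event arrived: too late.
            dropped.append(t)
        else:
            counts[window_start] += 1
        max_seen = max(max_seen, t)

    return dict(counts), dropped


counts, dropped = count_with_watermark(
    event_times=[1, 4, 12, 3, 27, 8, 21],
    window_size=10,
    allowed_lateness=10,
)
print(counts)   # {0: 3, 10: 1, 20: 2}
print(dropped)  # [8]
```

The event at time 3 arrives late but is still counted, because the watermark has not yet passed the end of its window. The event at time 8 arrives after an event at time 27 has pushed the watermark to 17, so its window (0 to 10) is already closed and the event is dropped.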

Creating Pandas and Spark Compatible Functions with Fugue — Kevin Kho @ Towards Data Science.

Everybody knows about Apache Arrow, which aims to create an efficient in-memory storage format for interaction between different libraries/frameworks. But what if there were such a universal format not for storage, but for functions? There is one: Fugue.

Using Distributed Computing for Neuroimaging — Dr. Alessandro Crimi @ Towards Data Science.

Did you ever think that data engineering could literally save lives? Well, it turns out it can. A real-world example.