Articles tagged with spark

Spark SQL Query Engine Deep Dive – Adaptive Query Execution: part I & part II – Linxiao Ma

It’s an absolutely brilliant blog. I adore it. These articles explain Spark Adaptive Query Execution in great depth.

level:advanced topic:spark


Spark Connect Available in Apache Spark 3.4 — Databricks Blog

Just to keep up with new Spark features. Give it up for Spark Connect.

level:medium topic:spark


Apache Spark — Job monitoring — Hareesha Dandamudi

A short and clear article about SparkListener. It is quite useful for building monitoring and lineage tracking (as OpenLineage does). Use it as a starting point for your own research :)

level:medium topic:spark


Upgrading Data Warehouse Infrastructure at Airbnb — Ronnie Zhu @ The Airbnb Tech Blog

Airbnb decided to upgrade Spark and start using Iceberg. The article covers the motivation, case studies, and tuning experience.

level:medium topic:architecture topic:iceberg topic:spark


Apache Spark Performance Boosting — Halil Ertan @ Towards Data Science.

We understand that it’s just a few Spark optimization tricks, but it’s a nice way to start.

level:medium topic:spark


Building data platform in PySpark. Part 1. Python and Scala interop — Sergey Ivanychev @ Joom Blog.

Why and how to use Scala in PySpark.

level:medium topic:spark


Dealing with null in Spark — Matthew Powers @ MungingData Blog.

The Ultimate Question of Life, the Universe, and Everything: What is NULL?

Thanks to @xnegxneg for this topic.

level:medium topic:spark


Introducing Amazon S3 shuffle in AWS Glue — AWS Big Data Blog.

A good explanation of how shuffle works, how it uses local disks, how to track it in the AWS Glue UI, what problems can arise, and how to avoid them.

level:medium topic:architecture topic:spark


Best practices for caching in Spark SQL — David Vrba @ Towards Data Science Blog.

A great, easy-to-follow article about caching, with under-the-hood explanations and examples. Hopefully, using the cache will now be much easier!

level:medium topic:spark


On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies — Airbnb Engineering Blog.

Millions of small files in HDFS, or almost everything you want to know about Spark partitioning. At least for a start :)

level:medium topic:hive topic:spark


Higher-Order Functions with Spark 3.1 — David Vrba @ Towards Data Science Blog.

New functions for manipulating arrays have been released. Check them out: maybe you can forget about UDFs.

level:medium topic:spark


Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming — Tathagata Das @ Databricks Blog.

If you use Spark Structured Streaming and are interested in handling late-arriving data, this article gives you a practical guide to which windowing strategy to use and how watermarks can help.

level:medium topic:streaming topic:spark topic:late-arriving-data


Creating Pandas and Spark Compatible Functions with Fugue — Kevin Kho @ Towards Data Science.

Everybody knows about Apache Arrow, which aims to provide an efficient in-memory storage format for interoperability between different libraries and frameworks. But what if there were such a universal format not for storage, but for functions? There is one: Fugue.

level:advanced topic:fugue topic:pandas topic:spark


Using Distributed Computing for Neuroimaging — Dr. Alessandro Crimi @ Towards Data Science.

Have you ever thought that data engineering could literally save lives? Well, it turns out it can. Here is a real-world example.

level:medium topic:spark