#30. Go to Data Engineering

Topics: Databases, data thoughts, practices, Apache Spark, streaming

Concurrency is not Parallelism by Rob Pike (slides) — Rob Pike

Rob Pike is a software engineer, best known for his work on the Go programming language. This is a fairly old but still relevant talk on the crucial distinction between concurrency and parallelism. Foundational concepts like these cut across disciplines, and I would say they are must-have knowledge for all engineers.
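A tiny sketch of the distinction, not taken from the talk (the talk uses Go; this uses Python's asyncio, and the names `worker` and `run_interleaved` are made up for illustration). Two tasks take turns on a single thread, which is concurrency without any parallelism:

```python
import asyncio
import threading


async def worker(name, steps, log, threads):
    """Do `steps` units of work, yielding control after each one."""
    for i in range(steps):
        log.append(f"{name}{i}")
        threads.add(threading.get_ident())
        # Yield to the event loop so the other task can make progress.
        await asyncio.sleep(0)


async def main(steps):
    log, threads = [], set()
    # Both workers are "in flight" at once, but they never run simultaneously.
    await asyncio.gather(
        worker("a", steps, log, threads),
        worker("b", steps, log, threads),
    )
    return log, threads


def run_interleaved(steps=3):
    """Return the interleaved work log and the set of thread ids used."""
    return asyncio.run(main(steps))


if __name__ == "__main__":
    log, threads = run_interleaved()
    print(log)           # work from "a" and "b" is interleaved
    print(len(threads))  # number of OS threads that did the work
```

Running the same workers in parallel would need more than one core actually executing at the same time (for example, separate processes). The program structure above would stay the same; only the execution would change.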

Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi — Uber Engineering Blog

As usual, Uber’s blog is a fantastic resource for new ideas. This recent article explains how to build a streaming ETL framework and how to address some well-known issues along the way.

Apache Spark — Job monitoring — Hareesha Dandamudi

A short and understandable article about SparkListener. It is quite useful for building monitoring and lineage tracking (as OpenLineage does). Use it as a starting point for your own research :)

The Question That Every Data Engineer Should Ask — Xinran Waibel @ Data Engineer Things Blog

Vital and clickbait.

Pushdown — Trino Query optimizer docs

Pushdown is a powerful query optimization that moves work closer to the data. Predicate pushdown, for example, moves predicates in the WHERE clause closer to the tables they refer to. These are the Trino docs, but you can go through all the pushdown types and find them in other query engines too. Knowing them helps a lot when reading query plans.
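To make the idea concrete, here is a toy sketch of predicate pushdown in plain Python. It is not Trino code: `ORDERS`, `scan`, and the two query functions are hypothetical stand-ins for a table, a connector scan, and a query plan. Both plans return the same rows, but the pushed-down one moves fewer rows out of the scan:

```python
ORDERS = [
    {"id": 1, "country": "DE", "amount": 40},
    {"id": 2, "country": "US", "amount": 15},
    {"id": 3, "country": "DE", "amount": 75},
    {"id": 4, "country": "FR", "amount": 20},
]


def scan(table, predicate=None):
    """Toy table scan: return the rows and how many rows left the scan."""
    rows = [row for row in table if predicate is None or predicate(row)]
    return rows, len(rows)


def is_german(row):
    """Stand-in for: WHERE country = 'DE'."""
    return row["country"] == "DE"


def query_without_pushdown(table):
    # The scan hands over every row; the engine filters afterwards.
    rows, transferred = scan(table)
    result = [row for row in rows if is_german(row)]
    return result, transferred


def query_with_pushdown(table):
    # The predicate is evaluated inside the scan, next to the data.
    result, transferred = scan(table, predicate=is_german)
    return result, transferred


if __name__ == "__main__":
    print(query_without_pushdown(ORDERS))
    print(query_with_pushdown(ORDERS))
```

In a real engine the "rows transferred" number is what you look for in the query plan: a filter that sits inside the table scan, rather than above it, was pushed down.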

Written on April 1, 2023