#30. Go to Data Engineering

Topics: Databases, data thoughts, practices, Apache Spark, streaming


Concurrency is not Parallelism by Rob Pike (slides) — Rob Pike

Rob Pike is a software engineer, best known for his work on the Go programming language. This is a fairly old but still relevant talk on the crucial distinctions between concurrency and parallelism. Such foundational engineering concepts cut across disciplines and, I would say, are must-have knowledge for all engineers.

level:medium topic:practices type:video
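
To make the distinction concrete, here is a minimal sketch in Python (not Go, and not taken from the talk). It assumes `asyncio` as a stand-in for goroutines; the names `worker`, `run_concurrently` and `demo` are made up for illustration. Two tasks make progress in interleaved steps, yet everything runs on a single thread, so the program is concurrent without being parallel.

```python
import asyncio
import threading


async def worker(name, steps, log):
    """Record each step together with the id of the thread that ran it."""
    for i in range(steps):
        log.append((name, i, threading.get_ident()))
        # Hand control back to the event loop so the other task can run.
        await asyncio.sleep(0)


async def run_concurrently():
    log = []
    # Both workers are "in flight" at the same time (concurrency) ...
    await asyncio.gather(worker("a", 3, log), worker("b", 3, log))
    return log


def demo():
    # ... but the event loop executes them on one thread (no parallelism).
    return asyncio.run(run_concurrently())


if __name__ == "__main__":
    for name, step, thread_id in demo():
        print(name, step, thread_id)
```

Running the same workers in parallel would need more than one executor, for example separate processes; the structure of the program (independent tasks) is what makes that possible, which is roughly the point of the talk.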


Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi — Uber Engineering Blog

As usual, Uber’s blog is a fantastic resource for new ideas. This fresh article explains how to build a streaming ETL framework and how to address some well-known issues along the way.

level:medium topic:streaming


Apache Spark — Job monitoring — Hareesha Dandamudi

A short, easy-to-follow article about SparkListener, which is quite useful for building monitoring and lineage tracking (as OpenLineage does). Use it as a starting point for your own research :)

level:medium topic:spark


The Question That Every Data Engineer Should Ask — Xinran Waibel @ Data Engineer Things Blog

Vital and clickbait.

level:medium topic:data-thoughts


Pushdown — Trino Query optimizer docs

Pushdown is a powerful query optimization that moves predicates in the WHERE clause closer to the tables they refer to. These are the Trino docs, but you can go through all the pushdown types and look for them in other query engines. Knowing them helps a lot when reading query plans.

level:advanced topic:databases
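
The idea behind predicate pushdown can be sketched in a few lines of plain Python. This is a toy illustration, not how Trino implements it: the `orders` and `customers` tables, the naive nested-loop `join`, and both function names are assumptions invented for the example. Filtering before the join gives the same rows as filtering after it, but the join has to look at fewer row pairs.

```python
def join(left, right, key):
    """Naive nested-loop inner join.

    Returns the joined rows and the number of row pairs compared.
    """
    rows = []
    comparisons = 0
    for l in left:
        for r in right:
            comparisons += 1
            if l[key] == r[key]:
                rows.append({**l, **r})
    return rows, comparisons


# Hypothetical toy tables.
orders = [{"customer_id": i % 4, "amount": 10 * i} for i in range(8)]
customers = [
    {"customer_id": i, "country": "DE" if i % 2 == 0 else "US"}
    for i in range(4)
]


def filter_after_join():
    # Plan without pushdown: join everything, then apply WHERE country = 'DE'.
    rows, comparisons = join(orders, customers, "customer_id")
    return [r for r in rows if r["country"] == "DE"], comparisons


def filter_pushed_down():
    # Plan with pushdown: apply the predicate to customers first, then join.
    de_customers = [c for c in customers if c["country"] == "DE"]
    return join(orders, de_customers, "customer_id")


if __name__ == "__main__":
    late_rows, late_cost = filter_after_join()
    early_rows, early_cost = filter_pushed_down()
    print(late_rows == early_rows, late_cost, early_cost)
```

In a real engine the gain is usually larger, since a pushed-down predicate can be evaluated by the connector or the underlying data source, so the filtered-out rows never reach the query engine at all.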


Written on April 1, 2023