#30. Go to Data Engineering
Topics: Databases, data thoughts, practices, Apache Spark, streaming
Concurrency is not Parallelism by Rob Pike (slides) — Rob Pike
Rob Pike is a software engineer best known for his work on the Go programming language. This is a fairly old but still relevant talk on the crucial distinctions between concurrency and parallelism. Foundational concepts like these cut across disciplines, and I would say they are must-have knowledge for all engineers.
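If you want a feel for the distinction before watching, here is a minimal sketch in Python (not taken from the talk, which uses Go). The `worker` and `main` names are made up for illustration. Two tasks make progress in interleaved steps, so they are concurrent, yet everything runs on a single thread, so nothing executes in parallel.

```python
import asyncio
import threading

events = []         # order in which the steps actually ran
thread_ids = set()  # every OS thread that executed a step


async def worker(name: str, steps: int) -> None:
    """Hypothetical task: record a step, then hand control back to the event loop."""
    for i in range(steps):
        events.append(f"{name}{i}")
        thread_ids.add(threading.get_ident())
        await asyncio.sleep(0)  # cooperative yield, lets the other task run


async def main() -> None:
    # Both workers are in flight at the same time (concurrency) ...
    await asyncio.gather(worker("a", 3), worker("b", 3))


asyncio.run(main())

print(events)           # steps of "a" and "b" are interleaved
print(len(thread_ids))  # ... but only one thread ever ran them (no parallelism)
```

Concurrency here is about how the program is structured, as independent tasks that can be interleaved. Parallelism would mean actually running those tasks at the same instant on several cores.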
Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi — Uber Engineering Blog
As usual, Uber’s blog is a fantastic resource for new ideas. This fresh article explains how to build a streaming ETL framework and how to address some well-known issues along the way.
Apache Spark — Job monitoring — Hareesha Dandamudi
A short and understandable article about SparkListener. It is quite useful for building monitoring and lineage handling (as OpenLineage does). Use it as a starting point for your own research :)
The Question That Every Data Engineer Should Ask — Xinran Waibel @ Data Engineer Things Blog
Vital and clickbait.
Pushdown — Trino Query optimizer docs
Pushdown is a powerful query optimization that moves predicates in the WHERE clause closer to the tables they refer to. These are the Trino docs, but you can go through all the pushdown types and find them in other query engines too. Knowing them helps a lot when reading query plans.
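To make the idea concrete, here is a toy sketch in plain Python. It is not Trino code: the `Scan` class, the `TABLE` data and the `is_german` predicate are all invented for illustration. The same filter is applied either above the scan or inside it. Both plans return the same rows, but the pushed-down one hands far fewer rows to the rest of the plan.

```python
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, object]

# Hypothetical table contents, for illustration only.
TABLE: List[Row] = [
    {"id": 1, "country": "DE", "amount": 10},
    {"id": 2, "country": "US", "amount": 25},
    {"id": 3, "country": "DE", "amount": 40},
    {"id": 4, "country": "FR", "amount": 5},
]


class Scan:
    """Toy table scan that counts the rows it passes to the next operator."""

    def __init__(self, rows: List[Row],
                 predicate: Optional[Callable[[Row], bool]] = None) -> None:
        self.rows = rows
        self.predicate = predicate  # set only when the filter is pushed down
        self.rows_emitted = 0

    def __iter__(self) -> Iterator[Row]:
        for row in self.rows:
            if self.predicate is None or self.predicate(row):
                self.rows_emitted += 1
                yield row


def is_german(row: Row) -> bool:
    """Stand-in for: WHERE country = 'DE'."""
    return row["country"] == "DE"


# Plan A: the filter sits above the scan, so the scan emits every row.
scan_a = Scan(TABLE)
result_a = [row for row in scan_a if is_german(row)]

# Plan B: the predicate is pushed into the scan itself.
scan_b = Scan(TABLE, predicate=is_german)
result_b = list(scan_b)

print(result_a == result_b)                      # same answer either way
print(scan_a.rows_emitted, scan_b.rows_emitted)  # rows leaving each scan
```

In a real engine the scan is usually a connector talking to a remote database or a file reader, so the saved rows translate into less I/O and less network traffic rather than fewer loop iterations.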