#19. Back to business
Topics: Architecture, Apache Druid, Apache Kafka, Presto, storage engine, streaming
Watermarks in Stream Processing Systems: Semantics and Comparative Analysis of Apache Flink and Google Cloud Dataflow — Tyler Akidau
A new white paper from Tyler Akidau (author of the very cool book "Streaming Systems") that compares several aspects of watermarks in Google Cloud Dataflow and Apache Flink. Watermarks represent the temporal completeness of an out-of-order data stream. Reasoning about the completeness of infinite streams is one of the most critical challenges faced by stream processing systems. Yet watermarks remain among the least understood and least adequately addressed approaches for dealing with the completeness of unbounded data streams.
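To make the idea of "temporal completeness" a bit more concrete, here is a minimal sketch of a heuristic watermark: the largest event time seen so far minus an assumed bound on how late events may arrive. The class name, its methods, and the 5-second bound are hypothetical and chosen for illustration only; this is not how either Dataflow or Flink implements watermarks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoundedLatenessWatermark:
    """Hypothetical helper: watermark = max event time seen - allowed lateness."""

    max_lateness_ms: int                      # assumed bound on out-of-orderness
    max_event_time_ms: Optional[int] = None   # largest event timestamp seen so far

    def observe(self, event_time_ms: int) -> None:
        # Track the largest event timestamp; out-of-order events never move it back.
        if self.max_event_time_ms is None or event_time_ms > self.max_event_time_ms:
            self.max_event_time_ms = event_time_ms

    def current(self) -> Optional[int]:
        # No events yet means we cannot claim anything about completeness.
        if self.max_event_time_ms is None:
            return None
        return self.max_event_time_ms - self.max_lateness_ms

    def is_late(self, event_time_ms: int) -> bool:
        # An event is "late" if it is older than the current watermark.
        watermark = self.current()
        return watermark is not None and event_time_ms < watermark


wm = BoundedLatenessWatermark(max_lateness_ms=5_000)
for ts in [10_000, 12_000, 11_000]:   # events arrive out of order
    wm.observe(ts)

print(wm.current())       # 7000
print(wm.is_late(6_000))  # True: older than the watermark
print(wm.is_late(8_000))  # False: still within the allowed lateness
```

The watermark here only ever moves forward, which is the property that lets a system decide that a window ending before the watermark is complete enough to emit.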
Designing Instagram — HighScalability Blog
Do you want to try designing Instagram with a Machine Learning Lead from Amazon? Well, now you can.
This article is a follow-up to DE or DIE Meetup #9 (in Russian).
Powering real-time data analytics with Druid at Twitter — Twitter Engineering Blog
At least now we know that Druid has out-of-the-box ingestion connectors for Apache Kafka, and they seem to work great! Just check out Twitter's streaming architecture.
Presto on Apache Kafka At Uber Scale — Uber Engineering Blog
We like Uber engineering posts so much because they read like ADRs (architecture decision records): the problem, a description of the current environment, the alternatives, and the proposed architecture.
Replicated Log — Unmesh Joshi
This article helps you understand what replication looks like. It has so much detail and so many code snippets that you could write your own implementation of a replicated log.
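As a taste of what such an implementation involves, here is a very small sketch of the leader side of a replicated log, assuming a quorum-based commit rule: an entry counts as committed once a majority of the cluster holds it. The class and method names are hypothetical and this is not the article's code; leader election, terms, persistence, and networking are all left out.

```python
class LeaderLogSketch:
    """Hypothetical leader-side log with a majority-based high-water mark."""

    def __init__(self, follower_ids):
        self.entries = []                                   # the leader's own log
        self.acked = {fid: 0 for fid in follower_ids}       # highest index each follower confirmed
        self.high_water_mark = 0                            # highest committed index (1-based)

    def append(self, command):
        # The leader writes locally first; followers replicate afterwards.
        self.entries.append(command)
        self._advance_high_water_mark()
        return len(self.entries)                            # 1-based index of the new entry

    def acknowledge(self, follower_id, index):
        # A follower reports the highest index it has stored.
        index = min(index, len(self.entries))
        self.acked[follower_id] = max(self.acked[follower_id], index)
        self._advance_high_water_mark()

    def _advance_high_water_mark(self):
        cluster_size = len(self.acked) + 1                  # followers plus the leader
        majority = cluster_size // 2 + 1
        # Sort the replicated indexes (leader included) from highest to lowest;
        # the value at position majority - 1 is held by at least a majority of nodes.
        replicated = sorted([len(self.entries), *self.acked.values()], reverse=True)
        self.high_water_mark = max(self.high_water_mark, replicated[majority - 1])

    def committed(self):
        return self.entries[: self.high_water_mark]


log = LeaderLogSketch(["f1", "f2"])
log.append("set x=1")
print(log.committed())    # [] because only the leader has the entry
log.acknowledge("f1", 1)
print(log.committed())    # ['set x=1'] because two of three nodes have it
```

Only committed entries are safe to expose to clients, since an entry stored on a minority of nodes can still be lost if the leader fails.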
Upcoming Conferences
- Airflow Summit, May 23-27, https://airflowsummit.org/
- Snowflake Summit, June 13-16, https://www.snowflake.com/summit
- Data and AI Summit, June 27-30, https://databricks.com/dataaisummit/north-america-2022