#19. Back to business

Topics: Architecture, Apache Druid, Apache Kafka, Presto, storage engine, streaming

Watermarks in Stream Processing Systems: Semantics and Comparative Analysis of Apache Flink and Google Cloud Dataflow — Tyler Akidau

A new white paper from Tyler Akidau (author of the very cool book "Streaming Systems") that compares watermarks in Google Cloud Dataflow and Apache Flink from several angles. Watermarks represent the temporal completeness of an out-of-order data stream. Reasoning about the completeness of infinite streams is one of the most critical challenges faced by stream processing systems. It is also one of the least understood and least adequately addressed challenges, compared with other approaches for dealing with the completeness of unbounded data streams.
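To make "temporal completeness" a bit more tangible, here is a minimal sketch of one common heuristic: the watermark trails the largest event time seen so far by a fixed bound, and a tumbling window is emitted once the watermark passes its end. This is not code from the paper, and it is not how Flink or Dataflow implement watermarks internally. The names `BoundedOutOfOrdernessWatermark` and `tumbling_window_counts`, and the integer timestamps, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class BoundedOutOfOrdernessWatermark:
    """Heuristic watermark: largest event time seen minus a fixed bound."""
    max_out_of_orderness: int            # assumed bound, same unit as event time
    max_event_time: Optional[int] = None

    def observe(self, event_time: int) -> None:
        # Only a new maximum can move the watermark forward.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    def current(self) -> Optional[int]:
        # "We believe no event with a timestamp <= this value is still to come."
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_out_of_orderness


def tumbling_window_counts(
    event_times: Iterable[int], window_size: int, max_out_of_orderness: int
) -> Tuple[List[Tuple[int, int]], List[int], Dict[int, int]]:
    """Count events per window [start, start + window_size).

    Returns (closed windows as (start, count), late event times,
    windows still open when the input ends).
    """
    wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness)
    open_windows: Dict[int, int] = {}
    closed: List[Tuple[int, int]] = []
    late: List[int] = []

    for t in event_times:
        wm.observe(t)
        watermark = wm.current()
        start = t - t % window_size

        if start + window_size <= watermark:
            # The window was already declared complete: the event is late.
            late.append(t)
            continue

        open_windows[start] = open_windows.get(start, 0) + 1

        # Emit every open window whose end the watermark has reached.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                closed.append((s, open_windows.pop(s)))

    return closed, late, open_windows
```

With `tumbling_window_counts([1, 4, 2, 11, 7, 15, 3, 21], window_size=10, max_out_of_orderness=5)`, the window starting at 0 closes with 4 events once event 15 pushes the watermark to 10. The straggler with timestamp 3 then arrives too late and is reported separately, and the windows starting at 10 and 20 are still open when the input ends. A larger bound trades latency for fewer late events.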

Designing Instagram — HighScalability Blog

Do you want to try designing Instagram with a Machine Learning Lead from Amazon? Well, now you can.
This article is a follow-up to DE or DIE Meetup #9 (in Russian).

Powering real-time data analytics with Druid at Twitter — Twitter Engineering Blog

At least now we know that Druid has out-of-the-box ingestion connectors for Apache Kafka, and they seem to work great! Just check out Twitter's streaming architecture.

Presto on Apache Kafka At Uber Scale — Uber Engineering Blog

We like Uber engineering posts so much because they read like ADRs (architecture decision records): the problem, a description of the current environment, the alternatives, and the proposed architecture.

Replicated Log — Unmesh Joshi

This article helps you understand what replication looks like. With plenty of detail and code snippets, it gives you enough to write your own implementation of a replicated log.
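For a rough feel of the idea before reading, here is a toy in-memory sketch. It is not the article's code. A leader appends each entry to its own log, pushes it to its followers, and treats the entry as committed once a majority of the cluster holds it. The names `Replica`, `Leader`, `propose`, and the `reachable` flag are made up for illustration, and the sketch leaves out leader election, terms, persistence, and follower catch-up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    name: str
    log: List[str] = field(default_factory=list)
    reachable: bool = True  # hypothetical switch to simulate a failed follower

    def append(self, index: int, entry: str) -> bool:
        """Accept the entry only if it lands at the next free log position."""
        if not self.reachable or index != len(self.log):
            return False
        self.log.append(entry)
        return True


class Leader:
    def __init__(self, followers: List[Replica]) -> None:
        self.log: List[str] = []
        self.followers = followers
        self.commit_index = -1  # index of the last entry held by a majority

    def propose(self, entry: str) -> bool:
        """Append locally, replicate, and commit on a majority of acks."""
        index = len(self.log)
        self.log.append(entry)

        acks = 1  # the leader's own copy counts
        for follower in self.followers:
            if follower.append(index, entry):
                acks += 1

        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:
            self.commit_index = index
            return True
        # The entry stays in the leader's log but is not committed.
        return False
```

In a three-node cluster with one follower down, `propose` still succeeds because two of three copies exist. With both followers down it returns `False` and `commit_index` does not move.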

Upcoming Conferences

Airflow Summit, May 23-27, https://airflowsummit.org/
Snowflake Summit, June 13-16, https://www.snowflake.com/summit
Data and AI Summit, June 27-30, https://databricks.com/dataaisummit/north-america-2022

Written on May 20, 2022