#5. Big data with a toy database

Topics: Analytics, data quality, Debezium, Apache Kafka, late arriving data, Apache Spark, streaming, SQLite.

Data Quality Roadmap Part I and Data Quality Roadmap Part II — Alexander Eliseev @ Medium.

Contrary to what most of us usually do (just writing tests and measuring run time), the folks from Wrike are building a comprehensive matrix of what data quality is, how to achieve it, and how to measure it. More than that, they’re discussing what could go wrong without these metrics! (Anything, of course.)

Error Handling Patterns for Apache Kafka Applications — Gerardo Villeda @ Confluent Blog.

Do you handle errors in your Kafka applications at all? Do you know the best practices? This is a great article about typical approaches to dealing with messages that fail processing. With our favorite Confluent pictures!

Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming — Tathagata Das @ Databricks Blog.

If you use Spark Structured Streaming and are interested in handling late-arriving data, this article gives you a practical look at which windowing strategy to use and how watermarks can help you.

A Visual Introduction to Debezium — Dunith Dhanushka @ Medium.

If you are not using Debezium yet but want to start, read this article or just look at the pictures :)

SQLite is not a toy database — Anton Zhiyanov.

Are you surprised? Did you know you can explore CSV files with SQLite? Load JSON and CSV data, call analytical functions, and even define UDFs? It’s a real analytical engine, not a toy DB.
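To get a feel for it, here is a minimal sketch using Python's built-in `sqlite3` module. It is not taken from the article: the sample data, the `cities` table, and the `growth_label` function are all made up for illustration, and the window function assumes the bundled SQLite is 3.25 or newer.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real CSV file.
CSV_TEXT = """city,year,population
Berlin,2019,3645
Berlin,2020,3664
Paris,2019,2165
Paris,2020,2148
"""


def load_csv(conn, text):
    """Parse CSV text and load it into a (hypothetical) `cities` table."""
    rows = [
        (r["city"], int(r["year"]), int(r["population"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.execute(
        "CREATE TABLE cities (city TEXT, year INTEGER, population INTEGER)"
    )
    conn.executemany("INSERT INTO cities VALUES (?, ?, ?)", rows)


def growth_label(delta):
    """A Python function exposed to SQL as a scalar UDF."""
    if delta is None:
        return "n/a"
    return "up" if delta > 0 else "down"


def analyze():
    conn = sqlite3.connect(":memory:")
    load_csv(conn, CSV_TEXT)
    # Register the UDF under the SQL name growth_label, taking 1 argument.
    conn.create_function("growth_label", 1, growth_label)
    query = """
        WITH deltas AS (
            SELECT city,
                   year,
                   -- LAG is a window (analytical) function: previous row per city
                   population - LAG(population) OVER (
                       PARTITION BY city ORDER BY year
                   ) AS delta
            FROM cities
        )
        SELECT city, year, delta, growth_label(delta)
        FROM deltas
        ORDER BY city, year
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in analyze():
        print(row)
```

The CSV is loaded through Python here to keep the example self-contained; the `sqlite3` command-line shell can also import CSV files directly with `.import` in CSV mode.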

Written on June 18, 2021