Articles tagged with storage-engine

Dynamic Filtering: a Critical Performance Optimization in Analytical Engines — Vladimir Ozerov @ Querify Labs Blog

Let’s continue exploring query engine optimization techniques, as explained by the developers of those engines. This time the topic is dynamic filtering.

Distinct aggregation optimization in Apache Calcite and Trino — Querify Labs Blog

A good, detailed description of how DISTINCT aggregation is implemented in the Calcite and Trino engines.

Improve federated queries with predicate pushdown in Amazon Athena — AWS Big Data Blog

Let’s talk about query optimization in Athena, in particular predicate pushdown in federated queries against different databases.

Mussel — Airbnb’s Key-Value Store for Derived Data — The Airbnb Tech Blog

It’s Airbnb’s turn to build its own database. Meet Mussel, a persistent, highly available, low-latency key-value storage engine for accessing derived data from offline and streaming events.

A Zero-Rename Committer: Object-storage as a Destination for Apache Hadoop and Spark — Steve Loughran, Ryan Blue, Sanjay Radia, Thomas Demoor

Have you heard that S3 didn’t deliver the safe and performant operations that file committers expect? This paper goes into great detail on how Spark jobs can safely use it as a destination for their work. You will also learn exactly how performance and correctness were achieved in S3A.

Introduction to the Join Ordering Problem — Alexey Goncharuk @ Querify Labs Blog

Query optimization details from Querify Labs. With lots of pictures!

Replicated Log — Unmesh Joshi

This article helps you understand what replication looks like. With its many details and code snippets, you could even write your own implementation of a replicated log.

What Every Programmer has to know about Database Storage — Alex Petrov

In the world of Big Data, it’s important to know how database storage works so that you can pick the right tool for the job. The talk covers techniques for evaluating storage engines, so you can choose one with the best read performance, the best write performance, or the best fit for your data.

Optimizing Distributed Joins: The Case of Google Cloud Spanner and DataStax Astra DB — Artem Chebotko @ DataStax Blog

Shuffle join, broadcast join, co-located join, pre-computed join, and so on: which one is better? See how Google Cloud Spanner and DataStax Astra DB optimize distributed joins.

Distributed Query Engines vs. Data Lake Engines — Patrick Pichler

What is the core difference between Presto, Dremio, Impala, and DBLink, and how can it affect the architecture?

What is Cost-based Optimization? — Alexey Goncharuk @ Querify Labs Blog

An answer to a popular question: what are the mythical units of query plan cost?