DEV Community

Cover image for Day 26 of 100 Days of ClickHouse: Understanding Distributed Tables
Kanishga Subramani
Kanishga Subramani

Posted on

Day 26 of 100 Days of ClickHouse: Understanding Distributed Tables

As your data grows, a single ClickHouse server may eventually reach its limits. Whether it's storage, ingestion speed, or query performance, there comes a point where scaling vertically is no longer enough.

That's where Distributed Tables come in.

Distributed tables allow ClickHouse to scale horizontally across multiple servers while providing a single logical interface for querying your data. Instead of worrying about which server stores which records, you query one table, and ClickHouse takes care of the rest.

In this article, we'll explore what Distributed tables are, how they work, and why they're a key component of large-scale ClickHouse deployments.


Why Do We Need Distributed Tables?

ClickHouse is incredibly fast, and a single server can handle billions of rows. But as applications grow, organizations often need to:

  • Store more data than a single machine can hold
  • Increase data ingestion throughput
  • Speed up query execution
  • Improve fault tolerance
  • Scale across multiple servers

Once data is spread across several nodes, querying each server individually becomes impractical.

Distributed tables solve this problem by providing a single entry point for the entire cluster.


What Is a Distributed Table?

A Distributed table is a logical table that sits on top of existing local tables.

Unlike MergeTree engines, it does not store any data.

Instead, it forwards queries to the appropriate servers, collects the results, merges them, and returns a unified response.

For example:

CREATE TABLE events_distributed
AS events_local
ENGINE = Distributed(
    analytics_cluster,
    analytics,
    events_local,
    cityHash64(user_id)
);
Enter fullscreen mode Exit fullscreen mode

Here:

  • analytics_cluster is the cluster configuration.
  • analytics is the database name.
  • events_local is the table that actually stores the data.
  • cityHash64(user_id) determines how rows are distributed across shards.

The important thing to remember is that the Distributed table contains no data of its own.


Local Tables vs Distributed Tables

Every server in the cluster stores data inside its own local table.

Shard 1
└── events_local

Shard 2
└── events_local

Shard 3
└── events_local
Enter fullscreen mode Exit fullscreen mode

The Distributed table simply sits above them.

events_distributed
        │
 ┌──────┼──────┐
 │      │      │
Shard1 Shard2 Shard3
Enter fullscreen mode Exit fullscreen mode

When you run:

SELECT *
FROM events_distributed;
Enter fullscreen mode Exit fullscreen mode

ClickHouse automatically sends the query to the required shards, gathers the partial results, merges them, and returns the final output.

From the application's perspective, it feels like querying a single table.


Understanding Sharding

Distributed tables are commonly used with sharding.

Sharding means splitting data across multiple servers.

For example:

User 1 → Shard 1
User 2 → Shard 2
User 3 → Shard 3
Enter fullscreen mode Exit fullscreen mode

The server that stores each row is determined by a sharding key.

A common choice is:

cityHash64(user_id)
Enter fullscreen mode Exit fullscreen mode

A good sharding key helps:

  • Balance storage across servers
  • Distribute query workload
  • Prevent bottlenecks
  • Improve scalability

Choosing the right sharding key is one of the most important design decisions in a distributed ClickHouse cluster.


How Distributed Queries Work

Suppose you execute:

SELECT
    country,
    count()
FROM events_distributed
GROUP BY country;
Enter fullscreen mode Exit fullscreen mode

ClickHouse performs several steps automatically:

  1. Sends the query to each participating shard.
  2. Executes the query locally on every server.
  3. Collects partial results.
  4. Merges those results.
  5. Returns the final aggregated response.

Because every shard processes its own data simultaneously, queries can scale efficiently as your cluster grows.


Distributed Inserts

Distributed tables can also be used for inserting data.

INSERT INTO events_distributed
VALUES (...);
Enter fullscreen mode Exit fullscreen mode

Instead of the client deciding which server should receive each row, ClickHouse evaluates the sharding key and forwards the data to the appropriate shard automatically.

This greatly simplifies application development.


Sharding vs Replication

Many beginners confuse sharding with replication, but they solve different problems.

Sharding

  • Splits data across multiple servers
  • Improves scalability
  • Increases storage capacity
  • Distributes workload

Replication

  • Creates copies of data
  • Provides high availability
  • Protects against server failures

A production ClickHouse cluster usually combines both techniques.

Shard 1
├── Replica A
└── Replica B

Shard 2
├── Replica A
└── Replica B
Enter fullscreen mode Exit fullscreen mode

The Distributed table then acts as the single entry point for the cluster.


Best Practices

When working with Distributed tables:

  • Design your sharding key carefully.
  • Keep local table schemas identical across all nodes.
  • Filter data early to reduce network traffic.
  • Minimize unnecessary cross-shard queries.
  • Monitor shard balance and cluster health regularly.
  • Test your data distribution before moving to production.

Common Mistakes

Here are a few common misconceptions:

Distributed tables store data

They don't. Data is stored only in the local tables.

Replication and distribution are the same

Replication improves availability.

Distribution improves scalability.

Any sharding key will work

A poor sharding key can create uneven data distribution, overloaded servers, and slower queries.

Always choose a key that aligns with your application's query patterns.


Final Thoughts

Distributed tables are one of the core building blocks that make ClickHouse capable of handling massive analytical workloads.

They provide a simple abstraction that lets applications interact with a single logical table while ClickHouse manages query routing, distributed execution, and result aggregation behind the scenes.

If you're planning to build large-scale analytics platforms with ClickHouse, understanding the relationship between local tables, Distributed tables, sharding, and replication is essential.

Once these concepts click, scaling ClickHouse across multiple servers becomes much easier to understand and implement.

Thanks for reading! If you're following along with my 100 Days of ClickHouse series, stay tuned for Day 27, where we'll continue exploring another key ClickHouse concept.

Top comments (0)