Kanishga Subramani

Posted on Jun 20

Day 27 of 100 Days of ClickHouse® - Optimizing ClickHouse® Queries for Faster Execution

#clickhouse #devops #analytics #database

Introduction

As datasets grow larger and analytical workloads become more demanding, query performance becomes increasingly important. While ClickHouse® is built for lightning-fast analytical processing, poorly optimized schemas and inefficient queries can still waste CPU cycles, consume excessive memory, and increase execution times.

The good news is that ClickHouse® provides numerous features to help you optimize both your data model and query patterns. From choosing the right ORDER BY key to using skip indexes, projections, materialized views, and efficient filtering techniques, a few design decisions can dramatically improve performance.

In this article, we'll explore practical techniques for optimizing ClickHouse® queries so your analytical workloads remain fast and scalable—even when working with billions of rows.

Why Query Optimization Matters

ClickHouse® is a column-oriented analytical database that processes data differently from traditional row-based databases.

It achieves exceptional performance through:

Columnar storage
Vectorized query execution
Parallel processing
Data compression
Sparse primary indexes
Intelligent data skipping

However, these optimizations are only effective when tables are designed correctly and queries take advantage of them.

Poorly optimized queries typically result in:

Increased query latency
Higher CPU utilization
Excessive memory consumption
Longer dashboard loading times
Reduced throughput
Higher infrastructure costs

Understanding how ClickHouse® reads and skips data is the foundation of writing efficient queries.

1. Choose a Good ORDER BY Key

The ORDER BY clause in MergeTree tables defines the physical sort order of data on disk.

Unlike many databases, this is far more than just a sorting preference—it determines how efficiently ClickHouse® can locate data using its sparse primary index.

A well-designed ORDER BY key allows ClickHouse® to skip large portions of data during query execution.

CREATE TABLE events
(
    user_id UInt32,
    event_type LowCardinality(String),
    timestamp DateTime,
    value Float64
)
ENGINE = MergeTree()
ORDER BY (user_id, event_type, timestamp);

Now queries filtering on these columns become extremely efficient.

SELECT avg(value)
FROM events
WHERE user_id = 1001
  AND event_type = 'purchase'
  AND timestamp >= now() - INTERVAL 7 DAY;

Best Practices

Put frequently filtered columns first.
Place high-selectivity columns earlier.
Avoid random or highly unique first columns unless appropriate.
Design ORDER BY around your most common queries.

2. Partition Large Tables

Partitioning divides data into separate physical parts.

When a query filters on the partition key, ClickHouse® skips entire partitions without reading them.

CREATE TABLE events
(
    user_id UInt32,
    event_type LowCardinality(String),
    timestamp DateTime,
    value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, event_type, timestamp);

Query:

SELECT count()
FROM events
WHERE timestamp >= now() - INTERVAL 30 DAY;

Instead of scanning years of data, ClickHouse® only reads recent monthly partitions.

Common Partition Keys

toYYYYMM(timestamp)
toYYYYMMDD(timestamp)
Region
Tenant ID (for multi-tenant systems)

Avoid over-partitioning, as too many small partitions can negatively impact performance.

3. Avoid SELECT *

ClickHouse® only reads the columns referenced in your query.

Using SELECT * forces the database to read and decompress every column, increasing I/O and CPU usage.

Slow:

SELECT *
FROM events
WHERE user_id = 1001;

Better:

SELECT
    user_id,
    event_type,
    value
FROM events
WHERE user_id = 1001;

Always retrieve only the columns your application actually needs.

4. Use LowCardinality for String Columns

Columns with relatively few unique values should use LowCardinality.

Examples include:

status
country
region
event_type
device_type

Instead of:

event_type String

Use:

event_type LowCardinality(String)

Benefits include:

Smaller storage
Better compression
Faster filtering
Faster GROUP BY
Reduced memory usage

As a rule of thumb, use LowCardinality for string columns with fewer than roughly 10,000 distinct values.

5. Filter on ORDER BY Columns First

ClickHouse® can only use its sparse primary index efficiently when filtering on the leading columns of the ORDER BY key.

Given:

ORDER BY (user_id, event_type, timestamp)

Fast:

SELECT count()
FROM events
WHERE user_id = 1001
  AND event_type = 'purchase';

Slow:

SELECT count()
FROM events
WHERE value > 100;

If you frequently filter on non-indexed columns, consider adding a skip index.

6. Use Skip Indexes

Skip indexes help ClickHouse® skip granules when filtering on columns outside the primary key.

ALTER TABLE events
ADD INDEX idx_value value TYPE minmax GRANULARITY 4;

Materialize it:

ALTER TABLE events
MATERIALIZE INDEX idx_value;

Now:

SELECT count()
FROM events
WHERE value > 500;

ClickHouse® skips granules whose minimum and maximum values fall outside the requested range.

Common skip index types include:

minmax
set
bloom_filter
tokenbf_v1

7. Avoid Functions on Filtered Columns

Applying functions to indexed columns prevents ClickHouse® from using the primary index efficiently.

Avoid:

SELECT count()
FROM events
WHERE toYear(timestamp) = 2024;

Instead use range filters:

SELECT count()
FROM events
WHERE timestamp >= '2024-01-01'
  AND timestamp < '2025-01-01';

Range conditions allow ClickHouse® to leverage index skipping effectively.

8. Use PREWHERE

PREWHERE is a ClickHouse® optimization that filters rows before loading all requested columns.

SELECT
    user_id,
    value
FROM events
PREWHERE value > 100
WHERE user_id = 1001;

Benefits:

Reads fewer columns initially
Reduces disk I/O
Lowers memory consumption
Speeds up wide-table queries

Although ClickHouse® automatically applies PREWHERE in many cases, specifying it manually gives you greater control.

9. Use SAMPLE for Exploratory Queries

For approximate analytics, SAMPLE dramatically reduces execution time.

SELECT avg(value)
FROM events
SAMPLE 0.1
WHERE event_type = 'purchase';

This reads only 10% of the data while producing statistically representative results.

Remember that SAMPLE requires a table created with a SAMPLE BY clause.

10. Insert Data in Batches

Large batch inserts are far more efficient than numerous single-row inserts.

Slow:

INSERT INTO events VALUES
(1001,'purchase',now(),99.99);

INSERT INTO events VALUES
(1002,'view',now(),0.0);

Better:

INSERT INTO events VALUES
(1001,'purchase',now(),99.99),
(1002,'view',now(),0.0),
(1003,'click',now(),15.20);

Batch inserts reduce merge overhead and improve ingestion throughput.

11. Avoid Expensive JOINs

Although JOIN performance has improved significantly, joins remain expensive compared to scanning a single table.

Whenever possible:

Denormalize frequently queried data.
Join only after filtering.
Join smaller tables to larger tables.
Consider Dictionaries for dimension lookups.

Instead of:

SELECT *
FROM orders o
JOIN customers c
ON o.customer_id = c.id;

Filter first:

SELECT *
FROM
(
    SELECT *
    FROM orders
    WHERE order_date >= today() - 7
) o
JOIN customers c
ON o.customer_id = c.id;

Reducing rows before the JOIN often leads to substantial performance improvements.

12. Choose the Right MergeTree Engine

Different workloads benefit from different MergeTree engines.

Examples include:

MergeTree → General-purpose analytics
ReplacingMergeTree → Deduplication
SummingMergeTree → Incremental aggregation
AggregatingMergeTree → Precomputed aggregates
CollapsingMergeTree → Event state transitions
VersionedCollapsingMergeTree → Slowly changing records

Choosing the right engine can significantly reduce query complexity and execution time.

13. Monitor Query Performance

Use system tables to identify slow queries.

SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
ORDER BY event_time DESC
LIMIT 20;

Monitor:

Query duration
Rows read
Bytes read
Memory usage
Scan efficiency

Regular monitoring helps detect regressions before they affect production workloads.

14. Use LIMIT Efficiently

If users only require a subset of rows, don't scan the entire dataset.

SELECT *
FROM events
ORDER BY timestamp DESC
LIMIT 100;

Combined with a suitable ORDER BY key, ClickHouse® can stop reading data earlier.

15. Use Appropriate Data Types

Smaller data types improve:

Compression
Cache efficiency
Memory utilization
Query speed

Instead of:

user_id UInt64

Use:

user_id UInt32

if appropriate.

Other recommendations:

UInt8 instead of UInt32 where applicable
Date instead of DateTime if time isn't needed
Enum or LowCardinality for categorical values

16. Use Compression Codecs

Column-specific codecs can further reduce storage while improving read performance.

Examples:

temperature Float32 CODEC(Gorilla)

timestamp DateTime CODEC(DoubleDelta)

message String CODEC(ZSTD)

Choosing the right codec depends on the data distribution and access patterns.

17. Avoid Large OFFSET Pagination

OFFSET pagination forces ClickHouse® to scan and discard rows.

Avoid:

SELECT *
FROM events
ORDER BY timestamp
LIMIT 100 OFFSET 500000;

Prefer keyset pagination:

SELECT *
FROM events
WHERE timestamp > '2026-01-01 12:00:00'
ORDER BY timestamp
LIMIT 100;

Keyset pagination scales much better for large datasets.

18. Use Projections

Projections store alternative sorted layouts inside the same table.

ALTER TABLE events
ADD PROJECTION by_event_type
(
    SELECT *
    ORDER BY event_type
);

Queries filtering by event_type can automatically use the projection, reducing scan time without changing application queries.

19. Use Materialized Views

Repeatedly calculating the same aggregations is expensive.

Materialized Views compute aggregates during data insertion.

CREATE MATERIALIZED VIEW events_mv
TO events_hourly_summary
AS
SELECT
    toStartOfHour(timestamp) AS hour,
    event_type,
    count() AS total_count,
    sum(value) AS total_value
FROM events
GROUP BY hour, event_type;

Now dashboards can query the summary table instead of billions of raw rows.

SELECT
    hour,
    event_type,
    total_count,
    total_value
FROM events_hourly_summary
WHERE hour >= now() - INTERVAL 24 HOUR
ORDER BY hour;

20. Use EXPLAIN to Understand Query Execution

The EXPLAIN statement helps visualize how ClickHouse® executes queries.

Basic example:

EXPLAIN
SELECT count()
FROM events
WHERE user_id = 1001;

Useful options include:

EXPLAIN PIPELINE
SELECT count()
FROM events;

EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE user_id = 1001;

These commands reveal:

Query pipeline
Index usage
Estimated rows
Execution stages

They are invaluable when diagnosing slow queries.

Final Thoughts

ClickHouse® is already one of the fastest analytical databases available, but achieving consistent sub-second performance requires thoughtful schema design and efficient query patterns.

Start with the fundamentals: choose an effective ORDER BY key, partition wisely, avoid unnecessary column reads, and write filters that leverage the primary index. As your datasets grow, take advantage of advanced features such as skip indexes, PREWHERE, projections, compression codecs, materialized views, and the EXPLAIN statement to fine-tune performance.

By combining these best practices, ClickHouse® can efficiently process billions of rows with minimal latency, making it an exceptional platform for modern analytical workloads.

Small optimizations compound over time. The more closely your table design aligns with your query patterns, the greater the performance gains you'll achieve

DEV Community

Day 27 of 100 Days of ClickHouse® - Optimizing ClickHouse® Queries for Faster Execution

Introduction

Why Query Optimization Matters

1. Choose a Good ORDER BY Key

Best Practices

2. Partition Large Tables

Common Partition Keys

3. Avoid SELECT *

4. Use LowCardinality for String Columns

5. Filter on ORDER BY Columns First

6. Use Skip Indexes

7. Avoid Functions on Filtered Columns

8. Use PREWHERE

9. Use SAMPLE for Exploratory Queries

10. Insert Data in Batches

11. Avoid Expensive JOINs

12. Choose the Right MergeTree Engine

13. Monitor Query Performance

14. Use LIMIT Efficiently

15. Use Appropriate Data Types

16. Use Compression Codecs

17. Avoid Large OFFSET Pagination

18. Use Projections

19. Use Materialized Views

20. Use EXPLAIN to Understand Query Execution

Final Thoughts

Top comments (0)