DEV Community

Cover image for Day 27 of 100 Days of ClickHouse® - Optimizing ClickHouse® Queries for Faster Execution
Kanishga Subramani
Kanishga Subramani

Posted on

Day 27 of 100 Days of ClickHouse® - Optimizing ClickHouse® Queries for Faster Execution

Introduction

As datasets grow larger and analytical workloads become more demanding, query performance becomes increasingly important. While ClickHouse® is built for lightning-fast analytical processing, poorly optimized schemas and inefficient queries can still waste CPU cycles, consume excessive memory, and increase execution times.

The good news is that ClickHouse® provides numerous features to help you optimize both your data model and query patterns. From choosing the right ORDER BY key to using skip indexes, projections, materialized views, and efficient filtering techniques, a few design decisions can dramatically improve performance.

In this article, we'll explore practical techniques for optimizing ClickHouse® queries so your analytical workloads remain fast and scalable—even when working with billions of rows.


Why Query Optimization Matters

ClickHouse® is a column-oriented analytical database that processes data differently from traditional row-based databases.

It achieves exceptional performance through:

  • Columnar storage
  • Vectorized query execution
  • Parallel processing
  • Data compression
  • Sparse primary indexes
  • Intelligent data skipping

However, these optimizations are only effective when tables are designed correctly and queries take advantage of them.

Poorly optimized queries typically result in:

  • Increased query latency
  • Higher CPU utilization
  • Excessive memory consumption
  • Longer dashboard loading times
  • Reduced throughput
  • Higher infrastructure costs

Understanding how ClickHouse® reads and skips data is the foundation of writing efficient queries.


1. Choose a Good ORDER BY Key

The ORDER BY clause in MergeTree tables defines the physical sort order of data on disk.

Unlike many databases, this is far more than just a sorting preference—it determines how efficiently ClickHouse® can locate data using its sparse primary index.

A well-designed ORDER BY key allows ClickHouse® to skip large portions of data during query execution.

CREATE TABLE events
(
    user_id UInt32,
    event_type LowCardinality(String),
    timestamp DateTime,
    value Float64
)
ENGINE = MergeTree()
ORDER BY (user_id, event_type, timestamp);
Enter fullscreen mode Exit fullscreen mode

Now queries filtering on these columns become extremely efficient.

SELECT avg(value)
FROM events
WHERE user_id = 1001
  AND event_type = 'purchase'
  AND timestamp >= now() - INTERVAL 7 DAY;
Enter fullscreen mode Exit fullscreen mode

Best Practices

  • Put frequently filtered columns first.
  • Place high-selectivity columns earlier.
  • Avoid random or highly unique first columns unless appropriate.
  • Design ORDER BY around your most common queries.

2. Partition Large Tables

Partitioning divides data into separate physical parts.

When a query filters on the partition key, ClickHouse® skips entire partitions without reading them.

CREATE TABLE events
(
    user_id UInt32,
    event_type LowCardinality(String),
    timestamp DateTime,
    value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, event_type, timestamp);
Enter fullscreen mode Exit fullscreen mode

Query:

SELECT count()
FROM events
WHERE timestamp >= now() - INTERVAL 30 DAY;
Enter fullscreen mode Exit fullscreen mode

Instead of scanning years of data, ClickHouse® only reads recent monthly partitions.

Common Partition Keys

  • toYYYYMM(timestamp)
  • toYYYYMMDD(timestamp)
  • Region
  • Tenant ID (for multi-tenant systems)

Avoid over-partitioning, as too many small partitions can negatively impact performance.


3. Avoid SELECT *

ClickHouse® only reads the columns referenced in your query.

Using SELECT * forces the database to read and decompress every column, increasing I/O and CPU usage.

Slow:

SELECT *
FROM events
WHERE user_id = 1001;
Enter fullscreen mode Exit fullscreen mode

Better:

SELECT
    user_id,
    event_type,
    value
FROM events
WHERE user_id = 1001;
Enter fullscreen mode Exit fullscreen mode

Always retrieve only the columns your application actually needs.


4. Use LowCardinality for String Columns

Columns with relatively few unique values should use LowCardinality.

Examples include:

  • status
  • country
  • region
  • event_type
  • device_type

Instead of:

event_type String
Enter fullscreen mode Exit fullscreen mode

Use:

event_type LowCardinality(String)
Enter fullscreen mode Exit fullscreen mode

Benefits include:

  • Smaller storage
  • Better compression
  • Faster filtering
  • Faster GROUP BY
  • Reduced memory usage

As a rule of thumb, use LowCardinality for string columns with fewer than roughly 10,000 distinct values.


5. Filter on ORDER BY Columns First

ClickHouse® can only use its sparse primary index efficiently when filtering on the leading columns of the ORDER BY key.

Given:

ORDER BY (user_id, event_type, timestamp)
Enter fullscreen mode Exit fullscreen mode

Fast:

SELECT count()
FROM events
WHERE user_id = 1001
  AND event_type = 'purchase';
Enter fullscreen mode Exit fullscreen mode

Slow:

SELECT count()
FROM events
WHERE value > 100;
Enter fullscreen mode Exit fullscreen mode

If you frequently filter on non-indexed columns, consider adding a skip index.


6. Use Skip Indexes

Skip indexes help ClickHouse® skip granules when filtering on columns outside the primary key.

ALTER TABLE events
ADD INDEX idx_value value TYPE minmax GRANULARITY 4;
Enter fullscreen mode Exit fullscreen mode

Materialize it:

ALTER TABLE events
MATERIALIZE INDEX idx_value;
Enter fullscreen mode Exit fullscreen mode

Now:

SELECT count()
FROM events
WHERE value > 500;
Enter fullscreen mode Exit fullscreen mode

ClickHouse® skips granules whose minimum and maximum values fall outside the requested range.

Common skip index types include:

  • minmax
  • set
  • bloom_filter
  • tokenbf_v1

7. Avoid Functions on Filtered Columns

Applying functions to indexed columns prevents ClickHouse® from using the primary index efficiently.

Avoid:

SELECT count()
FROM events
WHERE toYear(timestamp) = 2024;
Enter fullscreen mode Exit fullscreen mode

Instead use range filters:

SELECT count()
FROM events
WHERE timestamp >= '2024-01-01'
  AND timestamp < '2025-01-01';
Enter fullscreen mode Exit fullscreen mode

Range conditions allow ClickHouse® to leverage index skipping effectively.


8. Use PREWHERE

PREWHERE is a ClickHouse® optimization that filters rows before loading all requested columns.

SELECT
    user_id,
    value
FROM events
PREWHERE value > 100
WHERE user_id = 1001;
Enter fullscreen mode Exit fullscreen mode

Benefits:

  • Reads fewer columns initially
  • Reduces disk I/O
  • Lowers memory consumption
  • Speeds up wide-table queries

Although ClickHouse® automatically applies PREWHERE in many cases, specifying it manually gives you greater control.


9. Use SAMPLE for Exploratory Queries

For approximate analytics, SAMPLE dramatically reduces execution time.

SELECT avg(value)
FROM events
SAMPLE 0.1
WHERE event_type = 'purchase';
Enter fullscreen mode Exit fullscreen mode

This reads only 10% of the data while producing statistically representative results.

Remember that SAMPLE requires a table created with a SAMPLE BY clause.


10. Insert Data in Batches

Large batch inserts are far more efficient than numerous single-row inserts.

Slow:

INSERT INTO events VALUES
(1001,'purchase',now(),99.99);

INSERT INTO events VALUES
(1002,'view',now(),0.0);
Enter fullscreen mode Exit fullscreen mode

Better:

INSERT INTO events VALUES
(1001,'purchase',now(),99.99),
(1002,'view',now(),0.0),
(1003,'click',now(),15.20);
Enter fullscreen mode Exit fullscreen mode

Batch inserts reduce merge overhead and improve ingestion throughput.


11. Avoid Expensive JOINs

Although JOIN performance has improved significantly, joins remain expensive compared to scanning a single table.

Whenever possible:

  • Denormalize frequently queried data.
  • Join only after filtering.
  • Join smaller tables to larger tables.
  • Consider Dictionaries for dimension lookups.

Instead of:

SELECT *
FROM orders o
JOIN customers c
ON o.customer_id = c.id;
Enter fullscreen mode Exit fullscreen mode

Filter first:

SELECT *
FROM
(
    SELECT *
    FROM orders
    WHERE order_date >= today() - 7
) o
JOIN customers c
ON o.customer_id = c.id;
Enter fullscreen mode Exit fullscreen mode

Reducing rows before the JOIN often leads to substantial performance improvements.


12. Choose the Right MergeTree Engine

Different workloads benefit from different MergeTree engines.

Examples include:

  • MergeTree → General-purpose analytics
  • ReplacingMergeTree → Deduplication
  • SummingMergeTree → Incremental aggregation
  • AggregatingMergeTree → Precomputed aggregates
  • CollapsingMergeTree → Event state transitions
  • VersionedCollapsingMergeTree → Slowly changing records

Choosing the right engine can significantly reduce query complexity and execution time.


13. Monitor Query Performance

Use system tables to identify slow queries.

SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
ORDER BY event_time DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode

Monitor:

  • Query duration
  • Rows read
  • Bytes read
  • Memory usage
  • Scan efficiency

Regular monitoring helps detect regressions before they affect production workloads.


14. Use LIMIT Efficiently

If users only require a subset of rows, don't scan the entire dataset.

SELECT *
FROM events
ORDER BY timestamp DESC
LIMIT 100;
Enter fullscreen mode Exit fullscreen mode

Combined with a suitable ORDER BY key, ClickHouse® can stop reading data earlier.


15. Use Appropriate Data Types

Smaller data types improve:

  • Compression
  • Cache efficiency
  • Memory utilization
  • Query speed

Instead of:

user_id UInt64
Enter fullscreen mode Exit fullscreen mode

Use:

user_id UInt32
Enter fullscreen mode Exit fullscreen mode

if appropriate.

Other recommendations:

  • UInt8 instead of UInt32 where applicable
  • Date instead of DateTime if time isn't needed
  • Enum or LowCardinality for categorical values

16. Use Compression Codecs

Column-specific codecs can further reduce storage while improving read performance.

Examples:

temperature Float32 CODEC(Gorilla)
Enter fullscreen mode Exit fullscreen mode
timestamp DateTime CODEC(DoubleDelta)
Enter fullscreen mode Exit fullscreen mode
message String CODEC(ZSTD)
Enter fullscreen mode Exit fullscreen mode

Choosing the right codec depends on the data distribution and access patterns.


17. Avoid Large OFFSET Pagination

OFFSET pagination forces ClickHouse® to scan and discard rows.

Avoid:

SELECT *
FROM events
ORDER BY timestamp
LIMIT 100 OFFSET 500000;
Enter fullscreen mode Exit fullscreen mode

Prefer keyset pagination:

SELECT *
FROM events
WHERE timestamp > '2026-01-01 12:00:00'
ORDER BY timestamp
LIMIT 100;
Enter fullscreen mode Exit fullscreen mode

Keyset pagination scales much better for large datasets.


18. Use Projections

Projections store alternative sorted layouts inside the same table.

ALTER TABLE events
ADD PROJECTION by_event_type
(
    SELECT *
    ORDER BY event_type
);
Enter fullscreen mode Exit fullscreen mode

Queries filtering by event_type can automatically use the projection, reducing scan time without changing application queries.


19. Use Materialized Views

Repeatedly calculating the same aggregations is expensive.

Materialized Views compute aggregates during data insertion.

CREATE MATERIALIZED VIEW events_mv
TO events_hourly_summary
AS
SELECT
    toStartOfHour(timestamp) AS hour,
    event_type,
    count() AS total_count,
    sum(value) AS total_value
FROM events
GROUP BY hour, event_type;
Enter fullscreen mode Exit fullscreen mode

Now dashboards can query the summary table instead of billions of raw rows.

SELECT
    hour,
    event_type,
    total_count,
    total_value
FROM events_hourly_summary
WHERE hour >= now() - INTERVAL 24 HOUR
ORDER BY hour;
Enter fullscreen mode Exit fullscreen mode

20. Use EXPLAIN to Understand Query Execution

The EXPLAIN statement helps visualize how ClickHouse® executes queries.

Basic example:

EXPLAIN
SELECT count()
FROM events
WHERE user_id = 1001;
Enter fullscreen mode Exit fullscreen mode

Useful options include:

EXPLAIN PIPELINE
SELECT count()
FROM events;
Enter fullscreen mode Exit fullscreen mode
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE user_id = 1001;
Enter fullscreen mode Exit fullscreen mode

These commands reveal:

  • Query pipeline
  • Index usage
  • Estimated rows
  • Execution stages

They are invaluable when diagnosing slow queries.


Final Thoughts

ClickHouse® is already one of the fastest analytical databases available, but achieving consistent sub-second performance requires thoughtful schema design and efficient query patterns.

Start with the fundamentals: choose an effective ORDER BY key, partition wisely, avoid unnecessary column reads, and write filters that leverage the primary index. As your datasets grow, take advantage of advanced features such as skip indexes, PREWHERE, projections, compression codecs, materialized views, and the EXPLAIN statement to fine-tune performance.

By combining these best practices, ClickHouse® can efficiently process billions of rows with minimal latency, making it an exceptional platform for modern analytical workloads.

Small optimizations compound over time. The more closely your table design aligns with your query patterns, the greater the performance gains you'll achieve

Top comments (0)