Introduction
As datasets grow larger and analytical workloads become more demanding, query performance becomes increasingly important. While ClickHouse® is built for lightning-fast analytical processing, poorly optimized schemas and inefficient queries can still waste CPU cycles, consume excessive memory, and increase execution times.
The good news is that ClickHouse® provides numerous features to help you optimize both your data model and query patterns. From choosing the right ORDER BY key to using skip indexes, projections, materialized views, and efficient filtering techniques, a few design decisions can dramatically improve performance.
In this article, we'll explore practical techniques for optimizing ClickHouse® queries so your analytical workloads remain fast and scalable—even when working with billions of rows.
Why Query Optimization Matters
ClickHouse® is a column-oriented analytical database that processes data differently from traditional row-based databases.
It achieves exceptional performance through:
- Columnar storage
- Vectorized query execution
- Parallel processing
- Data compression
- Sparse primary indexes
- Intelligent data skipping
However, these optimizations are only effective when tables are designed correctly and queries take advantage of them.
Poorly optimized queries typically result in:
- Increased query latency
- Higher CPU utilization
- Excessive memory consumption
- Longer dashboard loading times
- Reduced throughput
- Higher infrastructure costs
Understanding how ClickHouse® reads and skips data is the foundation of writing efficient queries.
1. Choose a Good ORDER BY Key
The ORDER BY clause in MergeTree tables defines the physical sort order of data on disk.
Unlike many databases, this is far more than just a sorting preference—it determines how efficiently ClickHouse® can locate data using its sparse primary index.
A well-designed ORDER BY key allows ClickHouse® to skip large portions of data during query execution.
CREATE TABLE events
(
user_id UInt32,
event_type LowCardinality(String),
timestamp DateTime,
value Float64
)
ENGINE = MergeTree()
ORDER BY (user_id, event_type, timestamp);
Now queries filtering on these columns become extremely efficient.
SELECT avg(value)
FROM events
WHERE user_id = 1001
AND event_type = 'purchase'
AND timestamp >= now() - INTERVAL 7 DAY;
Best Practices
- Put frequently filtered columns first.
- Place high-selectivity columns earlier.
- Avoid random or highly unique first columns unless appropriate.
- Design ORDER BY around your most common queries.
2. Partition Large Tables
Partitioning divides data into separate physical parts.
When a query filters on the partition key, ClickHouse® skips entire partitions without reading them.
CREATE TABLE events
(
user_id UInt32,
event_type LowCardinality(String),
timestamp DateTime,
value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, event_type, timestamp);
Query:
SELECT count()
FROM events
WHERE timestamp >= now() - INTERVAL 30 DAY;
Instead of scanning years of data, ClickHouse® only reads recent monthly partitions.
Common Partition Keys
toYYYYMM(timestamp)toYYYYMMDD(timestamp)- Region
- Tenant ID (for multi-tenant systems)
Avoid over-partitioning, as too many small partitions can negatively impact performance.
3. Avoid SELECT *
ClickHouse® only reads the columns referenced in your query.
Using SELECT * forces the database to read and decompress every column, increasing I/O and CPU usage.
Slow:
SELECT *
FROM events
WHERE user_id = 1001;
Better:
SELECT
user_id,
event_type,
value
FROM events
WHERE user_id = 1001;
Always retrieve only the columns your application actually needs.
4. Use LowCardinality for String Columns
Columns with relatively few unique values should use LowCardinality.
Examples include:
- status
- country
- region
- event_type
- device_type
Instead of:
event_type String
Use:
event_type LowCardinality(String)
Benefits include:
- Smaller storage
- Better compression
- Faster filtering
- Faster GROUP BY
- Reduced memory usage
As a rule of thumb, use LowCardinality for string columns with fewer than roughly 10,000 distinct values.
5. Filter on ORDER BY Columns First
ClickHouse® can only use its sparse primary index efficiently when filtering on the leading columns of the ORDER BY key.
Given:
ORDER BY (user_id, event_type, timestamp)
Fast:
SELECT count()
FROM events
WHERE user_id = 1001
AND event_type = 'purchase';
Slow:
SELECT count()
FROM events
WHERE value > 100;
If you frequently filter on non-indexed columns, consider adding a skip index.
6. Use Skip Indexes
Skip indexes help ClickHouse® skip granules when filtering on columns outside the primary key.
ALTER TABLE events
ADD INDEX idx_value value TYPE minmax GRANULARITY 4;
Materialize it:
ALTER TABLE events
MATERIALIZE INDEX idx_value;
Now:
SELECT count()
FROM events
WHERE value > 500;
ClickHouse® skips granules whose minimum and maximum values fall outside the requested range.
Common skip index types include:
- minmax
- set
- bloom_filter
- tokenbf_v1
7. Avoid Functions on Filtered Columns
Applying functions to indexed columns prevents ClickHouse® from using the primary index efficiently.
Avoid:
SELECT count()
FROM events
WHERE toYear(timestamp) = 2024;
Instead use range filters:
SELECT count()
FROM events
WHERE timestamp >= '2024-01-01'
AND timestamp < '2025-01-01';
Range conditions allow ClickHouse® to leverage index skipping effectively.
8. Use PREWHERE
PREWHERE is a ClickHouse® optimization that filters rows before loading all requested columns.
SELECT
user_id,
value
FROM events
PREWHERE value > 100
WHERE user_id = 1001;
Benefits:
- Reads fewer columns initially
- Reduces disk I/O
- Lowers memory consumption
- Speeds up wide-table queries
Although ClickHouse® automatically applies PREWHERE in many cases, specifying it manually gives you greater control.
9. Use SAMPLE for Exploratory Queries
For approximate analytics, SAMPLE dramatically reduces execution time.
SELECT avg(value)
FROM events
SAMPLE 0.1
WHERE event_type = 'purchase';
This reads only 10% of the data while producing statistically representative results.
Remember that SAMPLE requires a table created with a SAMPLE BY clause.
10. Insert Data in Batches
Large batch inserts are far more efficient than numerous single-row inserts.
Slow:
INSERT INTO events VALUES
(1001,'purchase',now(),99.99);
INSERT INTO events VALUES
(1002,'view',now(),0.0);
Better:
INSERT INTO events VALUES
(1001,'purchase',now(),99.99),
(1002,'view',now(),0.0),
(1003,'click',now(),15.20);
Batch inserts reduce merge overhead and improve ingestion throughput.
11. Avoid Expensive JOINs
Although JOIN performance has improved significantly, joins remain expensive compared to scanning a single table.
Whenever possible:
- Denormalize frequently queried data.
- Join only after filtering.
- Join smaller tables to larger tables.
- Consider Dictionaries for dimension lookups.
Instead of:
SELECT *
FROM orders o
JOIN customers c
ON o.customer_id = c.id;
Filter first:
SELECT *
FROM
(
SELECT *
FROM orders
WHERE order_date >= today() - 7
) o
JOIN customers c
ON o.customer_id = c.id;
Reducing rows before the JOIN often leads to substantial performance improvements.
12. Choose the Right MergeTree Engine
Different workloads benefit from different MergeTree engines.
Examples include:
- MergeTree → General-purpose analytics
- ReplacingMergeTree → Deduplication
- SummingMergeTree → Incremental aggregation
- AggregatingMergeTree → Precomputed aggregates
- CollapsingMergeTree → Event state transitions
- VersionedCollapsingMergeTree → Slowly changing records
Choosing the right engine can significantly reduce query complexity and execution time.
13. Monitor Query Performance
Use system tables to identify slow queries.
SELECT
query_duration_ms,
read_rows,
read_bytes,
memory_usage
FROM system.query_log
ORDER BY event_time DESC
LIMIT 20;
Monitor:
- Query duration
- Rows read
- Bytes read
- Memory usage
- Scan efficiency
Regular monitoring helps detect regressions before they affect production workloads.
14. Use LIMIT Efficiently
If users only require a subset of rows, don't scan the entire dataset.
SELECT *
FROM events
ORDER BY timestamp DESC
LIMIT 100;
Combined with a suitable ORDER BY key, ClickHouse® can stop reading data earlier.
15. Use Appropriate Data Types
Smaller data types improve:
- Compression
- Cache efficiency
- Memory utilization
- Query speed
Instead of:
user_id UInt64
Use:
user_id UInt32
if appropriate.
Other recommendations:
- UInt8 instead of UInt32 where applicable
- Date instead of DateTime if time isn't needed
- Enum or LowCardinality for categorical values
16. Use Compression Codecs
Column-specific codecs can further reduce storage while improving read performance.
Examples:
temperature Float32 CODEC(Gorilla)
timestamp DateTime CODEC(DoubleDelta)
message String CODEC(ZSTD)
Choosing the right codec depends on the data distribution and access patterns.
17. Avoid Large OFFSET Pagination
OFFSET pagination forces ClickHouse® to scan and discard rows.
Avoid:
SELECT *
FROM events
ORDER BY timestamp
LIMIT 100 OFFSET 500000;
Prefer keyset pagination:
SELECT *
FROM events
WHERE timestamp > '2026-01-01 12:00:00'
ORDER BY timestamp
LIMIT 100;
Keyset pagination scales much better for large datasets.
18. Use Projections
Projections store alternative sorted layouts inside the same table.
ALTER TABLE events
ADD PROJECTION by_event_type
(
SELECT *
ORDER BY event_type
);
Queries filtering by event_type can automatically use the projection, reducing scan time without changing application queries.
19. Use Materialized Views
Repeatedly calculating the same aggregations is expensive.
Materialized Views compute aggregates during data insertion.
CREATE MATERIALIZED VIEW events_mv
TO events_hourly_summary
AS
SELECT
toStartOfHour(timestamp) AS hour,
event_type,
count() AS total_count,
sum(value) AS total_value
FROM events
GROUP BY hour, event_type;
Now dashboards can query the summary table instead of billions of raw rows.
SELECT
hour,
event_type,
total_count,
total_value
FROM events_hourly_summary
WHERE hour >= now() - INTERVAL 24 HOUR
ORDER BY hour;
20. Use EXPLAIN to Understand Query Execution
The EXPLAIN statement helps visualize how ClickHouse® executes queries.
Basic example:
EXPLAIN
SELECT count()
FROM events
WHERE user_id = 1001;
Useful options include:
EXPLAIN PIPELINE
SELECT count()
FROM events;
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE user_id = 1001;
These commands reveal:
- Query pipeline
- Index usage
- Estimated rows
- Execution stages
They are invaluable when diagnosing slow queries.
Final Thoughts
ClickHouse® is already one of the fastest analytical databases available, but achieving consistent sub-second performance requires thoughtful schema design and efficient query patterns.
Start with the fundamentals: choose an effective ORDER BY key, partition wisely, avoid unnecessary column reads, and write filters that leverage the primary index. As your datasets grow, take advantage of advanced features such as skip indexes, PREWHERE, projections, compression codecs, materialized views, and the EXPLAIN statement to fine-tune performance.
By combining these best practices, ClickHouse® can efficiently process billions of rows with minimal latency, making it an exceptional platform for modern analytical workloads.
Small optimizations compound over time. The more closely your table design aligns with your query patterns, the greater the performance gains you'll achieve
Top comments (0)