Is it possible to delete old records from a ClickHouse table?

October 19, 2022 · 2 min read

The short answer is “yes”. ClickHouse has multiple mechanisms that allow freeing up disk space by removing old data. Each mechanism is aimed for different scenarios.

TTL

ClickHouse allows to automatically drop values when some condition happens. This condition is configured as an expression based on any columns, usually just static offset for any timestamp column.

The key advantage of this approach is that it does not need any external system to trigger, once TTL is configured, data removal happens automatically in background.

Note

TTL can also be used to move data not only to /dev/null, but also between different storage systems, like from SSD to HDD.

More details on configuring TTL.

DELETE FROM

DELETE FROM allows standard DELETE queries to be run in ClickHouse. The rows targeted in the filter clause are marked as deleted, and removed from future result sets. Cleanup of the rows happens asynchronously.

Note

DELETE FROM is generally available from version 23.3 and newer. On older versions, it is experimental and must be enabled with:

SET allow_experimental_lightweight_delete = true;

ALTER DELETE

ALTER DELETE removes rows using asynchronous batch operations. Unlike DELETE FROM, queries run after the ALTER DELETE and before the batch operations complete will include the rows targeted for deletion. For more details see the ALTER DELETE docs.

ALTER DELETE can be issued to flexibly remove old data. If you need to do it regularly, the main downside will be the need to have an external system to submit the query. There are also some performance considerations since mutations rewrite complete parts even there is only a single row to be deleted.

This is the most common approach to make your system based on ClickHouse GDPR-compliant.

More details on mutations.

DROP PARTITION

ALTER TABLE ... DROP PARTITION provides a cost-efficient way to drop a whole partition. It’s not that flexible and needs proper partitioning scheme configured on table creation, but still covers most common cases. Like mutations need to be executed from an external system for regular use.

More details on manipulating partitions.

TRUNCATE

It’s rather radical to drop all data from a table, but in some cases it might be exactly what you need.

More details on table truncation.

How to confirm if a Projection is used by the query?

July 10, 2022 · 2 min read

Question

How can I tell if a projection is used?

Answer

Create a sample database

CREATE database db1;

Create a sample table that will use column1 as the primary key

CREATE table db1.table1_projections
(
 column1 Int32,
 column2 Int32
)
engine = MergeTree()
order by column1;

Add a projection for_column2 to use column2 as the primary key

ALTER table db1.table1_projections add projection for_column2
(
  select * 
  order by column2
);

Insert test data

*this inserts 100000 rows with random numbers in column1 and column2

INSERT INTO db1.table1_projections
select 
 floor(randNormal(50, 5)) as column1,
 floor(randUniform(1, 100)) as column2
from numbers(100000);

Check sample set of data

clickhouse-cloud :) SELECT * from db1.table1_projections limit 5;

SELECT *
FROM db1.table1_projections
LIMIT 5

Query id: d6940799-b507-4a5e-9843-df55ebe818ab

┌─column1─┬─column2─┐
│      28 │      41 │
│      29 │      12 │
│      30 │      73 │
│      30 │      75 │
│      30 │      70 │
└─────────┴─────────┘

Test that it is using the original table with column1:

clickhouse-cloud :) explain indexes = 1 
                    SELECT count() from db1.table1_projections where column1 > 50;

EXPLAIN indexes = 1
SELECT count()
FROM db1.table1_projections
WHERE column1 > 50

Query id: e04d5236-1a05-4f1f-9502-7e41986beb44

┌─explain────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))        │
│   Aggregating                                      │
│     Expression (Before GROUP BY)                   │
│       Filter (WHERE)                               │
│         ReadFromMergeTree (db1.table1_projections) │
│         Indexes:                                   │
│           PrimaryKey                               │
│             Condition: true                        │
│             Parts: 1/1                             │
│             Granules: 12/12                        │
└────────────────────────────────────────────────────┘

*notice that it is reading from db1.table1_projections

Test reading from the projection by using column2 in the where clause

clickhouse-cloud :) explain indexes = 1 
                    SELECT * from db1.table1_projections where column2 > 50;

EXPLAIN indexes = 1
SELECT *
FROM db1.table1_projections
WHERE column2 > 50

Query id: d2b20e01-93bf-4b60-a370-4aac7b454267

┌─explain─────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│   Filter                                    │
│     ReadFromMergeTree (for_column2)         │
│     Indexes:                                │
│       PrimaryKey                            │
│         Keys:                               │
│           column2                           │
│         Condition: (column2 in [51, +Inf))  │
│         Parts: 1/1                          │
│         Granules: 6/12                      │
└─────────────────────────────────────────────┘

*notice that now the for_column2 projection is used.

For more info

Projections: https://clickhouse.com/docs/en/sql-reference/statements/alter/projection

numbers table function: https://clickhouse.com/docs/en/sql-reference/table-functions/numbers

Blog for generating random data: https://clickhouse.com/blog/generating-random-test-distribution-data-for-clickhouse

Python quick example using HTTP requests module

July 10, 2022 · 2 min read

Can I just run HTTP requests to ClickHouse server using requests module?

Answer

Yes, here is the sample code.

import requests
import datetime

create_replace_stmt = 'CREATE OR REPLACE TABLE test_table (name String, age UInt8) Engine=MergeTree ORDER BY tuple();'
select_query = 'SELECT count() FROM test_table'
insert_query = 'INSERT INTO test_table SELECT * FROM  generateRandom(\'name String, age UInt8\',1,1) LIMIT 300000000'

CH_URL = 'https://your_clickhouse_service_fqdn:8443'
CH_USER = 'default'
CH_PASSWORD = 'secret_pwd'

headers = {}
headers["X-ClickHouse-User"] = CH_USER
headers["X-ClickHouse-Key"] = CH_PASSWORD

now = (datetime.datetime.now())
print("{} - starting...".format(now))


# create/replace table
now = (datetime.datetime.now())
print("{} - creating/replacing table...".format(now))
response = requests.post(url=CH_URL,
                         params={"database": "default",
                                 "query": create_replace_stmt,
                                 "session_id": "my-session-id-string"
                                 },
                         headers=headers)

# select count()
response = requests.post(url=CH_URL,
                         params={"database": "default",
                                 "query": select_query,
                                 "session_id": "my-session-id-string"
                                 },
                         headers=headers)

now = (datetime.datetime.now())
print("{} - elements in test_table before insert: {}".format(
    now, response.content.decode('utf-8')))


# insert
now = (datetime.datetime.now())
print("{} - Inserting data...".format(now))
response = requests.post(url=CH_URL,
                         params={"database": "default",
                                 "query": insert_query,
                                 "session_id": "my-session-id-string",
                                 "wait_end_of_query": 1
                                 },
                         headers=headers)

now = (datetime.datetime.now())
print("{} - Done inserting data...".format(now))

response = requests.post(url=CH_URL,
                         params={"database": "default",
                                 "query": select_query,
                                 "session_id": "my-session-id-string",
                                 },
                         headers=headers)

now = (datetime.datetime.now())
print("{} - elements in test_table after insert: {}".format(
    now, response.content.decode('utf-8')))

Sample expected output:

(venv) ➜  venv/bin/python main.py
2023-07-07 14:54:27.336450 - starting...
2023-07-07 14:54:27.336476 - creating/replacing table...
2023-07-07 14:54:28.125270 - elements in test_table before insert: 0

2023-07-07 14:54:28.125352 - Inserting data...
2023-07-07 14:55:23.788466 - Done inserting data...
2023-07-07 14:55:23.962134 - elements in test_table after insert: 299115357

requirements.txt

requests==2.31.0

See this other KB for steps on how setup your python venv.

Setting a limit on query execution time

July 10, 2022 · One min read

How do I enforce a time limit on my queries?

Answer

You can use max_execution_time setting:

clickhouse-cloud :) SELECT 1 SETTINGS max_execution_time=0.0001

SELECT 1
SETTINGS max_execution_time = 0.0001

Query id: 3db752a7-b94f-4456-b3b9-ccbf290d1394

0 rows in set. Elapsed: 0.113 sec.

Received exception from server (version 23.5.1):
Code: 159. DB::Exception: Received from service.aws.clickhouse.cloud:9440. DB::Exception: Timeout exceeded: elapsed 0.000557862 seconds, maximum: 0.0001. (TIMEOUT_EXCEEDED)

Does ClickHouse support multi-region replication?

April 22, 2022 · One min read

The short answer is "yes". However, we recommend keeping latency between all regions/datacenters in two-digit range, otherwise write performance will suffer as it goes through distributed consensus protocol. For example, replication between US coasts will likely work fine, but between the US and Europe won't.

Configuration-wise there's no difference compared to single-region replication, simply use hosts that are located in different locations for replicas.

For more information, see full article on data replication.

What if I have a problem with encodings when using Oracle via ODBC?

April 22, 2022 · One min read

If you use Oracle as a source of ClickHouse external dictionaries via Oracle ODBC driver, you need to set the correct value for the NLS_LANG environment variable in /etc/default/clickhouse. For more information, see the Oracle NLS_LANG FAQ.

Example

NLS_LANG=RUSSIAN_RUSSIA.UTF8

What is a columnar database?

September 1, 2021 · 2 min read

A columnar database stores the data of each column independently. This allows reading data from disk only for those columns that are used in any given query. The cost is that operations that affect whole rows become proportionally more expensive. The synonym for a columnar database is a column-oriented database management system. ClickHouse is a typical example of such a system.

Key columnar database advantages are:

Queries that use only a few columns out of many.
Aggregating queries against large volumes of data.
Column-wise data compression.

Here is the illustration of the difference between traditional row-oriented systems and columnar databases when building reports:

Traditional row-oriented

Columnar

A columnar database is the preferred choice for analytical applications because it allows having many columns in a table just in case, but to not pay the cost for unused columns on read query execution time (a traditional OLTP database reads all of the data during queries as the data is stored in rows and not columns). Column-oriented databases are designed for big data processing and data warehousing, they often natively scale using distributed clusters of low-cost hardware to increase throughput. ClickHouse does it with combination of distributed and replicated tables.

What does “ClickHouse” mean?

September 1, 2021 · One min read

It’s a combination of “Clickstream” and “Data wareHouse”. It comes from the original use case at Yandex.Metrica, where ClickHouse was supposed to keep records of all clicks by people from all over the Internet, and it still does the job. You can read more about this use case on ClickHouse history page.

This two-part meaning has two consequences:

The only correct way to write ClickHouse is with capital H.
If you need to abbreviate it, use CH. For some historical reasons, abbreviating as CK is also popular in China, mostly because one of the first talks about ClickHouse in Chinese used this form.

Info

Many years after ClickHouse got its name, this approach of combining two words that are meaningful on their own has been highlighted as the best way to name a database in a research by Andy Pavlo, an Associate Professor of Databases at Carnegie Mellon University. ClickHouse shared his “best database name of all time” award with Postgres.

How do I contribute code to ClickHouse?

September 1, 2021 · One min read

ClickHouse is an open-source project developed on GitHub.

As customary, contribution instructions are published in CONTRIBUTING file in the root of the source code repository.

If you want to suggest a substantial change to ClickHouse, consider opening a GitHub issue explaining what you want to do, to discuss it with maintainers and community first. Examples of such RFC issues.

If your contributions are security related, please check out our security policy too.

JSON Extract example

September 1, 2021 · One min read

This is just a short example that illustrates the use of JSONExtract functions.

Create a table:

CREATE TABLE default.json_extract_example
(
    `rawJSON` String EPHEMERAL,
    `a1` String DEFAULT JSONExtractString(rawJSON, 'a1'),
    `a2` Boolean DEFAULT JSONExtractBool(rawJSON, 'a2'),
    `a3.aa1` Float DEFAULT JSONExtractFloat(JSONExtractRaw(rawJSON, 'a3'), 'aa1'),
    `a3.aa2` UInt8 DEFAULT JSONExtractUInt(JSONExtractRaw(rawJSON, 'a3'), 'aa2')
)
ENGINE = MergeTree
ORDER BY (a1, a2)

Add your JSON raw string:

INSERT INTO default.json_extract_example (rawJSON) VALUES ('{"a1": "XX", "a2": true, "a3":{"aa1":23.11,"aa2":12}}');

Query your data:

SELECT *
FROM json_extract_example
FORMAT Pretty

Yields:

┏━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ a1 ┃ a2   ┃ a3.aa1 ┃ a3.aa2 ┃
┡━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ XX │ true │  23.11 │     12 │
└────┴──────┴────────┴────────┘

Each stored as the original JSON type:

SELECT
    toTypeName(a1),
    toTypeName(a2),
    toTypeName(a3.aa1),
    toTypeName(a3.aa2)
FROM default.json_extract_example
FORMAT Pretty

┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ toTypeName(a1) ┃ toTypeName(a2) ┃ toTypeName(a3.aa1) ┃ toTypeName(a3.aa2) ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ String         │ Bool           │ Float32            │ UInt8              │
└────────────────┴────────────────┴────────────────────┴────────────────────┘

TTL​

DELETE FROM​

ALTER DELETE​

DROP PARTITION​

TRUNCATE​

Question​

Answer​

Answer

Answer

TTL

DELETE FROM

ALTER DELETE

DROP PARTITION

TRUNCATE

Question

Answer