Skip to main content

· 6 min read

The query_log table in the system database keeps track of all your queries, including:

  • how much memory the query consumed, and
  • how much CPU time was needed

The following query returns the top 10 queries, where "top" means the queries that used the most memory:

SELECT
type,
event_time,
initial_query_id,
query_id,
formatReadableSize(memory_usage) AS memory,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')] AS userCPU,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')] AS systemCPU,
normalizedQueryHash(query) AS normalized_query_hash
FROM clusterAllReplicas(default, system.query_log)
ORDER BY memory_usage DESC
LIMIT 10;

The response looks like:

┌─type────────┬──────────event_time─┬─initial_query_id─────────────────────┬─memory─────┬─────userCPU─┬──systemCPU─┬─normalized_query_hash─┐
│ QueryFinish │ 2023-03-26 21:36:07 │ 7fc488a5-838f-410d-88ee-2f492825a26b │ 3.45 GiB │ 28147128901 │ 8590897697 │ 178963678599600243 │
│ QueryFinish │ 2023-03-26 21:36:04 │ 7fc488a5-838f-410d-88ee-2f492825a26b │ 1.18 GiB │ 10194162387 │ 1183376457 │ 4121209451971717712 │
│ QueryFinish │ 2023-03-26 21:36:06 │ 7fc488a5-838f-410d-88ee-2f492825a26b │ 1.16 GiB │ 10516510952 │ 1484303318 │ 4121209451971717712 │
│ QueryFinish │ 2023-03-26 21:35:59 │ 7fc488a5-838f-410d-88ee-2f492825a26b │ 1.14 GiB │ 11484580963 │ 1464145099 │ 4121209451971717712 │
│ QueryFinish │ 2023-03-26 21:47:01 │ 8119e682-a343-4847-96e7-d34ad8a748a1 │ 455.29 MiB │ 123340498 │ 8234304 │ 10687606311941357470 │
│ QueryFinish │ 2023-03-26 22:07:05 │ f2690e48-fe1e-4367-ae9d-435d962003a5 │ 377.94 MiB │ 2358130001 │ 668098391 │ 5988812223780974416 │
│ QueryFinish │ 2023-03-26 20:45:42 │ 04618222-40a1-4299-8c3d-9f050a82d849 │ 18.48 MiB │ 24676 │ 16620 │ 3205198713665290475 │
│ QueryFinish │ 2023-03-26 22:14:37 │ badf1097-5f8f-4486-88e9-3a5ac2e4734c │ 17.41 MiB │ 186234 │ 148739 │ 1910846996890686559 │
│ QueryFinish │ 2023-03-26 21:39:42 │ 8d373327-f566-4cd5-9f2c-cec75f534751 │ 16.19 MiB │ 23169 │ 12365 │ 3205198713665290475 │
│ QueryFinish │ 2023-03-26 21:35:42 │ ea672dba-7c10-4dd4-b819-cad9dccbf5d0 │ 13.97 MiB │ 20696 │ 8001 │ 3205198713665290475 │
└─────────────┴─────────────────────┴──────────────────────────────────────┴────────────┴─────────────┴────────────┴───────────────────────┘
Note

The initial_query_id represents the ID of the initial query for distributed query execution launched from the node receiving the request. The query_id contains the ID of the child query executed on a different node. See this article for more details.

You can use the query ID to extract more details about the query. Let's research our longest running query above (the first one):

SELECT query
FROM clusterAllReplicas(default, system.query_log)
WHERE initial_query_id = '7fc488a5-838f-410d-88ee-2f492825a26b'

It turns out to be the query we used to insert a few billion rows of data into a table named youtube (see the YouTube dislikes dataset):

INSERT INTO youtube
SETTINGS input_format_null_as_default = 1
SELECT
id,
parseDateTimeBestEffortUS(toString(fetch_date)) AS fetch_date,
upload_date,
ifNull(title, '') AS title,
uploader_id,
ifNull(uploader, '') AS uploader,
uploader_sub_count,
is_age_limit,
view_count,
like_count,
dislike_count,
is_crawlable,
has_subtitles,
is_ads_enabled,
is_comments_enabled,
ifNull(description, '') AS description,
rich_metadata,
super_titles,
ifNull(uploader_badges, '') AS uploader_badges,
ifNull(video_badges, '') AS video_badges
FROM s3Cluster('default','https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst', 'JSONLines')

initial_query_id VS query_id

Note that in a clustered ClickHouse environment (like ClickHouse Cloud) initial_query_id represents the ID of the initial query for distributed query execution launched from the node receiving the request; then query_id field will contain the ID of the child query executed on a different node.

If we add query_id to the above query we pin our search around initial_query_id = a7262fa2-bd8b-4b51-a359-621ccf282417 and hostname():

SELECT
hostname(),
type,
event_time,
initial_query_id,
query_id,
formatReadableSize(memory_usage) AS memory,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')] AS userCPU,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')] AS systemCPU,
normalizedQueryHash(query) AS normalized_query_hash
FROM clusterAllReplicas(default, system.query_log)
WHERE initial_query_id = 'a7262fa2-bd8b-4b51-a359-621ccf282417'
ORDER BY memory_usage DESC
LIMIT 10 FORMAT Pretty;

we will get:

┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-0 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 1f810b3c-b3cb-4a7b-bc6c-8c8cc1e52515 │ 125.13 MiB │ 1754290 │ 133344 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-2 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 123.08 MiB │ 1849115 │ 123412 │ 4258439895846105173 │
└────────────────────────┴─────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴────────────┴─────────┴───────────┴───────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-1 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 7dfd9297-5173-4be7-a866-d7cbe1e1abab │ 93.77 MiB │ 1890981 │ 101724 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼───────────┼─────────┼───────────┼───────────────────────┤
│ server-1 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 7dfd9297-5173-4be7-a866-d7cbe1e1abab │ 0.00 B │ 0 │ 0 │ 17604798521132779336 │
└────────────────────────┴─────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴───────────┴─────────┴───────────┴───────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-0 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 1f810b3c-b3cb-4a7b-bc6c-8c8cc1e52515 │ 0.00 B │ 0 │ 0 │ 17604798521132779336 │
├────────────────────────┼────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────┼───────────┼───────────────────────┤
│ server-2 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 0.00 B │ 0 │ 0 │ 4258439895846105173 │
└────────────────────────┴────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴────────┴─────────┴───────────┴───────────────────────┘

Note we have several results from several hosts (the different cluster nodes).

To refine further and get only the child queries we could also add the query_id != initial_query_id condition to the WHERE clause:

SELECT
hostname(),
type,
event_time,
initial_query_id,
query_id,
formatReadableSize(memory_usage) AS memory,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')] AS userCPU,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')] AS systemCPU,
normalizedQueryHash(query) AS normalized_query_hash
FROM clusterAllReplicas(default, system.query_log)
WHERE (query_id = initial_query_id) AND (initial_query_id = 'a7262fa2-bd8b-4b51-a359-621ccf282417')
ORDER BY memory_usage DESC
LIMIT 10 FORMAT Pretty;

returns all the child queries executed on the remote nodes (remote to the node where the query was first thrown at):

┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-1 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 7dfd9297-5173-4be7-a866-d7cbe1e1abab │ 93.77 MiB │ 1890981 │ 101724 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-0 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 1f810b3c-b3cb-4a7b-bc6c-8c8cc1e52515 │ 125.13 MiB │ 1754290 │ 133344 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-1 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 7dfd9297-5173-4be7-a866-d7cbe1e1abab │ 0.00 B │ 0 │ 0 │ 17604798521132779336 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-0 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 1f810b3c-b3cb-4a7b-bc6c-8c8cc1e52515 │ 0.00 B │ 0 │ 0 │ 17604798521132779336 │
└────────────────────────┴─────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴────────────┴─────────┴───────────┴───────────────────────┘
└────────────────────────┴─────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴────────────┴─────────┴───────────┴───────────────────────┘

conversely, query_id = initial_query_id will return only the queries executed on the local node where the distributed query was first thrown at:

SELECT
hostname(),
type,
event_time,
initial_query_id,
query_id,
formatReadableSize(memory_usage) AS memory,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')] AS userCPU,
ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')] AS systemCPU,
normalizedQueryHash(query) AS normalized_query_hash
FROM clusterAllReplicas(default, system.query_log)
WHERE (query_id = initial_query_id) AND (initial_query_id = 'a7262fa2-bd8b-4b51-a359-621ccf282417')
ORDER BY memory_usage DESC
LIMIT 10 FORMAT Pretty;

giving:

┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ hostname() ┃ type ┃ event_time ┃ initial_query_id ┃ query_id ┃ memory ┃ userCPU ┃ systemCPU ┃ normalized_query_hash ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ server-2 │ QueryFinish │ 2023-04-26 06:25:53 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 123.08 MiB │ 1849115 │ 123412 │ 4258439895846105173 │
├────────────────────────┼─────────────┼─────────────────────┼──────────────────────────────────────┼──────────────────────────────────────┼────────────┼─────────┼───────────┼───────────────────────┤
│ server-2 │ QueryStart │ 2023-04-26 06:25:52 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ a7262fa2-bd8b-4b51-a359-621ccf282417 │ 0.00 B │ 0 │ 0 │ 4258439895846105173 │
└────────────────────────┴─────────────┴─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴────────────┴─────────┴───────────┴───────────────────────┘

As for other System Tables, you can find more details about the meaning of each field in our docs.

· 2 min read

Suppose you create a table that uses the File table engine with the Parquet format:

CREATE TABLE parquet_test
(
`x` UInt32,
`y` String
)
ENGINE = File(Parquet)

You can write to the table once:

INSERT INTO parquet_test VALUES
(1, 'Hello'),
(2, 'Hi')

This creates a file named data.Parquet in the data/default/parquet_test folder. If you try to insert another batch:

INSERT INTO parquet_test VALUES
(3, 'World'),
(4, 'Bye')

...you get the following error:

Code: 641. DB::Exception: Received from localhost:9000. DB::Exception: Cannot append data in format Parquet to file, because this format doesn't support appends. You can allow to create a new file on each insert by enabling setting engine_file_allow_create_multiple_files. (CANNOT_APPEND_TO_FILE)

You can not append to Parquet files in ClickHouse. But you can tell ClickHouse to create a new file for every INSERT by enabling the engine_file_allow_create_multiple_files setting. If enabled, on each insert a new file will be created with a name following this pattern:

`data.Parquet` -> `data.1.Parquet` -> `data.2.Parquet`, etc.:

Let's give it a try. We will put our two commands into a single file named parquet.sql:

SET engine_file_allow_create_multiple_files = 1;

INSERT INTO default.parquet_test VALUES (3, 'World'), (4, 'Bye');

Run it using clickhouse-client:

./clickhouse client --queries-file parquet.sql

And now you will see two files in data/default/parquet_test (and a new file for each subsequent insert).

Note

The engine_file_allow_create_multiple_files setting applies to other data formats that are not appendable, like JSON and ORC.

· 2 min read

Question: How do I show all queries involving materialized views in the last 60m?

Answer:

This query will display all queries directed towards Materialized Views considering that:

  • we can leverage the create_table_query field in system.tables table to identify what tables are explicit (TO) recipient of MVs;
  • we can track back (using uuid and the name convention .inner_id.<uuid>) what tables are implicit recipient of MVs;

We can also configure how long back in time we want to look, by changing the value (60 m by default) in the initial query CTE

WITH(60) -- default 60m
AS timeRange,
(
--prepare names of possible implicit MV hidden target tables for *any* table with NON NULL uuid
SELECT groupArray(
concat('default.`.inner_id.', toString(uuid), '`')
)
FROM clusterAllReplicas(default, system.tables)
WHERE notEmpty(uuid)
) AS MV_implicit_possible_hidden_target_tables_names_array,
(
--captures MV name and target tables (if TO is specified)
--TODO it seems that extract will return just the first capturing group :( replace with regexpExtract once available
SELECT arrayFilter(
x->x != '',
--remove empty captures
groupArray(
extract(
create_table_query,
'^CREATE MATERIALIZED VIEW\s(\w+\.\w+)\s(?:TO\s(\S+))?'
)
)
)
FROM clusterAllReplicas(default, system.tables)
WHERE engine = 'MaterializedView'
) AS MV_explicit_target_tables_names_array
SELECT event_time,
query,
tables as "MVs tables"
FROM clusterAllReplicas(default, system.query_log)
WHERE (
-- only SELECT within 60m
event_time > now() - toIntervalMinute(timeRange)
AND startsWith(query, 'SELECT')
) -- check either that query involves implicit MV target table names
AND (
hasAny(
tables,
MV_implicit_possible_hidden_target_tables_names_array
)
OR -- check that query involves explicit MV target table
hasAny(tables, MV_explicit_target_tables_names_array)
)
ORDER BY event_time DESC;

expected output:

| event_time          | query                                                                                          | MVs tables                                                            |
| ------------------- | ---------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| 2023-02-23 08:14:14 | SELECT rand(),* FROM default.sum_of_volumes, default.big_changes, system.users | ["default.big_changes_mv","default.sum_of_volumes_mv","system.users"] |
| 2023-02-23 08:04:47 | SELECT price,* FROM default.sum_of_volumes, default.big_changes | ["default.big_changes_mv","default.sum_of_volumes_mv"] |

In this example results above default.big_changes_mv and default.sum_of_volumes_mv are both materialized views.

· 6 min read

Question: How do I import JSON arrays and how can I query the inner objects?

Answer:

Dump this 1 line JSON array to sample.json

{"_id":"1","channel":"help","events":[{"eventType":"open","time":"2021-06-18T09:42:39.527Z"},{"eventType":"close","time":"2021-06-18T09:48:05.646Z"}]},{"_id":"2","channel":"help","events":[{"eventType":"open","time":"2021-06-18T09:42:39.535Z"},{"eventType":"edit","time":"2021-06-18T09:42:41.317Z"}]},{"_id":"3","channel":"questions","events":[{"eventType":"close","time":"2021-06-18T09:42:39.543Z"},{"eventType":"create","time":"2021-06-18T09:52:51.299Z"}]},{"_id":"4","channel":"general","events":[{"eventType":"create","time":"2021-06-18T09:42:39.552Z"},{"eventType":"edit","time":"2021-06-18T09:47:29.109Z"}]},{"_id":"5","channel":"general","events":[{"eventType":"edit","time":"2021-06-18T09:42:39.560Z"},{"eventType":"open","time":"2021-06-18T09:42:39.680Z"},{"eventType":"close","time":"2021-06-18T09:42:41.207Z"},{"eventType":"edit","time":"2021-06-18T09:42:43.372Z"},{"eventType":"edit","time":"2021-06-18T09:42:45.642Z"}]}

Check the data:

clickhousebook.local :) SELECT * FROM file('/path/to/sample.json','JSONEachRow');

SELECT *
FROM file('/path/to/sample.json', 'JSONEachRow')

Query id: 0bbfa09f-ac7f-4a1e-9227-2961b5ffc2d4

┌─_id─┬─channel───┬─events─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1 │ help │ [{'eventType':'open','time':'2021-06-18T09:42:39.527Z'},{'eventType':'close','time':'2021-06-18T09:48:05.646Z'}]
2 │ help │ [{'eventType':'open','time':'2021-06-18T09:42:39.535Z'},{'eventType':'edit','time':'2021-06-18T09:42:41.317Z'}]
3 │ questions │ [{'eventType':'close','time':'2021-06-18T09:42:39.543Z'},{'eventType':'create','time':'2021-06-18T09:52:51.299Z'}]
4 │ general │ [{'eventType':'create','time':'2021-06-18T09:42:39.552Z'},{'eventType':'edit','time':'2021-06-18T09:47:29.109Z'}]
5 │ general │ [{'eventType':'edit','time':'2021-06-18T09:42:39.560Z'},{'eventType':'open','time':'2021-06-18T09:42:39.680Z'},{'eventType':'close','time':'2021-06-18T09:42:41.207Z'},{'eventType':'edit','time':'2021-06-18T09:42:43.372Z'},{'eventType':'edit','time':'2021-06-18T09:42:45.642Z'}]
└─────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

Create a table to receive the JSON rows:

clickhousebook.local :) CREATE TABLE IF NOT EXISTS sample_json_objects_array (
`rawJSON` String EPHEMERAL,
`_id` String DEFAULT JSONExtractString(rawJSON, '_id'),
`channel` String DEFAULT JSONExtractString(rawJSON, 'channel'),
`events` Array(JSON) DEFAULT JSONExtractArrayRaw(rawJSON, 'events')
) ENGINE = MergeTree
ORDER BY
channel

CREATE TABLE IF NOT EXISTS sample_json_objects_array
(
`rawJSON` String EPHEMERAL,
`_id` String DEFAULT JSONExtractString(rawJSON, '_id'),
`channel` String DEFAULT JSONExtractString(rawJSON, 'channel'),
`events` Array(JSON) DEFAULT JSONExtractArrayRaw(rawJSON, 'events')
)
ENGINE = MergeTree
ORDER BY channel

Query id: d02696dd-3f9f-4863-be2a-b2c9a1ae922d


0 rows in set. Elapsed: 0.173 sec.

Insert the data:

clickhousebook.local :) INSERT INTO
sample_json_objects_array
SELECT
*
FROM
file(
'/opt/cases/000000/sample_json_objects_arrays.json',
'JSONEachRow'
);

INSERT INTO sample_json_objects_array SELECT *
FROM file('/opt/cases/000000/sample.json', 'JSONEachRow')

Query id: 60c4beab-3c2c-40c1-9c6f-bbbd7118dde3

Ok.

0 rows in set. Elapsed: 0.002 sec.

Check how the data inference acted on JSON object type:

clickhousebook.local :) DESCRIBE TABLE sample_json_objects_array SETTINGS describe_extend_object_types = 1;

DESCRIBE TABLE sample_json_objects_array
SETTINGS describe_extend_object_types = 1

Query id: 302c0c84-1b63-4f60-ad95-d91c0267b0d4

┌─name────┬─type────────────────────────────────────────┬─default_type─┬─default_expression─────────────────────┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ rawJSON │ String │ EPHEMERAL │ defaultValueOfTypeName('String') │ │ │ │
│ _id │ String │ DEFAULT │ JSONExtractString(rawJSON, '_id') │ │ │ │
│ channel │ String │ DEFAULT │ JSONExtractString(rawJSON, 'channel') │ │ │ │
│ events │ Array(Tuple(eventType String, time String))DEFAULT │ JSONExtractArrayRaw(rawJSON, 'events') │ │ │ │
└─────────┴─────────────────────────────────────────────┴──────────────┴────────────────────────────────────────┴─────────┴──────────────────┴────────────────┘

Events is an Array of Tuple each containing a eventType String and a time String fields. This latter type is suboptimal (we'd want DateTime instead).

Let's see the data:

clickhousebook.local :) SELECT
_id,
channel,
events.eventType,
events.time
FROM sample_json_objects_array
WHERE has(events.eventType, 'close')

SELECT
_id,
channel,
events.eventType,
events.time
FROM sample_json_objects_array
WHERE has(events.eventType, 'close')

Query id: 3ddd6843-5206-4f52-971f-1699f0ba1728

┌─_id─┬─channel───┬─events.eventType──────────────────────┬─events.time──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
5 │ general │ ['edit','open','close','edit','edit']['2021-06-18T09:42:39.560Z','2021-06-18T09:42:39.680Z','2021-06-18T09:42:41.207Z','2021-06-18T09:42:43.372Z','2021-06-18T09:42:45.642Z']
1 │ help │ ['open','close']['2021-06-18T09:42:39.527Z','2021-06-18T09:48:05.646Z']
3 │ questions │ ['close','create']['2021-06-18T09:42:39.543Z','2021-06-18T09:52:51.299Z']
└─────┴───────────┴───────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

3 rows in set. Elapsed: 0.001 sec.

Let's run a few queries:

_id and channel of events that have an eventType of value close

clickhousebook.local :) SELECT
_id,
channel,
events.eventType
FROM
sample_json_objects_array
WHERE
has(events.eventType,'close')

SELECT
_id,
channel,
events.eventType
FROM sample_json_objects_array
WHERE has(events.eventType, 'close')

Query id: 033a0c56-7bfa-4261-a334-7323bdc40f87

┌─_id─┬─channel───┬─events.eventType──────────────────────┐
5 │ general │ ['edit','open','close','edit','edit']
1 │ help │ ['open','close']
3 │ questions │ ['close','create']
└─────┴───────────┴───────────────────────────────────────┘
┌─_id─┬─channel───┬─events.eventType──────────────────────┐
5 │ general │ ['edit','open','close','edit','edit']
1 │ help │ ['open','close']
3 │ questions │ ['close','create']
└─────┴───────────┴───────────────────────────────────────┘

6 rows in set. Elapsed: 0.001 sec.

We want to query the time , for example all events between a given time range, but we notice it was imported as String:

clickhousebook.local :) SELECT toTypeName(events.time) FROM sample_json_objects_array;

SELECT toTypeName(events.time)
FROM sample_json_objects_array

Query id: 27f07f02-66cd-420d-8623-eeed7d501014

┌─toTypeName(events.time)─┐
│ Array(String)
│ Array(String)
│ Array(String)
│ Array(String)
│ Array(String)
└─────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

So, in order to handle these as dates, first we want to convert to DateTime. To convert an array we use a map function:

clickhousebook.local :) 
SELECT
_id,
channel,
arrayMap(x->parseDateTimeBestEffort(x), events.time)
FROM
sample_json_objects_array

SELECT
_id,
channel,
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
FROM sample_json_objects_array

Query id: f3c7881e-b41c-4872-9c67-5c25966599a1

┌─_id─┬─channel───┬─arrayMap(lambda(tuple(x), parseDateTimeBestEffort(x)), events.time)─────────────────────────────────────────────┐
4 │ general │ ['2021-06-18 11:42:39','2021-06-18 11:47:29']
5 │ general │ ['2021-06-18 11:42:39','2021-06-18 11:42:39','2021-06-18 11:42:41','2021-06-18 11:42:43','2021-06-18 11:42:45']
1 │ help │ ['2021-06-18 11:42:39','2021-06-18 11:48:05']
2 │ help │ ['2021-06-18 11:42:39','2021-06-18 11:42:41']
3 │ questions │ ['2021-06-18 11:42:39','2021-06-18 11:52:51']
└─────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

we can appreciate the diffs using toTypeName on both the arrays:

clickhousebook.local :) SELECT
_id,
channel,
toTypeName(events.time) as events_as_strings,
toTypeName(arrayMap(x->parseDateTimeBestEffort(x), events.time)) as events_as_datetime
FROM
sample_json_objects_array

SELECT
_id,
channel,
toTypeName(events.time) AS events_as_strings,
toTypeName(arrayMap(x -> parseDateTimeBestEffort(x), events.time)) AS events_as_datetime
FROM sample_json_objects_array

Query id: 1af54994-b756-472f-88d7-8b5cdca0e54e

┌─_id─┬─channel───┬─events_as_strings─┬─events_as_datetime─┐
4 │ general │ Array(String) │ Array(DateTime)
5 │ general │ Array(String) │ Array(DateTime)
1 │ help │ Array(String) │ Array(DateTime)
2 │ help │ Array(String) │ Array(DateTime)
3 │ questions │ Array(String) │ Array(DateTime)
└─────┴───────────┴───────────────────┴────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

now let's get the id of of the rows where time is between a given interval.

we use arrayCount to see if there is a count greater than than 0 of items in the array returned by the map function that will match the condition x BETWEEN toDateTime('2021-06-18 11:46:00', 'Europe/Rome') AND toDateTime('2021-06-18 11:50:00', 'Europe/Rome')

clickhousebook.local :) SELECT
_id,
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
FROM
sample_json_objects_array
WHERE
arrayCount(
x -> x BETWEEN toDateTime('2021-06-18 11:46:00', 'Europe/Rome')
AND toDateTime('2021-06-18 11:50:00', 'Europe/Rome'),
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
) > 0;

SELECT
_id,
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
FROM sample_json_objects_array
WHERE arrayCount(x -> ((x >= toDateTime('2021-06-18 11:46:00', 'Europe/Rome')) AND (x <= toDateTime('2021-06-18 11:50:00', 'Europe/Rome'))), arrayMap(x -> parseDateTimeBestEffort(x), events.time)) > 0

Query id: d4882fc3-9f99-4e87-9f89-47683f10656d

┌─_id─┬─arrayMap(lambda(tuple(x), parseDateTimeBestEffort(x)), events.time)─┐
4['2021-06-18 11:42:39','2021-06-18 11:47:29']
1['2021-06-18 11:42:39','2021-06-18 11:48:05']
└─────┴─────────────────────────────────────────────────────────────────────┘

2 rows in set. Elapsed: 0.002 sec.

⚠️

Please remember, at the time of writing this article the current implementation of JSON is experimental and not suited for production.

This example highlights how to quickly import JSON and start querying it and represents a tradeoff between the ease of use where we import the JSON objects as JSON type with no need to specify upfront the schema type. Convenient for a quick test however for long term use of the data we would like to, with regards to this example to store the data using the most appropriate types, so for the time field, use DateTime instead of String, in order to avoid any post-ingestion phase conversion as illustrated above. Please refer to the documentation for more about handling JSON.

· 2 min read

Question:

How can I quickly recreate a table and its data using just copy/paste across different terminals?

Answer:

This is NOT a recommended practice to migrate data from one database to another and it should NOT be used for production data migration.

This is simply intended as a quick and dirty way to recreate small amount of data when developing across multiple environments.

  1. Get the CREATE TABLE statement with SHOW CREATE table:
SHOW CREATE TABLE cookies;

SHOW CREATE TABLE cookies

Query id: 248ec8e2-5bce-45b3-97d9-ed68edf445a5

┌─statement────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
CREATE TABLE default.cookies
(
`id` String,
`timestamp` DateTime
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

  1. Get the data export using FORMAT SQLInsert
SELECT * FROM cookies FORMAT SQLInsert;

SELECT *
FROM cookies
FORMAT SQLInsert

Query id: 383759b8-69c0-4561-ab95-f8224abc0071

INSERT INTO table (`id`, `timestamp`) VALUES ('4', '2023-03-15 16:28:46')
, ('2', '2023-03-15 16:28:41')
, ('1', '2023-03-15 16:11:02'), ('1', '2023-03-15 16:11:40'), ('1', '2023-03-15 16:11:48'), ('1', '2023-03-15 16:16:05'), ('2', '2023-03-15 16:11:06'), ('3', '2023-03-15 16:11:12'), ('3', '2023-03-15 16:11:45'), ('3', '2023-03-15 16:16:08'), ('4', '2023-03-15 16:11:14'), ('4', '2023-03-15 16:11:50'), ('4', '2023-03-15 16:16:01'), ('5', '2023-03-15 16:11:18'), ('5', '2023-03-15 16:16:11')
;

15 rows in set. Elapsed: 0.023 sec.

Note you will need to replace the name table at point 2 with the actual table name (cookies in this example)

· One min read

I want to export data segregating the path in S3 bucket to follow a structure like:

  • 2022
    • 1
    • 2
    • ...
    • 12
  • 2021
    • 1
    • 2
    • ...
    • 12

and so on ...

Answer

Considering the ClickHouse table:

CREATE TABLE sample_data (
`name` String,
`age` Int,
`time` DateTime
) ENGINE = MergeTree
ORDER BY
name

Add 10000 entries:

INSERT INTO
sample_data
SELECT
*
FROM
generateRandom(
'name String, age Int, time DateTime',
10,
10,
10
)
LIMIT
10000;

Run this to create the desired structure in s3 bucket my_bucket (note this example writes files in parquet format):

INSERT INTO
FUNCTION s3(
'https://s3-host:4321/my_bucket/{_partition_id}/file.parquet.gz',
's3-access-key',
's3-secret-access-key',
Parquet,
'name String, age Int, time DateTime'
) PARTITION BY concat(
formatDateTime(time, '%Y'),
'/',
formatDateTime(time, '%m')
)
SELECT
name,
age,
time
FROM
sample_data
Query id: 55adcf22-f6af-491e-b697-d09694bbcc56

Ok.

0 rows in set. Elapsed: 15.579 sec. Processed 10.00 thousand rows, 219.93 KB (641.87 rows/s., 14.12 KB/s.)

· 3 min read

ClickHouse has a built-in debugger and introspection capabilities. For example, you can get the stack traces of every server's thread at runtime by querying the system.stack_trace table:

SELECT
count(),
arrayStringConcat(arrayMap(x -> concat(demangle(addressToSymbol(x)), '\n ', addressToLine(x)), trace), '\n') AS sym
FROM system.stack_trace
GROUP BY trace
ORDER BY count() DESC
LIMIT 10
FORMAT Vertical
SETTINGS allow_introspection_functions = 1;

The query result will show the locations in the ClickHouse source code where the threads are running or waiting. (You will need to set allow_introspection_functions to 1 to enable the introspection functions.) The response looks like:

Row 1:
──────
count(): 144
sym: pthread_cond_wait

DB::BackgroundSchedulePool::threadFunction()
/usr/bin/clickhouse

/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 2:
──────
count(): 80
sym: pthread_cond_wait

std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
/usr/bin/clickhouse
DB::MergeTreeBackgroundExecutor<DB::OrdinaryRuntimeQueue>::threadFunction()
/usr/bin/clickhouse
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 3:
──────
count(): 55
sym: pthread_cond_wait

ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 4:
──────
count(): 16
sym:

DB::AsynchronousInsertQueue::processBatchDeadlines(unsigned long)
/usr/bin/clickhouse

/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 5:
──────
count(): 16
sym: pthread_cond_wait

std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
/usr/bin/clickhouse
DB::MergeTreeBackgroundExecutor<DB::MergeMutateRuntimeQueue>::threadFunction()
/usr/bin/clickhouse
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 6:
──────
count(): 10
sym: poll

Poco::Net::SocketImpl::pollImpl(Poco::Timespan&, int)
/usr/bin/clickhouse
Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
/usr/bin/clickhouse
Poco::Net::TCPServer::run()
/usr/bin/clickhouse
Poco::ThreadImpl::runnableEntry(void*)
/usr/bin/clickhouse


clone


Row 7:
──────
count(): 9
sym: pthread_cond_wait

ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 8:
──────
count(): 7
sym: poll

Poco::Net::SocketImpl::pollImpl(Poco::Timespan&, int)
/usr/bin/clickhouse
Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
/usr/bin/clickhouse
DB::ReadBufferFromPocoSocket::poll(unsigned long) const
/usr/bin/clickhouse
DB::TCPHandler::runImpl()
/usr/bin/clickhouse
DB::TCPHandler::run()
/usr/bin/clickhouse

/usr/bin/clickhouse
Poco::Net::TCPServerConnection::start()
/usr/bin/clickhouse
Poco::Net::TCPServerDispatcher::run()
/usr/bin/clickhouse
Poco::PooledThread::run()
/usr/bin/clickhouse
Poco::ThreadImpl::runnableEntry(void*)
/usr/bin/clickhouse


clone


Row 9:
───────
count(): 3
sym: pthread_cond_wait

Poco::EventImpl::waitImpl()
/usr/bin/clickhouse
DB::DDLWorker::runCleanupThread()
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 10:
───────
count(): 3
sym: pthread_cond_wait

Poco::EventImpl::waitImpl()
/usr/bin/clickhouse
DB::DDLWorker::runMainThread()
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


10 rows in set. Elapsed: 0.026 sec.
Note

If you installed ClickHouse from a .deb/.rpm/.tgz you can also install the package with the debug info to see the line numbers from the source code:

sudo apt install clickhouse-common-static-dbg

If you've installed ClickHouse as a single-binary, it already contains the debug info.

Tip

For more high-level information, check out some of these other system tables:

And there is handy info in the other system tables also.

· 2 min read

Using INTO OUTFILE Clause

Add an INTO OUTFILE clause to your query.

For example:

SELECT * FROM table INTO OUTFILE 'file'

By default, ClickHouse uses the file extension of the filename to deteremine the output format and compression. For example, all of the rows in nyc_taxi will be exported to the nyc_taxi.parquet using the Parquet format:

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.parquet'

And the following file will be a compressed, tab-separated file:

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.tsv.gz'

If ClickHouse can not determine the format from the file extension, then the output format defaults to TabSeparated for output data. To specify the output format, use the FORMAT clause.

For example:

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.txt'
FORMAT CSV

Using the File table engine

Another option is to use the File table engine, where ClickHouse uses the file to store the data. You can perform queries and inserts directly on the file.

For example:

CREATE TABLE my_table (
x UInt32,
y String,
z DateTime
)
ENGINE = File(Parquet)

Insert a few rows:

INSERT INTO my_table VALUES
(1, 'Hello', now()),
(2, 'World', now()),
(3, 'Goodbye', now())

The file is stored in the data folder of your ClickHouse server - specifically in /data/default/my_table in a file named data.Parquet.

Note

Using the File table engine is incredibly handy for creating and querying files on your file system, but keep in mind that File tables are not MergeTree tables, so you don't get all the benefits that come with MergeTree. Use File for convenience when exporting data out of ClickHouse in convenient formats.

Using Command-Line Redirection

$ clickhouse-client --query "SELECT * from table" --format FormatName > result.txt

See clickhouse-client.

· 3 min read

Below are some basics of using the S3 table engine to read parquet files.

  • create access and secret keys for an IAM service user. normal login users usually don't work since they may have been configured with an MFA policy.

  • set the permissions on the policy to allow the service user to access the bucket and folders.

The following is a very simple example that you can use to test the mechanics of accessing your parquet files successfully prior to applying to your actual data.

If you need an example of creating a user and bucket, you can follow the first two sections (create user and create bucket): https://clickhouse.com/docs/en/guides/sre/configuring-s3-for-clickhouse-use/

I used this sample file: https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet and uploaded it to my test bucket

You can set the policy something like this on the bucket: (adjust as needed, this one is fairly open for privileges but will help in testing. you can narrow your permissions as necessary)

{
"Version": "2012-10-17",
"Id": "Policy123456",
"Statement": [
{
"Sid": "abc123",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::1234567890:user/mars-s3-user"
]
},
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::mars-doc-test",
"arn:aws:s3:::mars-doc-test/*"
]
}
]
}

You can run queries with this type of syntax using the S3 table engine: https://clickhouse.com/docs/en/sql-reference/table-functions/s3/

clickhouse-cloud :)  select count(*) from s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet','ABC123', 'abc+123', 'Parquet', 'first_name String');

SELECT count(*)
FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123', 'abc+123', 'Parquet', 'first_name String')

Query id: fd4f1193-d604-4ac0-9a46-bdd2d5e14727

┌─count()─┐
│ 1000 │
└─────────┘

1 row in set. Elapsed: 1.274 sec. Processed 1.00 thousand rows, 14.64 KB (784.81 rows/s., 11.49 KB/s.)

The data types reference for parquet format are here: https://clickhouse.com/docs/en/interfaces/formats/#data-format-parquet

To bring in the data into a native ClickHouse table:

create the table, something like this (just chose a couple of the columns in the parquet file):

clickhouse-cloud :) CREATE TABLE my_parquet_table (id UInt64, first_name String) ENGINE = MergeTree ORDER BY id;

CREATE TABLE my_parquet_table
(
`id` UInt64,
`first_name` String
)
ENGINE = MergeTree
ORDER BY id

Query id: 412e3994-bf8e-444e-ac43-a7c82642b7da

Ok.

0 rows in set. Elapsed: 0.600 sec.

Select the data from the S3 bucket to insert into the new table:

clickhouse-cloud :) INSERT INTO my_parquet_table (id, first_name) SELECT id, first_name FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123','abc+123', 'Parquet', 'id UInt64, first_name String') FORMAT Parquet

INSERT INTO my_parquet_table (id, first_name) SELECT
id,
first_name
FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123', 'abc+123', 'Parquet', 'id UInt64, first_name String')

Query id: c3cdc871-f338-462d-8797-6751b45a0b58

Ok.

0 rows in set. Elapsed: 1.220 sec. Processed 1.00 thousand rows, 22.64 KB (819.61 rows/s., 18.56 KB/s.)

Verify the import:

clickhouse-cloud :) SELECT * FROM my_parquet_table LIMIT 10;

SELECT *
FROM my_parquet_table
LIMIT 10

Query id: 1ccf59dd-d804-46a9-aadd-ed5c57b9e1a0

┌─id─┬─first_name─┐
│ 1 │ Amanda │
│ 2 │ Albert │
│ 3 │ Evelyn │
│ 4 │ Denise │
│ 5 │ Carlos │
│ 6 │ Kathryn │
│ 7 │ Samuel │
│ 8 │ Harry │
│ 9 │ Jose │
│ 10 │ Emily │
└────┴────────────┘

When you are ready to import your real data, you can use some special syntax like wildcards and ranges to specify your folders, subfolders and files in your bucket. I'd recommend to filter a few directories and files to test the import, maybe a certain year, a couple months and some date range to test first.

besides the path options here, newly released is syntax ** which specifies all subdirectories recursively. https://clickhouse.com/docs/en/sql-reference/table-functions/s3/

For example, assuming the paths and bucket structure is something like this: https://your_s3_bucket.s3.amazonaws.com/<your_folder>/<year>/<month>/<day>/<filename>.parquet https://mars-doc-test.s3.amazonaws.com/system_logs/2022/11/01/my-app-logs-0001.parquet

This would get all files for 1st day of every month in 2021-2022 https://mars-doc-test.s3.amazonaws.com/system_logs/{2021-2022}/**/01/*.parquet

· 3 min read

You can use clickhouse-local to convert files between any of the input and output formats that ClickHouse supports (which is over 70 different formats!). In this article, we are convert a Parquet file in S3 into a CSV and JSON file.

Let's start at the beginning. ClickHouse has a collection of table functions that read from files, databases and other resoures and converts the data to a table. To demonstrate, suppose we have a Parquet file in S3. We will use the s3 table function to read it (ClickHouse knows it's a Parquet file based on the filename).

But first, let's download the clickhouse binary:

curl https://clickhouse.com/ | sh

Accessing the data using a table function

Let's verify we can read the file by using DESCRIBE on the resulting table that the s3 table function creates:

./clickhouse local -q "DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')"

This particular file contains home prices of properties sold in the United Kingdom. The response looks like:

price   Nullable(Int64)
date Nullable(UInt16)
postcode1 Nullable(String)
postcode2 Nullable(String)
type Nullable(String)
is_new Nullable(UInt8)
duration Nullable(String)
addr1 Nullable(String)
addr2 Nullable(String)
street Nullable(String)
locality Nullable(String)
town Nullable(String)
district Nullable(String)
county Nullable(String)

You can run any query you want on the data. For example, let's see which towns have the highest average price of homes:

./clickhouse local -q "SELECT
town,
avg(price) AS avg_price
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
GROUP BY town
ORDER BY avg_price DESC
LIMIT 10"

The response looks like:

GATWICK 16818750
CHALFONT ST GILES 938090.0985915493
VIRGINIA WATER 789301.1320224719
COBHAM 699874.7111622555
BEACONSFIELD 677247.5483146068
ESHER 616004.6888297872
KESTON 607585.8597560975
GERRARDS CROSS 566330.2959086584
ASCOT 551491.2975753123
WEYBRIDGE 548974.828692494

Convert the Parquet file to a CSV

You can send the result of any SQL query to a file. Let's grab all the columns from our Parquet file in S3 and send the output to a new CSV file. Because the output file ends in .csv, ClickHouse knows to use the CSV output format:

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
INTO OUTFILE 'house_prices.csv'"

Let's verify it worked:

$ tail house_prices.csv
70000,10508,"YO8","9XN","detached",0,"freehold","7","","POPPY CLOSE","SELBY","SELBY","SELBY","NORTH YORKSHIRE"
130000,14274,"YO8","9XP","detached",0,"freehold","10","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
150000,18180,"YO8","9XP","detached",0,"freehold","11","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
157000,18088,"YO8","9XP","detached",0,"freehold","12","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
134000,17333,"YO8","9XP","semi-detached",0,"freehold","16","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
250000,13405,"YO8","9YA","detached",0,"freehold","6","","YORKDALE COURT","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
59500,11166,"YO8","9YB","semi-detached",0,"freehold","4","","YORKDALE DRIVE","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
142500,17648,"YO8","9YB","semi-detached",0,"freehold","4A","","YORKDALE DRIVE","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
230000,15125,"YO8","9YD","detached",0,"freehold","1","","ONE ACRE GARTH","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
250000,15950,"YO8","9YD","detached",0,"freehold","3","","ONE ACRE GARTH","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"

Convert the Parquet file to a JSON

To convert the Parquet file to JSON, simply change the extension on the output filename:

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
INTO OUTFILE 'house_prices.ndjson'"

Let's verify it worked:

 $ tail house_prices.ndjson
{"price":"70000","date":10508,"postcode1":"YO8","postcode2":"9XN","type":"detached","is_new":0,"duration":"freehold","addr1":"7","addr2":"","street":"POPPY CLOSE","locality":"SELBY","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"130000","date":14274,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"10","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"150000","date":18180,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"11","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"157000","date":18088,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"12","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"134000","date":17333,"postcode1":"YO8","postcode2":"9XP","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"16","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"250000","date":13405,"postcode1":"YO8","postcode2":"9YA","type":"detached","is_new":0,"duration":"freehold","addr1":"6","addr2":"","street":"YORKDALE COURT","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"59500","date":11166,"postcode1":"YO8","postcode2":"9YB","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"4","addr2":"","street":"YORKDALE DRIVE","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"142500","date":17648,"postcode1":"YO8","postcode2":"9YB","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"4A","addr2":"","street":"YORKDALE DRIVE","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"230000","date":15125,"postcode1":"YO8","postcode2":"9YD","type":"detached","is_new":0,"duration":"freehold","addr1":"1","addr2":"","street":"ONE ACRE GARTH","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"250000","date":15950,"postcode1":"YO8","postcode2":"9YD","type":"detached","is_new":0,"duration":"freehold","addr1":"3","addr2":"","street":"ONE ACRE GARTH","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}

Convert CSV to Parquet

It works both ways - we can easily read in the new CSV file and output it into a Parquet file. The local file house_prices.csv can be read in ClickHouse using the file table function, and ClickHouse outputs the file in Parquet format based on the filename ending in .parquet (or we could have added the FORMAT Parquet clause):

./clickhouse local -q "SELECT *
FROM file('house_prices.csv')
INTO OUTFILE 'house_prices.parquet'"

As we mentioned above, you can use any of the ClickHouse input and output formats along with clickhouse local to easily convert files into different formats.