Skip to main content

This Week in Databend #102

Databend is a modern cloud data warehouse, serving your massive-scale analytics needs at low cost and complexity. Open source alternative to Snowflake. Also available in the cloud: https://app.databend.com .

What's On In Databend

Stay connected with the latest news about Databend.

Creating Bloom Indexes for Specified Columns

Creating bloom indexes consumes a significant amount of CPU resources. For wide tables, where only a few columns may require point queries or data ingestion performance is more important, creating bloom indexes for all columns may not be a good idea.

The bloom_index_columns option was introduced to the Databend Fuse Engine, allowing you to specify which columns you want to create bloom indexes for.

To create a table with bloom indexes:

CREATE TABLE table_name (
column_name1 column_type1,
column_name2 column_type2,
...
) ... bloom_index_columns='columnName1[, ...]'.

To create or modify bloom indexes for an existing table:

After modifying the Bloom index options, Databend will not create indexes for existing data. The changes will only affect the subsequent data.
ALTER TABLE <db.table_name> SET OPTIONS(bloom_index_columns='columnName1[, ...]');

To disable the bloom indexing:

ALTER TABLE <db.table_name> SET OPTIONS(bloom_index_columns='');

If you are interested in learning more, please check out the resources listed below.

Databend SQL Conformance

Databend aims to conform to the SQL standard, with particular support for ISO/IEC 9075:2011, also known as SQL:2011. Databend incorporates many features required by the SQL standard, often with slight differences in syntax or function.

We have summarized the level of conformity of Databend to the SQL:2011 standard, hoping it can help you further understand Databend's SQL Conformance.

If you are interested in learning more, please check out the resources listed below.

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Understanding Databend Recluster Pipeline

A well-clustered table may become chaotic in some storage blocks negatively affecting the query performance. For example, the table continues to have DML operations (INSERT / UPDATE / DELETE).

The re-clustering operation does not cluster the table from the ground up. It selects and reorganizes the most chaotic existing storage blocks by calculating based on the clustering algorithm.

The recluster pipeline is as follows:

┌──────────┐     ┌───────────────┐     ┌─────────┐
│FuseSource├────►│CompoundBlockOp├────►│SortMerge├────┐
└──────────┘ └───────────────┘ └─────────┘ │
┌──────────┐ ┌───────────────┐ ┌─────────┐ │ ┌──────────────┐ ┌─────────┐
│FuseSource├────►│CompoundBlockOp├────►│SortMerge├────┤────►│MultiSortMerge├────►│Resize(N)├───┐
└──────────┘ └───────────────┘ └─────────┘ │ └──────────────┘ └─────────┘ │
┌──────────┐ ┌───────────────┐ ┌─────────┐ │ │
│FuseSource├────►│CompoundBlockOp├────►│SortMerge├────┘ │
└──────────┘ └───────────────┘ └─────────┘ │
┌──────────────────────────────────────────────────────────────────────────────────────────────┘
│ ┌──────────────┐
│ ┌───►│SerializeBlock├───┐
│ │ └──────────────┘ │
│ │ ┌──────────────┐ │ ┌─────────┐ ┌────────────────┐ ┌─────────────────┐ ┌──────────┐
└───►│───►│SerializeBlock├───┤───►│Resize(1)├───►│SerializeSegment├────►│TableMutationAggr├────►│CommitSink│
│ └──────────────┘ │ └─────────┘ └────────────────┘ └─────────────────┘ └──────────┘
│ ┌──────────────┐ │
└───►│SerializeBlock├───┘

If you are interested in learning more, please check out the resources listed below:

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Accelerate CTE through Materialization

Inlining Common Table Expression (CTE) is a good idea, but if the CTE is particularly heavy, such as in TPCH Q15, the cost can be prohibitively expensive. In this case, it's better to introduce materialization for the expensive CTE.

--- TPCH Q15
WITH revenue AS
(SELECT l_suppkey AS supplier_no,
sum(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM lineitem
WHERE l_shipdate >= TO_DATE ('1996-01-01')
AND l_shipdate < TO_DATE ('1996-04-01')
GROUP BY l_suppkey)
SELECT s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
FROM supplier,
revenue
WHERE s_suppkey = supplier_no
AND total_revenue =
(SELECT max(total_revenue)
FROM revenue)
ORDER BY s_suppkey;

Issue #12067 | Feature: speed up CTE by materialization

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.14-nightly...v1.2.25-nightly


Contributors

A total of 20 contributors participated

We are very grateful for the outstanding work of the contributors.