73 posts tagged with "weekly"

This Week in Databend #141

Databend is a modern cloud data warehouse, serving your massive-scale analytics needs at low cost and complexity. Open source alternative to Snowflake. Also available in the cloud: https://app.databend.com .

What's New

Stay informed about the latest features of Databend.

Enhancing Full-Text Search

Databend has enhanced the full-text search capabilities by extending and optimizing the match and query functions. This allows users to perform precise searches and retrieve data based on complex requirements.

This Week in Databend #140

What's New

Stay informed about the latest features of Databend.

Supporting `EXECUTE IMMEDIATE`

Databend has now incorporated EXECUTE IMMEDIATE, enabling users to execute complex SQL logic within a single statement for more flexible SQL handling capabilities.

This Week in Databend #139

What's New

Stay informed about the latest features of Databend.

Supporting `CHANGES` Clause

Databend now supports the CHANGES clause, which is used for retrieving metadata about changes across transaction boundaries.

The CHANGES clause allows querying the metadata of changes in tables over specified intervals without the need to create streams with explicit transaction offsets.

This Week in Databend #138

What's New

Stay informed about the latest features of Databend.

Supporting Multi-Table Insert

Databend now supports multi-table inserts, allowing data to be inserted into multiple different tables according to actual needs. This includes support for both unconditional and conditional insert modes.

This Week in Databend #137

What's New

Stay informed about the latest features of Databend.

Adding Support for Query Matching Inverted Index

Inverted indexes are the most commonly used data structures in document retrieval systems, storing the mapping of where a word is located in a document or a set of documents in full text search.

Databend now supports the use of match to match inverted indexes in queries.

This Week in Databend #136

What's New

Stay informed about the latest features of Databend.

Understanding Tasks and Notifications in Databend

Databend now supports a comprehensive mechanism for tasks and notifications.

Tasks are executed according to a schedule or based on a DAG of tasks, executing specified SQL statements. With notification integrations, notifications can be sent to external messaging services.

This Week in Databend #135

What's New

Stay informed about the latest features of Databend.

Supporting Query Queue

Databend now supports the query queue feature, which allows you to set max_running_queries to determine the maximum number of concurrently running queries. If there are more queries, they will be queued until processing resources are freed up.

For example, max_running_queries=3 means that up to 3 queries can run concurrently, and any new queries will be blocked until one of the previous queries is completed.

This Week in Databend #134

What's New

Stay informed about the latest features of Databend.

Supporting Multi-Statement Transactions

Databend now supports basic multi-statement transaction capabilities.

The use cases for multi-statement transactions include:

Performing DML changes on multiple tables within a single transaction.
Executing changes on a single table in multiple stages based on intermediate computations.

This Week in Databend #133

What's New

Stay informed about the latest features of Databend.

Understanding Engine Support for Open Table Formats

Databend supports different types of open table formats through table engines to meet the advanced analytical needs of various technology stack data lake solutions.

Currently, Databend offers support for Apache Iceberg and Delta Lake, two of the most popular open table formats, through its table engines. Refer to the following for usage:

This Week in Databend #132

What's New

Stay informed about the latest features of Databend.

Supporting `CREATE [ OR REPLACE ]`

Databend now offers comprehensive support for the syntactic sugar CREATE [ OR REPLACE ] to replace DROP IF EXISTS ... + CREATE ... use cases.

Objects that currently support this syntactic sugar include: DATABASE, TABLE, VIEW, AGGREGATING INDEX, STREAM, CONNECTION, FUNCTION, FILE FORMAT, and MASKING POLICY.

This Week in Databend #131

What's New

Stay informed about the latest features of Databend.

Supporting GEO Data Types

This week, Databend has added preliminary support for geographic data types to better handle the storage and manipulation of geographic data. This implementation uses EWKB to serialize geographic data and EWKT to display it.

Previously, Databend already supported a large number of geographic data-related functions, and we will continue to advance our support for processing geographic data to provide a rich set of queries and operations for geographic locations.

This Week in Databend #130

What's New

Stay informed about the latest features of Databend.

Supporting `CREATE OR REPLACE DATABASE`

CREATE OR REPLACE DATABASE is syntactic sugar designed to simplify the following SQL statements:

DROP DATABASE IF EXISTS ...
CREATE DATABASE ...

Support for CREATE OR REPLACE TABLE is also actively underway.

This Week in Databend #129

What's New

Stay informed about the latest features of Databend.

Supporting Standard Streams

Standard Streams are capable of capturing all types of data changes, including updates and deletions. Compared to append-only streams, standard streams are more comprehensive and suitable for more complex data change scenarios.

This Week in Databend #128

What's New

Stay informed about the latest features of Databend.

Querying Data on HuggingFace File System with Databend

Hugging Face is currently the most popular AI community. Databend now supports direct queries and analysis of massive data and models stored on the Hugging Face file system.

URI format: hf://{repo_id}/path/to/file, where a repo_id might look like fka/awesome-chatgpt-prompts.

Supported configurations include:

repo_type: The type of HuggingFace repository, default is dataset, available options are dataset, model.
revision: The revision version of HuggingFace, default is main. Can be a branch, tag, or commit in the repository.
token: The API token of HuggingFace.

This Week in Databend #127

What's New

Stay informed about the latest features of Databend.

Understanding Databend's Access Control

The ownership mechanism in Databend has recently covered UDF and stage, and we have also launched introductory documentation on the access control model.

Databend combines the Role-Based Access Control (RBAC) and Discretionary Access Control (DAC) models to implement access control features. When users access data objects in Databend, they must be granted appropriate permissions or roles, or they need to have ownership of the data objects. Data objects refer to various elements, such as databases, tables, views, stages, or UDFs.

This Week in Databend #126

What's New

Stay informed about the latest features of Databend.

New Filter Execution Framework

In the new filter execution framework, Databend introduces a groundbreaking concept, defining it as the "Immutable Index".

🚀 The Immutable Index enables us to avoid generating temporary selection buffer when encountering AND and OR operations. This not only reduces memory fragmentation but It also eliminates the cyclic copying from temporary selection to final selection.

This Week in Databend #125

What's New

Stay informed about the latest features of Databend.

Password Policy

Databend has now preliminary support for custom password policies (PASSWORD POLICY) to meet users' security needs. By setting a password policy, you can implement access denial after attempting wrong passwords and set password complexity requirements.

This Week in Databend #124

What's New

Stay informed about the latest features of Databend.

Compatibility Added for Delta and Iceberg Table Engines

Databend now supports Delta and Iceberg table engines, enabling seamless integration of tables managed across diverse data lakes. This example uses the Delta table engine and a CONNECTION to create a Delta table stored on S3:

create connection my_s3_conn storage_type = 's3' access_key_id ='minioadmin' secret_access_key ='minioadmin' ENDPOINT_URL='http://127.0.0.1:9900';

create table test_delta engine = delta location = 's3://testbucket/admin/data/delta/delta-table/' connection_name = 'my_s3_conn'

Currently, Databend is limited to basic read operations when interacting with tables created using the Delta table engine.

If you would like to learn more, please contact the Databend team or refer to the resources listed below:

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

ClickHouse HTTP Handler Disabled by Default

As the ClickHouse HTTP Handler is now disabled by default in Databend, a new setting, enable_clickhouse_handler, has been introduced to control its activation, with a default value of 0.

If you need to use the ClickHouse HTTP Handler, execute the following command to enable it:

set global enable_clickhouse_handler=1;

PR #13965 | feat: add setting to disable clickhouse handler

Replacing Bincode with Borsh

Due to potential binary inconsistencies with Bincode across different versions and usage scenarios, Databend decided to replace Bincode with Borsh for serializing aggregate function states.

Borsh has a format that is almost identical to the format used by Databend for aggregate states. It provides consistency guarantees, follows safe coding practices, and has clear format specifications and good performance.

PR #13997 | refactor: using borsh instead of bincode for serde agg function state

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for spilling Sort when reaching memory limits.
Added support for internal stream columns.
Added Support for SHOW LOCKS.
Read the documentation Docs | system.streams and Docs | system.stream_status to learn how to use system tables for obtaining stream-related information.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Introducing Password Policy

Databend is introducing password policy in response to user security needs. This will allow the implementation of access denials following incorrect password attempts and the enforcement of password complexity requirements.

CREATE PASSWORD POLICY [ IF NOT EXISTS ] <name>
  [ PASSWORD_MIN_LENGTH = <integer> ]
  [ PASSWORD_MAX_LENGTH = <integer> ]
  [ PASSWORD_MIN_UPPER_CASE_CHARS = <integer> ]
  [ PASSWORD_MIN_LOWER_CASE_CHARS = <integer> ]
  [ PASSWORD_MIN_NUMERIC_CHARS = <integer> ]
  [ PASSWORD_MIN_SPECIAL_CHARS = <integer> ]
  [ PASSWORD_MIN_AGE_DAYS = <integer> ]
  [ PASSWORD_MAX_AGE_DAYS = <integer> ]
  [ PASSWORD_MAX_RETRIES = <integer> ]
  [ PASSWORD_LOCKOUT_TIME_MINS = <integer> ]
  [ PASSWORD_HISTORY = <integer> ]
  [ COMMENT = '<string_literal>' ]

Issue #13994 | feat: PASSWORD POLICY

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.248-nightly...v1.2.258-nightly

This Week in Databend #123

What's New

Stay informed about the latest features of Databend.

MERGE INTO now Shows Statistics

Databend now shows statistics for MERGE INTO, returning the number of affected rows after updates, deletes, and inserts.

🐳 :) create table t1(a int);
🐳 :) create table t2(b int);
🐳 :) insert into t1 values(1),(3);
🐳 :) insert into t2 values(1),(3),(4);
🐳 :) merge into t1 using t2 on t1.a = t2.a when matched and t2.a = 1 then update * when
matched then delete when not matched then insert *;
+-------------+-------------+-------------+
| insert_rows | update_rows | delete_rows |
+-------------+-------------+-------------+
|           1 |           1 |           1 |
+-------------+-------------+-------------+

If you want to learn more, please feel free to contact the Databend team, or check out the resources listed below.

PR #13903 | feat: add merge status

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Adding Custom Clippy Rules with clippy.toml

rust-clippy is an official code linting tool provided by Rust. It uses static analysis to discover issues or non-compliant code.

By configuring the clippy.toml file, you can specify project-specific Clippy rules to enforce consistent code development practices and provide best practice guidelines.

For example, the following configuration prompts developers to use std::sync::LazyLock instead of lazy_static::lazy_static:

disallowed-macros = [
    { path = "lazy_static::lazy_static", reason = "Please use `std::sync::LazyLock` instead." },
]

PR #13926 | chore(query): remove extra error code construction

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Introduced json_path_match function and @?, @@ operators.
Switched to volo thrift as the replacement for upstream thrift, which is no longer actively maintained.
Read the documentation Docs | Tutorial: Dashboarding Covid-19 Data from New York Times to learn how to create and manage visual charts using the Databend Cloud Dashboard.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Adding Support for More Data File Types to `INFER_SCHEMA`

Databend supports the infer_schema table function, which infers the schema of data files for easy data loading and analysis.

See an example:

SELECT * FROM INFER_SCHEMA(location => '@infer_parquet/data_e0fd9cba-f45c-4c43-aa07-d6d87d134378_0_0.parquet');
+-------------+-----------------+----------+----------+
| column_name | type            | nullable | order_id |
+-------------+-----------------+----------+----------+
| number      | BIGINT UNSIGNED |        0 |        0 |
+-------------+-----------------+----------+----------+

Currently, infer_schema only works with Parquet files. We plan to extend its support to include other file types such as CSV and JSON.

Issue #13959 | INFER_SCHEMA supports more file types

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.239-nightly...v1.2.248-nightly

This Week in Databend #122

What's New

Stay informed about the latest features of Databend.

Supporting Chain Function Calls

Databend now offers an experimental SQL dialect to support chain function calls.

To try out this feature, run the command below to enable it first:

set sql_dialect = 'experimental';

Here is a simple example:

🐳 :) with t(f) as (select '11|open|22|ai|33|is nothing without sam'.split('|')
        .array_filter(x -> try_cast(x as int) is not null ).array_apply(x -> x::int ).array_apply(x -> x::int64 ))
        select f.array_reduce('sum'), f.array_reduce('max'), f.array_reduce('min')  from t;
┌──────────────────────────────────────────────────────────────────────────┐
│ array_reduce(f, 'sum') │ array_reduce(f, 'max') │ array_reduce(f, 'min') │
│     Nullable(Int64)    │     Nullable(Int64)    │     Nullable(Int64)    │
├────────────────────────┼────────────────────────┼────────────────────────┤
│                     66 │                     33 │                     11 │
└──────────────────────────────────────────────────────────────────────────┘

If you want to learn more, please feel free to contact the Databend team, or check out the resources listed below.

PR #13799 | feat(query): add chain function support

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Load Balancing with Query ID and Nginx

Databend JDBC carries an HTTP Header named X-DATABEND-QUERY-ID with each request. If you want to distribute multiple requests from the same client to different backend instances, you can use this HTTP Header as the Hash Key for Nginx, to achieve load balancing.

The following Nginx configuration file demonstrates how to implement load balancing based on Query ID for two databend-query instances located at ports 8000 and 8009:

#user  nobody;
worker_processes  1;

#error_log  logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;

#pid        logs/nginx.pid;


events {
    worker_connections  1024;
}


http {
    log_format main '$http_x_databend_query_id "$time_local" $host "$request_method $request_uri $server_protocol" $status $bytes_sent "$http_referer" "$http_user_agent" $remote_port $upstream_addr $scheme $gzip_ratio $request_length $request_time $ssl_protocol "$upstream_response_time"';
    access_log /opt/homebrew/var/log/nginx/access.log main;
    map $http_x_query_id $backend {
        default backend1;
    }
    upstream backend1 {
        hash $http_x_databend_query_id consistent;
        server localhost:8000;
        server localhost:8009;
    }
    server {
        listen 8085;
        location / {
            proxy_pass http://$backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-DATABEND-QUERY-ID $http_x_databend_query_id;
        }
    }
}

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for the QUALIFY clause.
Connection can now be created with role_arn as access credentials.
Added a change in MERGE INTO which separates matched-only and insert-only Pipelines.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Supporting for Calculating JOIN ORDER Using Greedy Algorithm

Databend currently calculates JOIN ORDER using DPhpy. If too much time is spent in this process, it is necessary to switch from dynamic programming algorithm to greedy algorithm to minimize the overall query time.

Issue #13884 | Feature: Support join order greedy algorithm

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.226-nightly...v1.2.239-nightly

This Week in Databend #121

What's New

Stay informed about the latest features of Databend.

Supporting Append-only Stream

A Databend stream currently supports Append-only mode, allowing it to capture data insertions. This encompasses potential data deletions or updates for the recently added rows. However, the stream consistently reflects the final changes to the table since its creation. For example, if a row is added and subsequently updated with new values, the stream will accurately describe the changes based on the updated values.

CREATE STREAM [IF NOT EXISTS] [<database>.]<stream> 
  ON TABLE [<database>.]<table> 
    [<stream_point>] [COMMENT = '<string_literal>']

Here is a simple example:

databend> create table t(a int);
Query OK, 0 rows affected (0.07 sec)

databend> insert into t values(1);
Query OK, 1 row affected (0.06 sec)

databend> alter table t set options(change_tracking=true);
Query OK, 0 rows affected (0.06 sec)

databend> create stream s on table t;
Query OK, 0 rows affected (0.04 sec)

databend> insert into t values(2);
Query OK, 1 row affected (0.07 sec)

databend> select * from s;
+------+
| a    |
+------+
|    2 |
+------+
1 row in set (0.07 sec)
Read 2 rows, 59.00 B in 0.032 sec., 61.82 rows/sec., 1.78 KiB/sec.

If you want to learn more, please feel free to contact the Databend team, or check out the resources listed below.

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Switching from parquet2 to parquet-rs

Due to issues with bug fixes and long-term maintenance for parquet2, Databend has switched the default Parquet Reader to arrow-rs/parquet.

This change, effective from v1.2.223-nightly, helps improve the stability of Parquet file reading and writing. Currently, Databend's Fuse engine still uses parquet2.

PR #13754 | feat: change default value of use_parquet2 to 0

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Introduced a new time function: current_timestamp.
Added support for new array functions: array_to_string and array_flatten.
Introduced a new table function called stream_status and enhanced the functionality of the existing fuse_encoding table function.
Added support for User-Defined Function (UDF) authorization management.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Supporting Iceberg and Delta Tables with Table Engine

Iceberg tables currently work with Iceberg directories only. By separating tables from directories, we could experiment with creating tables compatible with Iceberg or Delta engines using the specified syntax:

create table <ident> from <location> engine=iceberg|delta [connection_name="my_connection", <other args>]

Issue #13787 | support iceberg and delta with table engine

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.216-nightly...v1.2.226-nightly

This Week in Databend #120

What's New

Stay informed about the latest features of Databend.

Managing Storage Access with CONNECTION

Databend now offers preliminary support for CONNECTION, which stores credentials and access management entities for accessing external storage services, providing a convenient and consistent access control experience.

The following example creates a connection to store access credentials for S3 object storage and creates an external stage using that connection:

CREATE CONNECTION my_conn storage_type = 's3' access_key_id ='minioadmin' secret_access_key ='minioadmin' endpoint_url='${STORAGE_S3_ENDPOINT_URL}';
CREATE STAGE my_stage url= 's3://testbucket/admin/tempdata/' connection = (connection_name='my_conn');

If you want to learn more, please feel free to contact the Databend team, or check out the resources listed below.

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

opensrv v0.5.0 Released

OpenSRV v0.5.0 was released this week, primarily focusing on updating upstream dependencies to address security audit issues. The code quality has also been further improved according to clippy's suggestions.

If you need MySQL or ClickHouse server protocol support for database projects, consider opensrv.

opensrv v0.5.0

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Introduced @> and <@ JSON operators.
Added support for evaluating UDF calls using the Async Transform Pipeline.
Added support for LATERAL FLATTEN.
Enhanced RBAC with the support for secondary roles.
The Databend documentation repository has been relocated to databend-docs.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Summary Support for Dry Run of VACCUUM TABLE

The VACUUM TABLE command helps optimize system performance by freeing up storage space through the permanent removal of historical data files from a table.

Databend provides a DRY RUN option for VACUUM TABLE, allowing you to list the candidate files to be removed before actually removing any data files.

We plan to enhance the DRY RUN option by adding a summary, allowing you to check the size of data files to be freed before vacuuming.

Issue #13674 | feat: vacuum table dry run summary

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.206-nightly...v1.2.216-nightly

This Week in Databend #119

What's New

Stay informed about the latest features of Databend.

User Case: Databend Cloud Helped AIGC Startup Reduce Cost

Databend Cloud is an easy-to-use, low-cost, high-performance next-gen big data analytics platform built on the open-source cloud-native data warehouse project Databend.

By using Databend Cloud for analytics, the startup reduced their user behavior log analysis costs to 1% of their previous solution.

Databend Cloud AIGC Solution

Databend Cloud did an excellent job assisting with user behavior analysis and providing insights into root causes.

Monthly analysis and dashboard costs dropped by 100x.
Raw data is stored in the customer's own object storage for further compliance check.
Databend Cloud's excellent capacities for handling semi-structured data and complex queries perfectly matched the customer's business requirements and facilitated the migration of their analytics jobs to Databend Cloud.

If you want to learn more, please feel free to contact the Databend team, or check out the resources listed below.

Blog | Databend Cloud Helped AIGC Startup Reduce Cost

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

databend-udf Released, Databend Cloud Officially Supports External Functions

Databend supports creating UDFs using programming languages you are familiar with to customize data processing workflows and integrate into data analysis workflows.

Currently, Databend has packaged and released the Python SDK databend-udf to PyPI, for users to easily define and implement Python-based UDFs. We will also provide SDK support in more languages in the future, stay tuned.

In addition, Databend Cloud now also supports external UDFs. If you want to try it out, please contact us by creating a ticket and provide the UDF server address.

For more information, please check out the resources below.

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Introduced translate function.
Introduced a new system table named task_history .
MERGE INTO now supports using table directly as a data source.
SHOW DROP TABLE now supports filtering and LIMIT options.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

New Function: CURRENT_TIMESTAMP

The CURRENT_TIMESTAMP function returns the current date and time, with an optional precision parameter to control the number of digits after the decimal point in the output.

SELECT CURRENT_TIMESTAMP(2);

+------------------------+
| CURRENT_TIMESTAMP(2)   |
|------------------------|
| 2023-11-12 10:23:18.35 |
+------------------------+

SELECT CURRENT_TIMESTAMP();

+-------------------------------+
| CURRENT_TIMESTAMP             |
|-------------------------------|
| 2023-11-12 10:25:27.358000000 |
+-------------------------------+

Issue #13689 | Feature: support function: CURRENT_TIMESTAMP

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.193-nightly...v1.2.206-nightly

This Week in Databend #118

What's New

Stay informed about the latest features of Databend.

MERGE INTO now Supports Distributed Execution

The MERGE INTO statement performs insert, update, or delete operations on rows in the target table based on specified conditions and matching criteria using data from the specified source.

MERGE INTO has been enhanced to support distributed execution. The physical plan for distributed execution is shown as follows.

Image: Databend Merge Into

Please contact the Databend team if you would like to learn more, or check out the resources below:

PR #13151 | feat: support distributed merge_into

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Incorporate Data Version Control into Your Analytics Workflow with LakeFS

LakeFS provides a solution for sharing and collaborating on data processing by empowering object storage with Git-like operations. It offers a versioning scheme that provides logically consistent views of data, meaningful branch names, and commit messages for modern data workflows, as well as integrated solutions for data and document unification.

Databend not only provides a distributed, elastic, and scalable query engine with high performance but also supports multi-version storage similar to Git. This allows for easy querying, cloning, and restoring of data at any point in time.

By combining LakeFS's data versioning capabilities with Databend's Git-like data management capabilities, data integrity can be ensured, and efficient and highly available consistent views can be provided for data analysis workflows.

If you are interested in learning more, please check out the resources below:

Blog | Incorporate Data Version Control into Your Analytics Workflow

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Introduced new JSON operators: ?, ?|, and ?&.
Added support for permissions management on STAGE and UDF.
Introduced new table function FLATTEN.
Added support for creating external tables using URLs: CREATE STAGE test_stage [URL=] 's3://load/files/'.
Read the documentation Example 2: Attaching Table in READ_ONLY Mode to learn how to use ATTACH TABLE in read-only mode.
Read the documentation Conversion Functions to learn about type conversion functions and rules in Databend.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Support for Reading Delta Lake

After successfully incorporating support for mounting and reading data from Iceberg tables through the Catalog, Databend is gearing up for adding compatibility with Delta Lake.

Delta Lake, initiated and open-sourced by Databricks, is a storage format that extends Parquet data files with file-based transaction logs, enabling ACID transactions and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs and tightly integrated with structured streaming, allowing you to easily use a single data copy in both batch and streaming operations while providing large-scale incremental processing.

Issue #13429 | add read only support for Delta table

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@guojidan added new filter parameters for the SHOW commands in #13510.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.183-nightly...v1.2.193-nightly

This Week in Databend #117

What's New

Stay informed about the latest features of Databend.

ATTACH TABLE READ_ONLY

Maintaining extensive resources for just a few large-scale queries can lead to high maintenance costs for enterprises. How about we combine local and cloud workloads, harness Databend Cloud's instant scaling computing capabilities, all while avoiding the additional storage costs associated with cloud usage?

Databend recently introduced the READ_ONLY argument to ATTACH TABLE**, allowing users with locally deployed Databend to share table data with Databend Cloud in read-only mode:

All table data is shared in read-only mode.
Changes made to a table, including data and schema changes, will be performed and tracked on the user's local platform.

ATTACH TABLE <target_table_name> '<source-table-data-URI>' CONNECTION=(<connection_parameters>) READ_ONLY

Adding the READ_ONLY parameter to the ATTACH TABLE statement will map the table data and schema using the latest snapshot hint file stored at the specified source table data URI, and the target table can only be accessed in a read-only manner.

Please contact the Databend team if you would like to learn more, or check out the resources below:

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Replacing Jaeger Agent with OpenTelemetry Collector

Jaeger is an open source, end-to-end distributed tracing tool originating from Uber that helps monitor and troubleshoot microservices-based applications.

Due to port changes in the latest Jaeger All-in-One Docker image, the original distributed tracing capabilities became unavailable. Upstream is also considering deprecating opentelemetry-jaeger and encouraging users to migrate to opentelemetry-otlp, see Consider deprecating opentelemetry-jaeger #995.

We refactored the original trace export mechanism using opentelemetry-otlp while fixing the availability of distributed tracing capabilities. The corresponding environment variable also changes to DATABEND_OTEL_EXPORTER_OTLP_ENDPOINT.

If you are interested in learning more, please check out the resources below:

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added new aggregate function quantile_tdigest_weighted.
Recluster supports distributed execution.
DELETE supports table aliases.
SQLsmith supports REPLACE INTO and HINT. Read the latest blog Leveraging Randomized Testing with SQLsmith to Uncover Bugs to learn about Databend's SQLsmith design.
Added JSON operators ->, ->>, #> and #>>. See Docs | JSON Operators for details.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Support for Defining CONNECTION

To more elegantly handle the mapping between role access control models in compute cluster instances and storage services, Databend plans to provide the ability to define and manage CONNECTIONs.

CONNECTION can hold identity credentials and access management (IAM) entities generated by external storage services, providing convenient and consistent access control experiences. Introducing CONNECTION can also simplify complex statements that includes a connection to object storages, such as creating an external stage, etc.

CREATE CONNECTION [IF NOT EXISTS]
    ENDPOINT_URL = 'http://localhost:9000',
    ACCESS_KEY_ID = 'ROOTUSER',
    SECRET_ACCESS_KEY = 'CHANGEME123'
...

Issue #13489 | feat: CREATE CONNECTION

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.174-nightly...v1.2.183-nightly

This Week in Databend #116

What's New

Stay informed about the latest features of Databend.

Feature Preview: Managing Background Tasks with SQL

Previously, Databend introduced Background Service to allow Databend to run one-off background jobs or cron jobs in daemon mode. This simplifies the complexity of managing data maintenance tasks.

To make it easy to create, manage and maintain background tasks, Databend recently added support for a series of SQL statements including CREATE TASK, ALTER TASK, SHOW TASK, etc.

For example, the following SQL statement instructs Databend to execute a task called MyTask1 every day at 6am PST to insert (1, 2) and (3, 4) into table t:

CREATE TASK IF NOT EXISTS MyTask1 SCHEDULE = USING CRON '0 6 * * *' 'America/Los_Angeles' COMMENT = 'serverless + cron' AS insert into t (c1, c2) values (1, 2), (3, 4) 

The Background Service requires Databend Enterprise Edition. Please contact the Databend team for upgrade information.

If you are interested in learning more, please check out the resources below:

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Understanding Configuration Mapping in Databend

For a complex database service like Databend, supporting many configurable options helps developers manage and tune the system. A recently published blog explains the mappings between command line options, environment variables, and config files by reading the Databend code.

Databend currently supports three configuration methods in decreasing order of priority:

Command line options: Override configurations set elsewhere.
Environment Variables: Provide configuration flexibility for Kubernetes clusters.
Config files: A recommended approach for recording and managing configurations.

Please note that Databend also supports common environment variables from storage services such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY through the rollback mechanism in opendal. You can directly utilize these variables when working with Databend.

If you are interested in learning more, please check out the resources below:

Blog | Navigating Databend's Configuration Maze: A Guide for Developers and Operators

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added a new table function fuse_encoding .
Added new string functions split and split_part.
SQLsmith now supports for MERGE INTO.
databend-metactl supports viewing cluster status.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Support More Type Comparisons in z3 Solver

Previously, Databend introduced z3.rs to solve constraints by finding solutions that satisfy given conditions. Z3 from Microsoft Research is a theorem prover commonly used to solve SMT problems. However, currently Databend's z3 solving only supports integer comparisons and needs to cover more types.

For example, after string comparison is supported, t3.f >= '2000-08-23' in select * from t1 left join t3 on t3.e = t1.a where t3.f >= '2000-08-23'; can be pushed down to table t3.

Issue #13236 | Feature: z3 supports more type comparison

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.160-nightly...v1.2.174-nightly

This Week in Databend #115

What's New

Stay informed about the latest features of Databend.

AGGREGATING INDEX

Databend has recently introduced AGGREGATING INDEX to improve query performance, especially for aggregation queries involving MIN, MAX, and SUM. Aggregating Index leverage techniques like precomputing and storing query results separately to eliminate the need to scan the entire table, thus speeding up data retrieval.

In addition, this feature includes a refresh mechanism that allows you to update and persist the latest query results on demand, ensuring data accuracy and reliability by refreshing the results when needed. Databend recommends manually refreshing the aggregating index before executing relevant queries to retrieve the most up-to-date data; Databend Cloud supports auto-refreshing of aggregating index.

-- Create an aggregating index
CREATE AGGREGATING INDEX my_agg_index AS SELECT MIN(a), MAX(c) FROM agg;

-- Refresh the aggregating index
REFRESH AGGREGATING INDEX my_agg_index;

The AGGREGATING INDEX requires Databend Enterprise Edition. Please contact the Databend team for upgrade information.

If you are interested in learning more, please check out the resources below:

Docs | AGGREGATING INDEX

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Visualizing the MERGE INTO Pipeline

Databend recently implemented the MERGE INTO statement to provide more comprehensive data management capabilities. For those interested in how it works under the hood, check out the pipeline visualization of MERGE INTO below.

                                                                                                                                               +-------------------+
                                                                                        +-----------------------------+    output_port_row_id  |                   |
                                           +-----------------------+     Matched        |                             +------------------------>-ResizeProcessor(1)+---------------+
                                           |                       +---+--------------->|    MatchedSplitProcessor    |                        |                   |               |
                                           |                       |   |                |                             +----------+             +-------------------+               |
        +----------------------+           |                       +---+                +-----------------------------+          |                                                 |
        |   MergeIntoSource    +---------->|MergeIntoSplitProcessor|                                                       output_port_updated                                     |
        +----------------------+           |                       +---+                +-----------------------------+          |             +-------------------+               |
                                           |                       |   | NotMatched     |                             |          |             |                   |               |
                                           |                       +---+--------------->| MergeIntoNotMatchedProcessor+----------+------------->-ResizeProcessor(1)+-----------+   |
                                           +-----------------------+                    |                             |                        |                   |           |   |
                                                                                        +-----------------------------+                        +-------------------+           |   |
                                                                                                                                                                               |   |
                                                                                                                                                                               |   |
                                                                                                                                                                               |   |
                                                                                                                                                                               |   |
                                                                                                                                                                               |   |
                                                                              +-------------------------------------------------+                                              |   |
                                                                              |                                                 |                                              |   |
                                                                              |                                                 |                                              |   |
          +--------------------------+        +-------------------------+     |         ++---------------------------+          |     +--------------------------------------+ |   |
+---------+ TransformSerializeSegment<--------+ TransformSerializeBlock <-----+---------+|TransformAddComputedColumns|<---------+-----+TransformResortAddOnWithoutSourceSchema<-+   |
|         +--------------------------+        +-------------------------+     |         ++---------------------------+          |     +--------------------------------------+     |
|                                                                             |                                                 |                                                  |
|                                                                             |                                                 |                                                  |
|                                                                             |                                                 |                                                  |
|                                                                             |                                                 |                                                  |
|          +---------------+                 +------------------------------+ |               ++---------------+                |               +---------------+                  |
+----------+ TransformDummy|<----------------+ AsyncAccumulatingTransformer <-+---------------+|TransformDummy |<---------------+---------------+TransformDummy <------------------+
|          +---------------+                 +------------------------------+ |               ++---------------+                |               +---------------+
|                                                                             |                                                 |
|                                                                             |  If it includes 'computed', this section        |
|                                                                             |  of code will be executed, otherwise it won't   |
|                                                                             |                                                 |
|                                                                            -+-------------------------------------------------+
|
|
|
|        +------------------+            +-----------------------+        +-----------+
+------->|ResizeProcessor(1)+----------->|TableMutationAggregator+------->|CommitSink |
         +------------------+            +-----------------------+        +-----------+

If you are interested in learning more, please check out the resources below:

PR #13036 | chore: draw merge into pipeline

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

MERGE INTO now supports for automatic recluster and compaction.
SQLsmith now covers DELETE, UPDATE, ALTER TABLE, and CAST.
Added semi-structured data processing functions json_each and json_array_elements.
Added time and date functions to_week_of_year and date_part. See Docs | Date & Time Functions for details.
Read Sending IoT Stream Data to Databend with LF Edge eKuiper to learn how Databend integrates with eKuiper to meet growing IoT data analytics demands.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Enhancing Role-Based Access Control

Currently, Databend's access control system consists of Role-Based Access Control (RBAC) and Discretionary Access Control (DAC). However, there is still room for improvement to make it more comprehensive.

We plan to support more privilege checks on uncovered resources and provide privilege definition guidance in 2023 Q4.

Issue #13207 | Tracking: RBAC improvement plan in 2023 Q4

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.147-nightly...v1.2.160-nightly

This Week in Databend #114

What's On In Databend

Stay connected with the latest news about Databend.

MERGE

The MERGE statement performs INSERT, UPDATE, or DELETE operations on rows within a target table, all in accordance with conditions and matching criteria specified within the statement, using data from a specified source.

A MERGE statement usually contains a MATCHED and / or a NOT MATCHED clause, instructing Databend on how to handle matched and unmatched scenarios. For a MATCHED clause, you have the option to choose between performing an UPDATE or DELETE operation on the target table. Conversely, in the case of a NOT MATCHED clause, the available choice is INSERT.

-- Merge data into 'salaries' based on employee details from 'employees'
MERGE INTO salaries
USING (SELECT * FROM employees)
ON salaries.employee_id = employees.employee_id
WHEN MATCHED AND employees.department = 'HR' THEN
    UPDATE SET
        salaries.salary = salaries.salary + 1000.00
WHEN MATCHED THEN
    UPDATE SET
        salaries.salary = salaries.salary + 500.00
WHEN NOT MATCHED THEN
    INSERT (employee_id, salary)
    VALUES (employees.employee_id, 55000.00);

MERGE is currently in an experimental state. Before using the MERGE command, you need to run SET enable_experimental_merge_into = 1; to enable the feature.

If you are interested in learning more, please check out the resources below:

Docs | MERGE

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Introducing `DATABEND_DATA_PATH` to Python Binding and Local Mode

Databend's local mode now allows users to control the storage location of metadata and data files by setting the DATABEND_DATA_PATH environment variable.

DATABEND_DATA_PATH=/tmp/data/ databend-query local -q "create table abc(a int); insert into abc values(3);"

DATABEND_DATA_PATH also works with Databend Python Binding, but it must be defined before using databend.

import os

os.environ["DATABEND_DATA_PATH"] = "/tmp/def/"

from databend import SessionContext

If you are interested in learning more, please check out the resources below:

PR #13089 | feat: support DATABEND_DATA_PATH in bendpy and databend-local

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Improved Hash Join, resulting in a 10% performance improvement in certain scenarios.
Enhanced distributed execution of MERGE.
Improved CI by using quickinstall to install relevant binary tools and executing unit tests with nextest.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Delete Files When Dropping Internal Stage

In Databend, an Internal Stage stores data files in the storage backend specified in databend-query.toml.

Considering that users will not be able to access the staged files after dropping an Internal Stage, it is necessary to remove the staged files when dropping the Internal Stage.

Issue #12986 | remove files at the same time of drop internal stage

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.137-nightly...v1.2.147-nightly

This Week in Databend #113

What's On In Databend

Stay connected with the latest news about Databend.

Loading to Table with Extra Columns

By default, COPY INTO loads data into a table by matching the order of fields in the file to the corresponding columns in the table. It's essential to ensure that the data aligns correctly between the file and the table.

load extra

If your table has more columns than the file, you can specify the columns into which you want to load data.

When working with CSV format, if your table has more columns than the file and the additional columns are at the end of the table, you can load data using the FILE_FORMAT option ERROR_ON_COLUMN_COUNT_MISMATCH.

If you are interested in learning more, please check out the resources below:

Docs | Example 5: Loading to Table with Extra Columns

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Introducing Read Policies to Parquet Reader

There is a drawback of using the arrow-rs APIs: When we try to prefetch data for prewhere and topk push downs, we can't reuse the deserialized blocks.

In order to improve the logic of row group reading and reuse the prefetched data blocks at the final output stage, we have done a lot of refactoring and introduced some read policies.

NoPrefetchPoliy

No prefetch stage at all. Databend reads, deserializes, and outputs data blocks you need directly.

PredicateAndTopkPolicy

Databend prefetches columns needed by prewhere and topk at the prefetch stage. It deserializes them into a DataBlock and evaluates them into RowSelection. It then slices the DataBlock by batch size and stores the resulting VecDeque in memory.

Databend reads the remaining columns specified by RowSelection at the final stage, and it outputs DataBlocks in batches. Then, it merges the prefetched blocks and projects the resulting blocks according to the output_schema.

TopkOnlyPolicy

It's similar to the PredicateAndTopkPolicy, but Databend only evaluates the topk at the prefech stage.

If you are interested in learning more, please check out the resources below:

PR #13020 | refactor: introduce ReadPolicy to parquet reader

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added spill info to query log.
Added support for unloading data into a compressed file with COPY INTO.
Introduced the GET /v1/background/:tenant/background_tasks HTTP API for querying background tasks.
Read Example 4: Filtering Files with Pattern to understand how to use Pattern to filter files.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Fixing issues detected by SQLsmith

In the last month, SQLsmith has discovered around 40 bugs in Databend. Databend Labs is actively working to fix these issues and improve system stability, even in uncommon scenarios. Your involvement in this effort, which may include tasks like type conversion or handling special values, is encouraged and can be facilitated by referring to past fixes.

Issues | Found by SQLsmith

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@zenus fixed an issue where schema mismatch was not detected during the execution of COPY INTO in #13010.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.128-nightly...v1.2.137-nightly

This Week in Databend #112

What's On In Databend

Stay connected with the latest news about Databend.

Understanding User-Defined Function

User-Defined Functions (UDFs) enable you to define their own custom operations to process data within Databend, enabling better data handling, task execution, and the creation of more efficient data workflows.

User-Defined Functions (UDFs) are typically written using lambda expressions or implemented via a UDF server with programming languages such as Python and are executed as part of Databend's query processing pipeline. Advantages of using UDFs include:

Customized Data Transformations.
Performance Optimization.
Code Reusability.

If you are interested in learning more, please check out the resources listed below.

Docs | User-Defined Function

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Refactoring Databend Metrics Component with Prometheus

metrics-rs is a general abstraction layer across different metrics solutions like statsd, prometheus, new-relic, etc. However, metrics-rs is not well-suited for measuring metrics about histogram.

Nowadays prometheus is the de facto standard in the metrics area. Using a raw Prometheus client without pushing metrics to other systems like StatsD offers several benefits:

Improved Performance: By adhering to best practices for metric handling, memory allocation for metrics can be optimized to achieve O(1) complexity, eliminating the need for local buffered queues.
Minimizing Abstraction Layers: Streamlining metric handling helps reduce unnecessary abstraction layers, simplifying the code path for improved clarity and comprehension.
Improved Coding Standards: Addressing the existing variance in metric practices across different modules by adopting Prometheus community standards can help establish a more consistent and cohesive approach to metrics within our codebase.

Now, Databend's observability metrics have been fully migrated to Prometheus implementation. This brings you a more comprehensive and reliable observability experience while keeping the original metrics almost unchanged.

If you are interested in learning more, please check out the resources listed below:

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for COMPACT distributed execution.
Added the json_path_exists function.
Added the recluster_block_size setting to control the block size during re-clustering.
Added support for conversion from DECIMAL type to INT type.
Added support for an inverted filter to optimize filter execution, leading to a performance boost of up to 4 times in certain scenarios.
SQLSmith testing now supports generating table functions, window functions, subqueries, and the WITH clause.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Implementing the `GREATEST` Function

The GREATEST function returns the maximum value in a list of expressions.

Syntax:

GREATEST( <expr1> [ , <expr2> ... ] )

Example:

SELECT id, name, category, price, rating,
       CASE
         WHEN rating = 1 THEN 0.02
         WHEN rating = 2 THEN 0.04
         WHEN rating = 3 THEN 0.06
         WHEN rating = 4 THEN 0.08
         ELSE 0.1
       END AS increase_percentage_based_on_rating,
       rank() OVER (PARTITION BY category ORDER BY rating) AS rating_rank,
       CASE
         WHEN rating_rank = 1 THEN 0.2
         WHEN rating_rank = 2 THEN 0.1
         ELSE 0
       END AS increase_percentage_based_on_rank,
       GREATEST(increase_percentage_based_on_rating, 
                increase_percentage_based_on_rank) AS final_increase_percentage,
       CAST(price * (1 + final_increase_percentage) AS DECIMAL(10, 2))
         AS adjusted_price
FROM products

Issue #12944 | feat: GREATEST function

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.116-nightly...v1.2.128-nightly

This Week in Databend #111

What's On In Databend

Stay connected with the latest news about Databend.

"SHARE ENDPOINT" is a concept in Databend data sharing, as it defines the endpoint and tenant name of a data sharing instance. Consumers of the shared data must define a "SHARE ENDPOINT" in order to access the shared data.

For example, if Tenant A shares data with Tenant B, then Tenant B needs to create a corresponding SHARE ENDPOINT so that the instance where Tenant B is located can access the shared data.

CREATE SHARE ENDPOINT IF NOT EXISTS from_TenantA
    URL = '<share_endpoint_url>'
    TENANT = A
    COMMENT = 'Share endpoint to access data from Tenant A';

If you are interested in learning more, please check out the resources listed below.

Docs | SHARE ENDPOINT

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Using Python to create UDF

Databend now supports UDF Server, which allows users to implement more flexible and useful UDFs. By combining your preferred programming language with the Apache Arrow Flight API, you can interact with Databend and expand the capabilities of the database for richer and more efficient data workflows.

To enable this feature, you need to use version v1.2.116-nightly or later and enable UDF Server support in the configuration of databend-query:

[query]
...
enable_udf_server = true
# use your udf server address here
udf_server_allow_list = ['http://0.0.0.0:8815']
...

The following code snippet demonstrates how to create a UDF Server in Python for calculating the greatest common divisor.

from udf import *

# Define a function that accepts nullable values, and set skip_null to True to enable it returns NULL if any argument is NULL.
@udf(
    input_types=["INT", "INT"],
    result_type="INT",
    skip_null=True,
)
def gcd(x: int, y: int) -> int:
    while y != 0:
        (x, y) = (y, x % y)
    return x

if __name__ == '__main__':
    # create a UDF server listening at '0.0.0.0:8815'
    server = UdfServer("0.0.0.0:8815")
    # add defined functions
    server.add_function(gcd)
    # start the UDF server
    server.serve()

In Databend, you can register the UDF using the following SQL statement.

CREATE FUNCTION gcd (INT, INT) RETURNS INT LANGUAGE python HANDLER = 'gcd' ADDRESS = 'http://0.0.0.0:8815';

This feature is currently in the preview stage. We offer a Python SDK file for demonstration and testing purposes. More SDKs will be released in the future. Feel free to join the UDF ecosystem.

If you are interested in learning more, please check out the resources listed below:

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for infer filter.
Added support for idempotent deletion.
Added capability of generating aggregate functions, scalar functions, and expressions for SQLSmith testing.
Read the documentation Docs | INSERT INTO and Docs | REPLACE INTO to learn how to use SQL statements to insert data from Stage into a table.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Caching Staged Parquet Files

Databend offers different types of caching to enhance query performance. These include Query Cache, File Metadata Cache, and Data Cache.

Currently, there exists an insufficient cache support mechanism for files residing in a stage. Introducing metadata caching or object caching for Parquet files within a stage could significantly boost query performance when working with external data.

select * from 's3://aa/bb/cc/' (pattern => '.*.parquet')

Issue #12762 | feat: add object cache for stage parquet file

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@nagarajatantry fixed a broken link in the documentation in #12836.
@ct20000901 fixed an issue where array_sort couldn't correctly handle empty arrays and NULL in #12827.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.109-nightly...v1.2.116-nightly

This Week in Databend #110

What's On In Databend

Stay connected with the latest news about Databend.

Using BendSQL to Manage Files in Stage

Databend recommends uploading files to a stage withPRESIGN or PUT / GET command. These methods enable direct file transfer between the client and your storage, eliminating intermediaries and resulting in cost savings by reducing traffic between Databend and your storage.

If you're using BendSQL to manage files in a stage, you can use the PUT command for uploading files and the GET command for downloading files.

root@localhost:8000/default> PUT fs:///books.parquet @~

PUT fs:///books.parquet @~

┌───────────────────────────────────────────────┐
│                 file                │  status │
│                String               │  String │
├─────────────────────────────────────┼─────────┤
│ /books.parquet                       │ SUCCESS │
└───────────────────────────────────────────────┘

GET @~/ fs:///fromStage/

┌─────────────────────────────────────────────────────────┐
│                      file                     │  status │
│                     String                    │  String │
├───────────────────────────────────────────────┼─────────┤
│ /fromStage/books.parquet                      │ SUCCESS │
└─────────────────────────────────────────────────────────┘

If you are interested in learning more, please check out the resources listed below.

Docs | Staging Files

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Using Databend Python Binding in Jupyter Notebook

Databend provides Python Binding, allowing you to access Databend's features without deploying a Databend instance. Its DataFrames can also be easily converted to the Polars and Pandas formats, facilitating integrations with common data science tools.

To use the Databend Python Binding, simply run the following command to install the library:

pip install databend

The code below demonstrates how to use Databend Python Binding in Jupyter Notebook and plot a bar chart using matplotlib.

# Create a table in Databend
ctx.sql("CREATE TABLE IF NOT EXISTS user (created_at Date, count Int32)")

# Create a table in Databend
ctx.sql("CREATE TABLE IF NOT EXISTS user (created_at Date, count Int32)")

# Insert multiple rows into the table
ctx.sql("INSERT INTO user VALUES ('2022-04-01', 5), ('2022-04-01', 3), ('2022-04-03', 4), ('2022-04-03', 1), ('2022-04-04', 10)")

# Execute a query
result = ctx.sql("SELECT created_at as date, count(*) as count FROM user GROUP BY created_at")

# Display the query result
result.show()

# Import libraries for data visualization
import matplotlib.pyplot as plt

# Convert the query result to a Pandas DataFrame
df = result.to_pandas()

# Create a bar chart to visualize the data
df.plot.bar(x='date', y='count')
plt.show()

If you are interested in learning more, please check out the resources listed below:

Docs | Tutorial-3: Integrate with Jupyter Notebook with Python Binding Library

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added initial support for ownership model.
Added support for Hash Join spill.
Columns in Databend are now nullable by default. Please refer to the documentation Docs | NULL Values and NOT NULL Constraint for more details.
Read the document Docs | databend-local to learn the Databend Local mode.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Creating UDFs in JavaScript

PR #12729 | feat: implement udf server in databend is expected to be merged this week. This means that Databend will soon support creating user-defined functions in Python.

CREATE FUNCTION [IF NOT EXISTS] <udf_name> (<arg_type>, ...) RETURNS <return_type> LANGUAGE <language> HANDLER=<handler> ADDRESS=<udf_server_address>

We will provide support for UDFs created in various languages, with JavaScript being the next one in line.

Issue #12746 | Feature: support javascript udf

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.96-nightly...v1.2.109-nightly

This Week in Databend #109

What's On In Databend

Stay connected with the latest news about Databend.

Understanding Cluster Key

By defining a Cluster Key, you can enhance query performance by clustering tables. In this case, data will be organized and grouped based on the Cluster Key, rather than relying solely on the order in which data is ingested. This optimizes the data retrieval logic for large tables and accelerates queries.

When using the COPY INTO or REPLACE INTO command to write data into a table that includes a cluster key, Databend will automatically initiate a re-clustering process, as well as a segment and block compact process.

Clustering or re-clustering a table takes time. Databend suggests defining cluster keys primarily for sizable tables that experience slow query performance.

If you are interested in learning more, please check out the resources listed below.

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Exploring Databend Local Mode

Databend Local Mode aims to provide a simple Databend environment where users can directly interact with Databend using SQL without the need to perform a full Databend deployment. This makes it convenient for developers to perform basic data processing using Databend.

databend-query local starts a temporary databend-query process which functions as both client and server.The storage is located in a temporary directory and its lifecycle follows the process. Once the process ends, the resources are also destroyed. You can start multiple local processes on one server, and their resources are isolated from each other.

❯ alias databend-local="databend-query local"
❯ echo " select sum(a) from range(1, 100000) as t(a)" | databend-local
4999950000

❯ databend-local

databend-local:) select number %3 n, number %4 m, sum(number) from numbers(1000000000) group by n,m limit 3 ;

┌───────────────────────────────────┐
│   n   │   m   │    sum(number)    │
│ UInt8 │ UInt8 │    UInt64 NULL    │
├───────┼───────┼───────────────────┤
│     0 │     0 │ 41666666833333332 │
│     1 │     0 │ 41666666166666668 │
│     2 │     0 │ 41666666500000000 │
└───────────────────────────────────┘
0 row result in 1.669 sec. Processed 1 billion rows, 953.67 MiB (599.02 million rows/s, 4.46 GiB/s)

If you need to use Databend in a production environment, we recommend deploying a Databend cluster according to the official documentation or using Databend Cloud. databend-local is a good choice for developers to quickly leverage Databend's capabilities without the need to deploy a full Databend instance.

If you are interested in learning more, please check out the resources listed below:

PR | chore(query): improve databend local

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added initial support for MERGE INTO.
Implemented SQLsmith testing framework to support more accurate fuzzy testing.
Read the document Docs | Setting Environment Variables for how to change Databend configurations through environment variables.
Implemented json_strip_nulls and json_typeof functions. You can also read Docs | Semi-Structured Functions to check out Databend functions for semi-structured data processing.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Optimizing the implementation of MERGE INTO

In PR #12350 | feat: support Merge-Into V1, Databend has initially supported the MERGE INTO syntax.

Based on this, there are many optimizations worth considering, such as providing parallel and distributed implementations, reducing I/O, and simplifying data block splitting.

Issue #12595 | Feature: Merge Into Optimizations

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.87-nightly...v1.2.96-nightly

This Week in Databend #108

What's On In Databend

Stay connected with the latest news about Databend.

Multiple Catalogs

Catalog is a fundamental organizational concept in Databend and assists in efficiently managing and accessing your data sources.

Databend allows Multiple Catalogs and currently supports two types of external catalogs: Apache Iceberg and Apache Hive.

With external catalogs, there's no longer a need to load your external data into Databend before querying

-- Create a Hive catalog
CREATE CATALOG hive_ctl 
TYPE = HIVE 
CONNECTION =(
    METASTORE_ADDRESS = '127.0.0.1:9083' 
    URL = 's3://databend-toronto/' 
    AWS_KEY_ID = '<your_key_id>' 
    AWS_SECRET_KEY = '<your_secret_key>' 
);

If you are interested in learning more, please check out the resources listed below.

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Using `cargo-pgo` to Practice PGO

Profile-guided optimization (PGO) is a common compilation optimization technique that utilizes profiling information collected during program runtime to guide the compiler in making optimizations, thereby improving program performance.

According to the tests, PGO boosts Databend performance by up to 10%. The actual performance depends on your workloads. Try PGO for your Databend cluster.

If you are interested in learning more, please check out the resources listed below:

Docs | Profile Guided Optimization (PGO)

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Introduced the json_pretty function.
Improved performance of Inner Join.
Added more metrics to the HTTP query interface.
Implemented support for SHOW DATABASES and SHOW TABLES in Hive Catalog.
Read Blog | Revolutionizing Data Archival and Query Performance for Pharmaceutical Group to check out how Databend played a pivotal role in transforming data archiving and query performance for a Pharmaceutical Group.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Using SQLsmith for Fuzz Testing

SQLsmith focuses on generating random, type-aware and column-aware SQL queries that can often pass semantic checks, thereby further testing the execution logic of databases.

In the past, Databend added support for SQLancer and traditional fuzz testing with LibAFL. The Databend team now plans to introduce SQLsmith for domain-aware fuzz testing, which will provide more comprehensive and targeted test results to increase the likelihood of discovering vulnerabilities.

Issue #12576 | Feature: Using sqlsmith to support sql fuzzy testing

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.74-nightly...v1.2.87-nightly

This Week in Databend #107

What's On In Databend

Stay connected with the latest news about Databend.

Understanding Connection Parameters

The connection parameters refer to a set of essential connection details required for establishing a secure link to supported external storage services, like Amazon S3. These parameters are enclosed within parentheses and consists of key-value pairs separated by commas or spaces. It is commonly utilized in operations such as creating a stage, copying data into Databend, and querying staged files from external sources.

For example, the following statement creates an external stage on Amazon S3 with the connection parameters:

CREATE STAGE my_s3_stage
's3://load/files/'
CONNECTION = (
    ACCESS_KEY_ID = '<your-access-key-id>',
    SECRET_ACCESS_KEY = '<your-secret-access-key>'
);

If you are interested in learning more, please check out the resources listed below.

Docs | SQL Reference - Connection Parameters

Adding Storage Parameters for Hive Catalog

Over the past week, Databend introduced storage parameters for the Hive Catalog, allowing the configuration of specific storage services. This means that the catalog no longer relies on the storage backend of the default catalog.

The following example shows how to create a Hive Catalog using MinIO as the underlying storage service:

CREATE CATALOG hive_ctl 
TYPE = HIVE 
CONNECTION =(
    ADDRESS = '127.0.0.1:9083' 
    URL = 's3://warehouse/' 
    AWS_KEY_ID = 'admin' 
    AWS_SECRET_KEY = 'password' 
    ENDPOINT_URL = 'http://localhost:9000/'
)

If you are interested in learning more, please check out the resources listed below.

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Using `gitoxide` to Speed Up Git Dependency Downloads

gitoxide is a high-performance, modern Git implementation written in Rust. Utilizing the gitoxide feature of cargo (Unstable), the gitoxide crate can replace git2 to perform various Git operations, thereby achieving several times performance improvement when downloading crates-index and git dependencies.

Databend has recently enabled this feature for `cargo {build | clippy | test}`` in CI. You can also try to add the -Zgitoxide option to speed up the build process during local development:

cargo -Zgitoxide=fetch,shallow-index,shallow-deps build

If you are interested in learning more, please check out the resources listed below:

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

VALUES clause can be used without being combined with SELECT.
You can now set a default value when modifying the type of a column. See Docs | ALTER TABLE COLUMN for details.
Databend can now automatically recluster a table after write operations such as COPY INTO and REPLACE INTO.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Enhancing `infer_schema` for All File Locations

Currently, it is possible to query files using file locations or from stages in Databend.

select * from 'fs:///home/...';
select * from 's3://bucket/...';
select * from @stage;

However, the infer_schema function only works with staged files. For example:

select * from infer_schema(location=>'@stage/...');

When attempting to use infer_schema with other file locations, it leads to a panic:

select * from infer_schema(location =>'fs:///home/...'); -- this will panic.

So, the improvement involves extending the infer_schema capability to encompass all types of file paths, not limited to staged files. This will enhance system consistency and the usefulness of the infer_schema function.

Issue #12458 | Feature: infer_schema support normal file path

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.62-nightly...v1.2.74-nightly

This Week in Databend #106

What's On In Databend

Stay connected with the latest news about Databend.

Masking Policy

A masking policy refers to rules and settings that control the display or access to sensitive data in a way that safeguards confidentiality while allowing authorized users to interact with the data. Databend enables you to define masking policies for displaying sensitive columns in a table, thus protecting confidential data while still permitting authorized roles to access specific parts of the data.

-- Create a masking policy
CREATE MASKING POLICY email_mask
AS
  (val string)
  RETURNS string ->
  CASE
  WHEN current_role() IN ('MANAGERS') THEN
    val
  ELSE
    '*********'
  END
  COMMENT = 'hide_email';

-- Associate the masking policy with the 'email' column
ALTER TABLE user_info MODIFY COLUMN email SET MASKING POLICY email_mask;

Masking Policy requires Enterprise Edition. To inquire about upgrading, please contact Databend Support.

If you are interested in learning more, please check out the resources listed below.

Docs | MASKING POLICY

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Adding `show()` Method to Python Binding

In Python packages such as PySpark, DuckDB, and DataFusion, the show() method is supported for outputting the first n rows of results.

Recently, Databend has also implemented corresponding support for Python binding through PyO3. The code snippet is as follows:

#[pyo3(signature = (num=20))]
fn show(&self, py: Python, num: usize) -> PyResult<()> {
    let blocks = self.collect(py)?;
    let bs = self.get_box();
    let result = blocks.box_render(num, bs.bs_max_width, bs.bs_max_width);

    // Note that println! does not print to the Python debug console and is not visible in notebooks for instance
    let print = py.import("builtins")?.getattr("print")?;
    print.call1((result,))?;
    Ok(())
}

If you are interested in learning more, please check out the resources listed below:

Issue #12255 | For Python compatibility, please add show() method

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for distributed REPLACE INTO.
Added support for the operator <-> to calculate the L2 norm (Euclidean norm) of a vector.
Added Geo functions: h3_to_center_child, h3_exact_edge_length_m, h3_exact_edge_length_km, h3_exact_edge_length_rads, h3_num_hexagons, h3_line, h3_distance, h3_hex_ring and h3_get_unidirectional_edge.
Read document Docs | ALTER TABLE COLUMN to learn how to modify a table by adding, converting, renaming, changing, or removing a column.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Adding Storage Backend Support for Hive Catalog

Previously, Databend's implementation of the Hive Catalog lacked support for configuring its own storage backend and could only fall back to the storage backend corresponding to the Default Catalog. This caused issues when the storage service pointed to by Hive MetaStore was inconsistent with the configuration of Default Catalog, resulting in an inability to read data.

Now there are plans to introduce CONNECTION options for Hive Catalog, allowing configuration of the storage backend and resolving issues with heterogeneous storage.

CREATE CATALOG hive_ctl
TYPE=HIVE
HMS_ADDRESS='127.0.0.1:9083'
CONNECTION=(
    URL='s3://warehouse/'
    AWS_KEY_ID='admin'
    AWS_SECRET_KEY='password'
    ENDPOINT_URL='http://localhost:9000'
);

Issue #12407 | Feature: Add storage support for Hive catalog

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@parkma99 made their first contribution in #12341. Fixed parsing issue of CREATE ROLE.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.51-nightly...v1.2.62-nightly

This Week in Databend #105

What's On In Databend

Stay connected with the latest news about Databend.

Loading Data with Debezium

Debezium is a set of distributed services to capture changes in your databases. debezium-server-databend is a lightweight CDC tool developed by Databend, based on Debezium Engine.

This tool provides a simple way to monitor and capture database changes, transforming them into consumable events without the need for large data infrastructures like Flink, Kafka, or Spark.

If you are interested in learning more, please check out the resources listed below.

Docs | Loading Data with Tools - Debezium

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Getting started with Iceberg using Databend

Apache Iceberg is a high-performance open table format designed for large-scale analytics workloads. It is known for its simplicity and reliability.

With Databend's Multiple Catalog capability and IceLake's Iceberg Rust implementation, Databend now supports mounting and analyzing data stored in Iceberg table format in the form of a Catalog.

CREATE CATALOG iceberg_ctl
TYPE=ICEBERG
CONNECTION=(
    URL='s3://warehouse/path/to/db'
    AWS_KEY_ID='admin'
    AWS_SECRET_KEY='password'
    ENDPOINT_URL='your-endpoint-url'
);

If you are interested in learning more, please check out the resources listed below:

Blog | Feature Preview: Iceberg Integration with Databend

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added Transport Layer Security (TLS) support for MySQL handler.
Added Geo functions: h3_to_string, string_to_h3, h3_is_res_class_iii, h3_is_pentagon, h3_get_faces, h3_cell_area_m2, and h3_cell_area_rads2.
Read document Docs | Network Policy to learn how to manage network policies in Databend.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Native `async fn` in Trait

Since the MVP of native async-fn-in-trait in Rust was launched in November 2022, it has been available for testing on Rust nightly channel. It may be time to evaluate the status of this feature and consider using it instead of async_trait.

#![feature(async_fn_in_trait)]

trait Database {
    async fn fetch_data(&self) -> String;
}

impl Database for MyDb {
    async fn fetch_data(&self) -> String { ... }
}

Issue #12201 | Refactor: use native async fn in trait syntax

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.42-nightly...v1.2.51-nightly

This Week in Databend #104

What's On In Databend

Stay connected with the latest news about Databend.

Loading Data with Kafka

Apache Kafka is an open-source distributed event streaming platform that allows you to publish and subscribe to streams of records.

Databend provides an efficient data ingestion tool (bend-ingest-kafka), specifically designed to load data from Kafka into Databend.

If you are interested in learning more, please check out the resources listed below.

Docs | Loading Data with Tools - Kafka

Loading Data with dbt

dbt is a transformation workflow that helps you get more work done while producing higher quality results.

dbt-databend-cloud is a plugin developed by Databend. By utilizing this plugin, you can seamlessly perform data modeling, transformation, and cleansing tasks using dbt and conveniently load the output into Databend.

If you are interested in learning more, please check out the resources listed below.

Docs | Loading Data with Tools - dbt

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Developing with Databend using Node.js

Databend now provides Databend Driver Node.js Binding, which means you can easily build and integrate applications with Databend using Node.js.

const { Client } = require('databend-driver');

const dsn = process.env.DATABEND_DSN
    ? process.env.DATABEND_DSN
    : "databend://user1:abc123@localhost:8000/default?sslmode=disable";

async function create_conn() {
    this.client = new Client(dsn);
    this.conn = await this.client.getConn();
    console.log('Connected to Databend Server!');
}

async function select_books() {
    var sql = "CREATE TABLE IF NOT EXISTS books(title VARCHAR, author VARCHAR, date VARCHAR)";
    await this.conn.exec(sql);
    console.log("Table created");

    var sql = "INSERT INTO books VALUES('Readings in Database Systems', 'Michael Stonebraker', '2004')";
    await this.conn.exec(sql);
    console.log("1 record inserted");

    var sql = "SELECT * FROM books";
    const rows = await this.conn.queryIter(sql);
    const ret = [];
    let row = await rows.next();
    while (row) {
        ret.push(row.values());
        row = await rows.next();
    }
    console.log(ret);
}

create_conn().then(conn => {
    select_books()
});

If you are interested in learning more, please check out the resources listed below:

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for GROUP BY ALL.
Added the capability to speed up Common Table Expressions (CTEs) by materialization.
Added Geo functions: h3_to_geo, h3_to_geo_boundary, h3_k_ring, h3_is_valid,h3_get_resolution, h3_edge_length_m and h3_edge_length_km.
Added array lambda functions: array_transform, array_apply and array_filter.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Copying Files from One Stage to Another

Databend now supports analysis capabilities for over a dozen different storage service providers. This opens up more possibilities for data management, and copying files from one stage to another can be a beneficial starting point.

copy files from @mystage1/path/ to @mystage2;
copy files from @mystage1/path/f1.csv to @mystage2/archive/[f1.csv];

Issue #12200 | Feature: support copy files from one stage to other stage

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.32-nightly...v1.2.42-nightly

This Week in Databend #103

What's On In Databend

Stay connected with the latest news about Databend.

Creating Network Policies

Network Policies are used to manage access to Databend services. They can also be utilized to restrict access to a SQL user account based on users' IP addresses.

To create a set of network policies:

CREATE NETWORK POLICY <policy-name> ALLOWED_IP_LIST=(<allowed-ip>) BLOCKED_IP_LIST=(<blocked-ip>) COMMENT=<policy-comment>

To restrict access to a SQL user account:

CREATE USER <user-name> IDENTIFIED BY <user-password> WITH SET NETWORK POLICY=<policy-name>

--- OR ---

ALTER USER <user-name> WITH SET NETWORK POLICY=<policy-name>

If you are interested in learning more, please check out the resources listed below.

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Profiling Table Scan

Databend now supports wait_time for profiling table scans. This feature offers insights into time spent reading data from storage, assisting in determining whether a query is I/O or CPU bound.

If you are interested in learning more, please check out the resources listed below:

PR #12158 | feat(planner): Support profile table scan

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added intersect_count and to_yyyymmddhh functions.
Added fuse_column system function.
Read document Docs | RECLUSTER TABLE to understand the updated reclustering mechanism.
Read documents Docs | Fuse Engine and Docs | ALTER TABLE OPTION to learn the options available in Fuse Engine and how to modify them.
Read document Docs | Vacuum Dropped Tables to learn about our latest enterprise feature and how it can help you save storage space.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Renaming TSV file format to Text

Databend supports a file format called TSV, which may mislead users into thinking that its only difference from CSV is the field delimiter.

In reality, TSV corresponds to the text format of MySQL/Hive/PostgreSQL and uses escaping instead of quoting to handle delimiters in values (MySQL text supports quoting but not by default). ClickHouse also uses it for transferring data to/from MySQL.

Therefore, we suggest renaming TSV to TEXT.

Issue #11987 | Rename TSV file format to Text

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@ben1009 made their first contribution in #12144. Fixed the typo in the error message.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.25-nightly...v1.2.32-nightly

This Week in Databend #102

What's On In Databend

Stay connected with the latest news about Databend.

Creating Bloom Indexes for Specified Columns

Creating bloom indexes consumes a significant amount of CPU resources. For wide tables, where only a few columns may require point queries or data ingestion performance is more important, creating bloom indexes for all columns may not be a good idea.

The bloom_index_columns option was introduced to the Databend Fuse Engine, allowing you to specify which columns you want to create bloom indexes for.

To create a table with bloom indexes:

CREATE TABLE table_name (
  column_name1 column_type1,
  column_name2 column_type2,
  ...
) ... bloom_index_columns='columnName1[, ...]'.

To create or modify bloom indexes for an existing table:

After modifying the Bloom index options, Databend will not create indexes for existing data. The changes will only affect the subsequent data.

ALTER TABLE <db.table_name> SET OPTIONS(bloom_index_columns='columnName1[, ...]');

To disable the bloom indexing:

ALTER TABLE <db.table_name> SET OPTIONS(bloom_index_columns='');

If you are interested in learning more, please check out the resources listed below.

PR #12048 | feat: support specify bloom index columns

Databend SQL Conformance

Databend aims to conform to the SQL standard, with particular support for ISO/IEC 9075:2011, also known as SQL:2011. Databend incorporates many features required by the SQL standard, often with slight differences in syntax or function.

We have summarized the level of conformity of Databend to the SQL:2011 standard, hoping it can help you further understand Databend's SQL Conformance.

If you are interested in learning more, please check out the resources listed below.

Docs | SQL Conformance

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Understanding Databend Recluster Pipeline

A well-clustered table may become chaotic in some storage blocks negatively affecting the query performance. For example, the table continues to have DML operations (INSERT / UPDATE / DELETE).

The re-clustering operation does not cluster the table from the ground up. It selects and reorganizes the most chaotic existing storage blocks by calculating based on the clustering algorithm.

The recluster pipeline is as follows:

┌──────────┐     ┌───────────────┐     ┌─────────┐
│FuseSource├────►│CompoundBlockOp├────►│SortMerge├────┐
└──────────┘     └───────────────┘     └─────────┘    │
┌──────────┐     ┌───────────────┐     ┌─────────┐    │     ┌──────────────┐     ┌─────────┐
│FuseSource├────►│CompoundBlockOp├────►│SortMerge├────┤────►│MultiSortMerge├────►│Resize(N)├───┐
└──────────┘     └───────────────┘     └─────────┘    │     └──────────────┘     └─────────┘   │
┌──────────┐     ┌───────────────┐     ┌─────────┐    │                                        │
│FuseSource├────►│CompoundBlockOp├────►│SortMerge├────┘                                        │
└──────────┘     └───────────────┘     └─────────┘                                             │
┌──────────────────────────────────────────────────────────────────────────────────────────────┘
│         ┌──────────────┐
│    ┌───►│SerializeBlock├───┐
│    │    └──────────────┘   │
│    │    ┌──────────────┐   │    ┌─────────┐    ┌────────────────┐     ┌─────────────────┐     ┌──────────┐
└───►│───►│SerializeBlock├───┤───►│Resize(1)├───►│SerializeSegment├────►│TableMutationAggr├────►│CommitSink│
     │    └──────────────┘   │    └─────────┘    └────────────────┘     └─────────────────┘     └──────────┘
     │    ┌──────────────┐   │
     └───►│SerializeBlock├───┘

If you are interested in learning more, please check out the resources listed below:

feat(storage): improve optimize and recluster

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for distributed COPY INTO.
Read document Docs | ATTACH TABLE to learn how to attach an existing table to another one.
Read documents Docs | Deepnote and Docs | MindsDB to learn how Databend can better collaborate with your data science projects.
Read documents Docs | Window Functions and Docs | Bitmap Functions to fully understand the BITMAP and window functions supported by Databend.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Accelerate CTE through Materialization

Inlining Common Table Expression (CTE) is a good idea, but if the CTE is particularly heavy, such as in TPCH Q15, the cost can be prohibitively expensive. In this case, it's better to introduce materialization for the expensive CTE.

--- TPCH Q15
WITH revenue AS
  (SELECT l_suppkey AS supplier_no,
          sum(l_extendedprice * (1 - l_discount)) AS total_revenue
   FROM lineitem
   WHERE l_shipdate >= TO_DATE ('1996-01-01')
     AND l_shipdate < TO_DATE ('1996-04-01')
   GROUP BY l_suppkey)
SELECT s_suppkey,
       s_name,
       s_address,
       s_phone,
       total_revenue
FROM supplier,
     revenue
WHERE s_suppkey = supplier_no
  AND total_revenue =
    (SELECT max(total_revenue)
     FROM revenue)
ORDER BY s_suppkey;

Issue #12067 | Feature: speed up CTE by materialization

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.14-nightly...v1.2.25-nightly

This Week in Databend #101

What's On In Databend

Stay connected with the latest news about Databend.

Tutorial | Create External Stage with AWS IAM User

AWS Identity and Access Management (IAM) allows you to control access to AWS services and resources by specifying authorized users or entities. It also enables centralized management of detailed permissions, as well as analysis of access patterns for refining permissions across the entire AWS platform.

Databend supports creating External Stage using AWS IAM User, making access to AWS S3 object storage more secure and reliable.

If you are interested in learning more, please check out the resources listed below.

Docs | Create External Stage with AWS IAM User

Tutorial | Create External Stage on Cloudflare R2

Cloudflare R2 gives you the freedom to create the multi-cloud architectures you desire with an S3-compatible object storage.

You can use Databend to create external storage on Cloudflare R2 and further efficiently mine the value of data.

If you are interested in learning more, please check out the resources listed below.

Docs | Create External Stage on Cloudflare R2

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

New System Table: `system.backtrace`

async-backtrace provides efficient and logical "stack" traces for asynchronous functions.

Databend has built the ability of async-backtrace into the system table. Developers can obtain trace information by querying the system.backtrace system table through SELECT statement, which will help further debugging and troubleshooting.

Furthermore, if you are in cluster mode, executing a query against system.backtrace on any node will return the call stack of the entire cluster.

If you are interested in learning more, please check out the resources listed below:

feat: add system.backtrace table

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added bitmap_not_count, bitmap_union and bitmap_intersect aggregate functions.
Added cume_dist window function.
Added support for ATTACH TABLE.
Added system.metrics table and metrics for measuring spill and transformation operations.
Added support for converting stored computed column into a regular column.
Implemented task suggestion for Serverless Background Service.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Add Random Seed for Random Engine

Databend previously implemented a random engine that supports generating random tables using random data. In order to obtain deterministic and reproducible test results, we plan to introduce random seed.

CREATE table(t int, time timestamp) ENGINE=RANDOM(1000)

Issue #11863 | Feature: Support to add random seed on random engine

Please let us know if you're interested in contributing to this feature, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@dacozai made their first contribution in #11956. Updated dsn in develop/Rust docs.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.2.4-nightly...v1.2.14-nightly

This Week in Databend #100

For security reasons, the Root user is no longer available out of the box. You must configure it before use. Learn more at https://docs.databend.com/doc/sql-clients/admin-users .

What's On In Databend

Stay connected with the latest news about Databend.

Announcing Databend v1.2! Data + AI

Databend v1.2 was officially released on June 29, 2023! Thanks to all the community partners who participated and to everyone who contributed to making Databend better!

New Data Type: BITMAP
Direct Query of CSV/TSV/NDJSON Files Using Column Position
New Hash Table: Improved Hash Join Performance
AI Functions
Computed Columns
VACUUM TABLE
Serverless Background Service
Bind databend into Python
BendSQL - Databend Native Command Line Tool
Integration with Apache DolphinScheduler, Apache Flink CDC and Tableau

If you are interested in learning more, please check out the resources listed below.

What's Fresh in Databend v1.2 | Blog | Databend

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Databend Long Run Tests

Databend's long run tests the correctness and performance of the system under heavy load and concurrency. This includes concurrent large-scale data ingestion, table maintenance (optimization, re-clustering, and vacuuming), as well as querying.

The test will run a series of SQL and validation commands to verify the results. It will begin by executing the pre-test scripts (_before.sh), followed by repeatedly running concurrent test scripts, and finally executing post-test scripts (_after.sh). All event logs will be stored in a table on Databend for further analysis.

Databend conducts long run tests to verify the correctness and performance of the system under heavy load and concurrency. These tests involve concurrent ingestion of large-scale data, table maintenance (optimization, re-clustering, and vacuuming), as well as querying.

During the testing process, a series of SQL commands and validation checks will be performed to ensure accurate results. The testing process will start by running pre-test scripts (_before.sh), followed by repeated execution of concurrent test scripts, and finally executing post-test scripts (_after.sh). All event logs will be stored in a Databend table for further analysis.

                      +-------------------+
                      |     Long Run      |
                      +-------------------+
                               |
                               |
                               v
                  +-----------------------+
                  |  Before Test Scripts  |
                  +-----------------------+
                               |
                               |
                               v
        +----------------------------------+
        |     Concurrent Test Scripts      |
        +----------------------------------+
        |              |                   |
        |              |                   |
        v              v                   v
+----------------+ +----------------+ +----------------+
|  Test Script 1 | |  Test Script 2 | |  Test Script 3 |
+----------------+ +----------------+ +----------------+
                               |
                               |
                               v
              +-----------------------+
              |   After Test Scripts  |
              +-----------------------+

If you are interested in learning more, please check out the resources listed below:

Databend Long Run Tests

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added more execution information in system.query_profile, which makes it easier than ever to profile your queries.
Added the basic read support for Iceberg table.
Added support for ntile window function.
Added support for distributed copy into (first version).
Read documents Docs | Loading Data with Tools - Addax and Docs | Loading Data with Tools - DataX to learn how to import data efficiently and conveniently.
Read document Docs | Working with Stages - Staging Files to learn how to use the Presigned URL method for uploading files to the stage.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Release Proposal: Nightly v1.3

Databend v1.3 is scheduled for release on August 1st and will primarily focus on enhancing stability.

Task	Status
(Query) JSON indexing#6994	IN PROGRESS
(Query+Storage) Create index feature	IN PROGRESS
(Query+Storage)Distributed COPY#8594	IN PROGRESS
(Query+Storage) Distributed REPLACE	PLAN
COPY returns more status	PLAN
(Query+Storage) Query apache/iceberg	IN PROGRESS
(Processor) OrderBy Spill	IN PROGRESS
(Stability) Fast update/delete with fuse engine	IN PROGRESS
(Stability) Query profiling	IN PROGRESS
(Test) Longrun framework:BendRun	IN PROGRESS

Issue #11868 | Release proposal: Nightly v1.3

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.72-nightly...v1.2.4-nightly

This Week in Databend #99

What's On In Databend

Stay connected with the latest news about Databend.

Flink CDC

Databend now supports Flink CDC (Change Data Capture), which allows you to capture and process real-time data changes from various sources using SQL-based queries. With Flink CDC, you can monitor and capture data modifications (inserts, updates, and deletes) happening in a database or streaming system and react to those changes in real-time.

Databend's Flink SQL connector offers a connector that integrates Flink's stream processing capabilities with Databend. By configuring this connector, you can capture data changes from various databases as streams and load them into Databend for processing and analysis in real-time.

If you are interested in learning more, please check out the resources listed below:

Docs | Loading Data with Tools - Flink CDC

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Register External Tables with Databend's Python Binding

Databend's Python binding now allows the registration of external tables. You can use the following functions to register external tables in Databend:

register_parquet
register_ndjson
register_csv
register_tsv

Here is an example of how to use the register_parquet function to register an external table named "ontime" in Databend:

from databend import SessionContext

ctx = SessionContext()

ctx.register_parquet("ontime", "./ontime/", pattern = ".*.parquet")

df = ctx.sql("select * from ontime limit 10").collect()

print(df)

This code registers Parquet files located in the ./ontime/ directory with the pattern .*.parquet . You can then use the registered table name ontime in SQL queries to access the data.

If you are interested in learning more, please check out the resources listed below:

PR | feat(python): support register table functions

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for renaming columns with ALTER TABLE [ IF EXISTS ] <name> RENAME COLUMN <col_name> TO <new_col_name>.
Added support for using column position when querying CSV and TSV files.
Added system.background_jobs and system.background_tasks tables.
Added http query deduplication via X-DATABEND-DEDUPLICATE-LABEL header.
Added support for distributed deletion.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Implement Read-Only Mode for Databend Clusters

Databend plans to introduce a Read-Only Mode for clusters. This will provide users with more control over resource allocation.

Introducing Read-Only Mode allows separate clusters for read and write operations, preventing accidental writes in the read cluster, avoiding data loss, and enhancing performance.

Issue #11836 | feat: read-only mode for Databend clusters

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.64-nightly...v1.1.72-nightly

This Week in Databend #98

What's On In Databend

Stay connected with the latest news about Databend.

Background Service

Databend Internal Storage FuseTable is similar to Apache Iceberg, a Log Structured Table that requires regular table compaction, re-clustering, and vacuuming to merge small data chunks. The process involves sorting the data by the cluster key or vacuuming unneeded branches.

Previously, different drivers were used for these implementations which added complexity to the infrastructure. Additional services had to be deployed and maintained to trigger driver events. To simplify this process, we have introduced a Background Service that allows Databend to run as a background one-shot job or daemon for running cron jobs. These jobs can trigger table maintenance tasks such as automatic compaction/vacuum/reclustering based on criteria without additional maintenance needed.

This implementation includes:

Complete metasrv schema definition and background_job and background_tasks.
APIs for updating and maintaining background_job and background_task state on meta-service.
Simplified job scheduler implementation which support one_shot, interval, cron job type.

Background Service requires Enterprise Edition. To inquire about upgrading, please contact Databend Support.

If you are interested in learning more, please check out the resources listed below:

PR | feat: implement Background service for databend

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

IceLake - Pure Rust Iceberg Implementation

Iceberg, an open table format for analytics, lacks a mature Rust binding, making integration with databases like Databend difficult. IceLake aims to fill this gap and build an open ecosystem that:

Users can read/write iceberg tables from ANY storage services like s3, gcs, azblob, hdfs and so on.
ANY Databases can integrate with icelake to facilitate reading and writing of iceberg tables.
Provides NATIVE support transmute between arrows.
Provides bindings so that other languages can work with iceberg tables powered by Rust core.

If you are interested in learning more, please check out the resources listed below:

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for MERGE JOIN.
Add support for column position to CSV format.
Read Docs | Computed Columns to understand how to use computed columns and the trade-offs when choosing which type to adopt.
Read Docs | Subquery-Based Deletions to learn how to use subquery operators and comparison operators to achieve the desired deletion.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Support `VALIDATION_MODE` for COPY INTO

We hope to support the VALIDATION_MODE for Databend's COPY INTO statement, which can be used to validate the data that needs to be loaded (but not actually loaded) and return results based on specified validation options.

RETURN_ERRORS: This mode validates the data and returns all errors.
RETURN_<number>_ROWS: This mode validates <number> rows of data. If there are no errors, it returns the loaded information. If there are any errors encountered, it will throw errors.

Issue #11582 | Feature: copy support VALIDATION_MODE

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@jonahgao made their first contribution in #11718. Fixed column types of MySQLClient.
@akoshchiy made their first contribution in #11783. Updated MACOSX_DEPLOYMENT_TARGET value.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.56-nightly...v1.1.64-nightly

This Week in Databend #97

What's On In Databend

Stay connected with the latest news about Databend.

Column Position

Databend now offers support for utilizing syntax like $N to represent column positions. For instance, $2 indicates the second column. Additionally, Databend allows the usage of column positions alongside column names in SQL statements. Here is a simple example:

CREATE TABLE IF NOT EXISTS t1(a int, b varchar);
INSERT INTO t1 values (1, 'a'), (2, 'b');
select $1, $2, a, b from t1;

┌─────────────────────────────────┐
│   $1  │   $2   │   a   │    b   │
│ Int32 │ String │ Int32 │ String │
├───────┼────────┼───────┼────────┤
│     1 │ a      │     1 │ a      │
│     2 │ b      │     2 │ b      │
└─────────────────────────────────┘

You can also use column positions when you SELECT FROM a staged NDJSON file. We are also actively working on extending this support to other formats. When using the COPY INTO statement to copy data from a stage, Databend matches the field names at the top level of the NDJSON file with the column names in the destination table, rather than relying on column positions.

SELECT $1 FROM @my_stage (FILE_FORMAT=>'ndjson')

COPY INTO my_table FROM (SELECT $1 SELECT @my_stage t) FILE_FORMAT = (type = NDJSON)

It is important to note that when using the SELECT statement for NDJSON in Databend, only $1 is allowed, representing the entire row and having the data type variant.

-- Select the entire row using column position:
SELECT $1 FROM @my_stage (FILE_FORMAT=>'ndjson')

--Select a specific field named "a" using column position:
SELECT $1:a FROM @my_stage (FILE_FORMAT=>'ndjson')

If you are interested in learning more, please check out the resources listed below:

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Learn Databend Workflows - Typos Check

Databend now has a very complex workflow for handling code auditing, testing, benchmarking and release. Typos Check is undoubtedly the simplest part of it. Let's take a look at some of its contents together.

Like other workflows, we need to use actions/checkout to check out the code.

- uses: actions/checkout@v4
  with:
    clean: "true"

typos-cli is a source code spell checker that finds and corrects spelling mistakes in source code. It is fast enough to run on monorepos and has low false positives, making it suitable for use on PRs.

- uses: baptiste0928/cargo-install@v1
  with:
    crate: typos-cli
    args: --locked
    cache-key: typos-check

We use baptiste0928/cargo-install to install dependencies. It is essentially the same as using cargo install in your GitHub workflows. Additionally, it allows for automatic caching of resulting binaries to speed up subsequent builds.

- name: do typos check with typos-cli
  run: typos

One thing to note is that typos-cli is the name of the crate, but the corresponding executable binary name is typos.

If you are interested in learning more, please check out the resources listed below:

Workflows | typos.yml

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added support for distributed Top-N.
The lazy_topn_threshold setting is now active by default, with a default value of 1,000.
For enhanced security measures, the ability to change the password has been added to the root user.
Read Blog | Databend X Tableau to learn how to connect Databend for BI data analysis in Tableau.
Read Docs | Integrating Databend as a Sink for Vector to understand how to integrate Vector with Databend.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Add a Deduplication Label Field to the Rest API

To ensure that data ingestion is idempotent, Databend now supports deduplication of DML through the use of a deduplication label. You can find more information on this feature at Docs | Setting Commands - SET_VAR.

To facilitate cross-language driver integration, we could add a REST API field for the label.

Issue #11710 | Feature: support to bring deduplication label on stage attachment api

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.55-nightly...v1.1.56-draft2

This Week in Databend #96

What's On In Databend

Stay connected with the latest news about Databend.

Virtual Column

One of the ways to optimize querying JSON internal fields is to create and generate virtual columns for these fields with the CREATE VIRTUAL COLCUM and GENERATE VIRTUAL COLUMN functions. This enables direct queries against the JSON internal fields through their corresponding virtual columns.

Databend now supports a series of syntax for creating, dropping, generating and altering virtual columns. Here is a simple example:

create table test (id int, val json);
insert into test values(1, '{"a":33,"b":44}'),(2, '{"a":55,"b":66}');
create virtual columns (val['a'], val['b']) for test;
generate virtual columns for test;

select val['a'], val['b'] from test;
+----------+----------+
| val['a'] | val['b'] |
+----------+----------+
| 33       | 44       |
| 55       | 66       |
+----------+----------+

If you are interested in learning more, please check out the resources listed below:

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Interactive Documents Powered by Askbend

Do you remember AskBend? AskBend is a Rust project that spices up Databend by incorporating the power of OpenAI for a SQL-based knowledge base created from markdown files.

Now, we have integrated it into the official website of Databend, providing an interactive experience for document queries.

If you are interested in learning more, please check out the resources listed below:

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added expression support to Windows function.
Added range function & aliases in select list
Added support for deduplication of insertion/mutation/update based on label.
Added optimization hint support for COPY INTO.
Read Docs | Deploying a Databend Cluster to learn about the tips when manually deploying a Databend cluster.
Read Docs | Data Types - BITMAP and Docs | Bitmap Functions to understand the usage of the BITMAP data type in Databend.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Optimize Sorting Algorithm with Loser Tree

Currently, Databend's sorting algorithm uses MergeSort based on a heap.

There are three main algorithms for multiple merging: heap sort, winner tree, and loser tree. For all of them, each heap adjustment of the heap sorting requires comparison with the left and right child nodes. The number of comparisons is 2logN while the comparison times for winner tree and loser tree adjustments are logN.

The contrast lies in how the two trees operate. In a winner tree, each node must be compared with its sibling nodes and then update the parent node. On the other hand, in a loser tree, a node only requires comparison with the parent node, which leads to fewer instances of memory access.

Issue #11604 | Feature: Update sort algorithm using Loser Tree

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.51-nightly...v1.1.55-nightly

This Week in Databend #95

What's On In Databend

Stay connected with the latest news about Databend.

Native COS Support

Databend now offers native support for COS (Cloud Object Storage)! In comparison to the S3 compatibility provided by COS, our native support offers several advantages:

Fixed Signing Errors: We have addressed signing errors that occurred in certain boundary cases by supporting COS's native signature algorithm.
Enhanced Functionality: In the future, we plan to provide additional features such as append and other functions that are not supported by S3 but are available in COS.
Integrated Authentication: With COS's native support, we can seamlessly integrate with Tencent Cloud's RAM (Resource Access Management), STS (Security Token Service), and other services. This integration enables us to provide authentication capabilities without relying on static keys.

If you are interested in learning more, please check out the resources listed below:

PR | feat(storage): Add native COS support

IEJoin

Databend has recently introduced support for IEJoin (Inequality Join), which is commonly used in various applications to join relational tables based on inequality conditions.

IEJoin

IEJoin is a fast algorithm specifically designed for inequality joins. The method sorts the columns that need to be joined into an array and uses a permutation array to indicate the position of tuples in one sorted array relative to another. Unlike sorted merge join, IEJoin utilizes space-efficient bit arrays that allow optimizations such as Bloom filter indexing for faster computation of join results.

If you are interested in learning more, please check out the resources listed below:

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Python Binding for Databend Rust Driver

PyO3 provides bindings between Rust and Python, along with a ready-to-use scaffolding tool called Maturin. With Maturin, it's easy to create Python extension modules based on Rust.

This allows for code reorganization where performance-critical parts can be rewritten in Rust while the rest of the code can still use original Python code.

Based on PyO3, we can focus more conveniently on the Rust implementation logic itself without paying too much attention to FFI (Foreign Function Interface) and conversion details. This makes it possible to convert Rust into a Python module at a lower cost, and only maintain one set of code, greatly reducing maintenance costs. Databend is implementing Python bindings for Rust drivers, with the ultimate goal of replacing the current databend-py.

Here are some code snippets that may help you understand how to use PyO3.

In asyncio.rs, a Struct called AsyncDatabendDriver can be defined as a Python class, and the Python module can be defined as databend-driver.

/// `AsyncDatabendDriver` is the entry for all public async API
#[pyclass(module = "databend_driver")]
pub struct AsyncDatabendDriver(Connector);

The corresponding methods for AsyncDatabendDriver are implemented next, with underlying calls made to the methods implemented in Rust's Trait. For example, in this code snippet, we see that exec is used:

#[pymethods]
impl AsyncDatabendDriver {
    #[new]
    #[pyo3(signature = (dsn))]
    pub fn new(dsn: &str) -> PyResult<Self> {
        Ok(AsyncDatabendDriver(build_connector(dsn)?))
    }

    /// exec
    pub fn exec<'p>(&'p self, py: Python<'p>, sql: String) -> PyResult<&'p PyAny> {
        let this = self.0.clone();
        future_into_py(py, async move {
            let res = this.connector.exec(&sql).await.unwrap();
            Ok(res)
        })
    }
}

Finally, add AsyncDatabendDriver as a Python class in lib.rs:

#[pymodule]
fn _databend_driver(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<AsyncDatabendDriver>()?;
    Ok(())
}

If you are interested in learning more, please check out the resources listed below:

PR | feat: Support python binding

Highlights

We have also made these improvements to Databend that we hope you will find helpful:

Added table lock for compact.
Added system.indexes and show indexes command.
Added support for azure openai api.
Added new bitmap-related functions: bitmap_and_count, bitmap_or_count, bitmap_xor_count, sub_bitmap, bitmap_subset_limit and bitmap_subset_in_range.
Added new window functions: lag, lead, first_value, last_value and nth_value.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Support for Custom Character Sets

A character set is a set of rules that defines how characters are represented, while collation determines the order of characters within the character set. Character sets are used to store data and can contain multiple collations.

It would be great to have support for custom character sets in the same way as MySQL, e.g., IBM z/OS uses EBCDIC encoding by default.

Issue #11567 | Feature: Support custom character sets

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@ZhengLin-Li made their first contribution in #11563. Updated bendsql command.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.43-nightly...v1.1.51-nightly

This Week in Databend #94

What's On In Databend

Stay connected with the latest news about Databend.

Computed Columns

Computed columns are generated from other columns by a scalar expression. There are two types of computed columns: stored and virtual.

A stored computed column computes and stores the result value when a row is inserted. Use this SQL syntax to create one:

column_name <type> AS (<expr>) STORED

While a virtual computed column is calculated at query time and does not store the result value. To create one, use this SQL syntax:

column_name <type> AS (<expr>) VIRTUAL

PR | feat(query): Support Computed columns

VACUUM TABLE

The VACUUM TABLE command helps to optimize the system performance by freeing up storage space through the permanent removal of historical data files from a table. This includes:

Snapshots associated with the table, as well as their relevant segments and blocks.
Orphan files. Orphan files in Databend refer to snapshots, segments, and blocks that are no longer associated with the table. Orphan files might be generated from various operations and errors, such as during data backups and restores, and can take up valuable disk space and degrade the system performance over time.

VACUUM TABLE requires Enterprise Edition. To inquire about upgrading, please contact Databend Support.

If you are interested in learning more, please check out the resources listed below:

Docs | VACUUM TABLE

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Enable Cache in Python Binding

Databend supports data caching and query result caching, which can effectively accelerate queries. The Python bindings of Databend also support these features, albeit with slight differences.

For query result caching, SQL statements can be used to set it up, which is very convenient.

>>> from databend import SessionContext 
>>> ctx = SessionContext()
>>> ctx.sql("set enable_query_result_cache = 1")

For data caching, it can be enabled through environment variables.

>>> import os 
>>> os.environ["CACHE_DATA_CACHE_STORAGE"] = "disk"
>>> from databend import SessionContext 
>>> ctx = SessionContext()
>>> ctx.sql("select * from system.configs where name like '%data_cache%'")
┌────────────────────────────────────────────────────────────────────────────┐
│  group  │                   name                   │  value  │ description │
│  String │                  String                  │  String │    String   │
├─────────┼──────────────────────────────────────────┼─────────┼─────────────┤
│ 'cache' │ 'data_cache_storage'                     │ 'disk'  │ ''          │
│ 'cache' │ 'table_data_cache_population_queue_size' │ '65536' │ ''          │
└────────────────────────────────────────────────────────────────────────────┘

Feel free to use it in your data science workflow:

databend · PyPI

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

Read Docs | Date & Time - Formatting Date and Time to learn how to precisely control the format of time and date.
Added support for transforming data when loading it from a URI.
Added support for replacing with stage attachment.
Added bitmap-related functions: bitmap_contains, bitmap_has_all, bitmap_has_any, bitmap_or, bitmap_and, bitmap_xor, etc.
Supported intdiv operator //.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Remove `if_not_exists` from the Meta Request

In CreateIndexReq/CreateTableReq, we use if_not_existed to indicate whether an index/table exists.

pub struct CreateIndexReq {
    pub if_not_exists: bool,
    pub name_ident: IndexNameIdent,
    pub meta: IndexMeta,
}

The if_not_exists clause only affects the outcome that is presented to the user, and does not alter the behavior of the meta-service operation.

Therefore, it will be more effective for SchemaApi to provide either a Created or an Exist status code, allowing the caller to determine whether to generate an error message.

Issue #11456 | Moving if_not_exists out of meta request body

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@silver-ymz made their first contribution in #11487. Added five bitmap-related functions.
@Jake-00 made their first contribution in #11503. Modified duplicate test case for SOUNDS LIKE syntax.
@gitccl made their first contribution in #11507. Added five bitmap-related functions and fixed panic when calling with empty bitmap.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.38-nightly...v1.1.43-nightly

This Week in Databend #93

The upgrade tool meta-upgrade-09 will no longer be available in the release package. If you're using Databend 0.9 or an earlier version, you can seek help from the community.

What's On In Databend

Stay connected with the latest news about Databend.

Databend's Segment Caching Mechanism Now Boasts Improved Memory Usage

Databend's segment caching mechanism has received a significant upgrade that reduces its memory usage to 1.5/1000 of the previous usage in a test scenario.

The upgrade involves a different "representation" of cached segments, called CompactSegmentInfo. This presentation consists mainly of two components:

The decoded min/max indexes and other statistical information.
The undecoded (and compressed) raw bytes of block-metas.

During segment pruning, if any segments are pruned, there is no need to decode the block-metas represented by raw bytes. If they are not pruned, then their raw bytes are decoded on-the-fly for block pruning and scanning purposes (and dropped if no longer needed).

If you are interested in learning more, please check out the resources listed below.

PR #11347 | refactor: hybrid segment cache

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Bind `databend` into Python

Databend now offers a Python binding that allows users to execute SQL queries against Databend using Python even without deploying a Databend instance.

To use this functionality, simply import SessionContext from databend module and create an instance of it:

from databend import SessionContext

ctx = SessionContext()

You can then run SQL queries using the sql() method on your session context object:

df = ctx.sql("select number, number + 1, number::String as number_p_1 from numbers(8)")

The resulting DataFrame can be converted to PyArrow or Pandas format using the to_py_arrow() or to_pandas() methods respectively:

df.to_pandas() # Or, df.to_py_arrow()

Feel free to integrate it with your data science workflow.

databend · PyPI

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

Read the two new tutorials added to Transform Data During Load to learn how to perform arithmetic operations during loading and load data into a table with additional columns.
Read Working with Stages to gain a deeper understanding and learn how to manage and use it effectively.
Added functions: date_format, str_to_date and str_to_timestamp.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Add `open-sharing` Binary to Databend Image

Open Sharing is a cheap and secure data sharing protocol for databend query on multi-cloud environments. Databend provides a binary called open-sharing, which is a tenant-level sharing endpoint. You can read databend | sharing-endpoint - README.md to learn more information.

To facilitate the deployment of open-sharing endpoint instances using K8s or Docker, it is recommended to add it to Databend's docker image.

Issue #11182 | Feature: added open-sharing binary in the databend-query image

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@Mehrbod2002 made their first contribution in #11367. Added validation for max_storage_io_requests.
@DongHaowen made their first contribution in #11362. Specified database in benchmark.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.30-nightly...v1.1.38-nightly

This Week in Databend #92

What's On In Databend

Stay connected with the latest news about Databend.

Using Column Aliases in a Where Clause

A column alias provides a way to create a clean or more descriptive header for a results set.

Databend now supports column aliases in WHERE clause. Please note that if the alias and the column name are the same, the WHERE clause will recognize the alias as the column name:

> select number * 2 as number from numbers(3) where (number + 1) % 3 = 0;

┌────────┐
│ number │
│ UInt64 │
├────────┤
│      4 │
└────────┘

If you are interested in learning more, please check out the resources listed below.

PR #11272 | feat: support alias in where clause

`databend-metactl` Is Included in the Databend Release

databend-metactl is a command-line tool that allows users to manage the Databend Meta Service cluster. It can be used to back up and restore the metadata.

Now, databend-metactl will be released together with Databend and no longer requires building it manually.

If you are interested in learning more, please check out the resources listed below.

Docs | Back Up and Restore Databend Meta Service Cluster

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Troubleshooting `databend-meta` Connection Issues with Invalid Semver Error

When compiling and running Databend, you may encounter connection issues with databend-meta and receive an error message containing Invalid semver. This is due to Databend's protocol compatibility check using semantic versions.

To resolve this issue, you can fetch the latest tags from the official Databend repository using the command git fetch https://github.com/datafuselabs/databend.git --tags. This ensures that your project is using the latest version of databend-meta and passes the version check.

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

Optimizer now supports constant folding. If interested, please check PR #11216.
Learn how to Transform Data During Load through three simple tutorials.
Added bitmap functions: bitmap_count and build_bitmap.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Introduce Timeout Mechanism to Control the Query Optimization Time

Optimizing complex queries with numerous joins can be a time-consuming task. To avoid performing an exhaustive search of the entire query plan space, we can set a timeout for optimization.

The timeout can be based on a logical time, such as the number of applied transform rules, rather than wall time. Once the timeout is reached, a greedy search can be performed instead of generating new transform rules.

Issue #11133 | Introduce timeout mechanism to control the query optimization time

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@JackTan25 made their first contribution in #11290. The PR fixes table options validation.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.23-nightly...v1.1.30-nightly

This Week in Databend #91

What's On In Databend

Stay connected with the latest news about Databend.

New datatype: BITMAP

Databend has added support for the bitmap datatype.

BITMAP is a type of compressed data structure that can be used to efficiently store and manipulate sets of boolean values. It is often used to accelerate count distinct.

> CREATE TABLE IF NOT EXISTS t1(id Int, v Bitmap) Engine = Fuse;
> INSERT INTO t1 (id, v) VALUES(1, to_bitmap('0, 1')),(2, to_bitmap('1, 2')),(3, to_bitmap('3, 4'));
> SELECT id, to_string(v) FROM t1;

┌──────────────────────┐
│   id  │ to_string(v) │
│ Int32 │    String    │
├───────┼──────────────┤
│     1 │ 0,1          │
│     2 │ 1,2          │
│     3 │ 3,4          │
└──────────────────────┘

Our implementation of the BITMAP data type utilizes RoaringTreemap, a compressed bitmap with u64 values. Using this data structure brought us improved performance and decreased memory usage in comparison to alternative bitmap implementations.

If you are interested in learning more, please check out the resources listed below.

Improving Hash Join Performance with New Hash Table Design

We optimized our previous hash table implementation for aggregation functions, but it significantly limited hash join operation performance. To improve hash join performance, we implemented a dedicated hash table optimized for it. We allocated a fixed-size hash table based on the number of rows in the build stage and replaced the value type with a pointer that supports CAS operations, ensuring memory control without the need for Vec growth. The new implementation significantly improved performance. Check out the resources below for more information:

PR #11140 | feat(query): new hash table and parallel finalize for hash join

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Rust Compilation Challenges and Solutions

Compiling a medium to large Rust program is not a breeze due to the accumulation of complex project dependencies and boilerplate code.

To address these challenges, Databend team implemented several measures, including observability tools, configuration adjustments, caching, linker optimization, compile-related profiles, and refactoring.

If you are interested in learning more, please check out the resources listed below.

Optimizing Compilation for Databend

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

Databend will participate in OSPP 2023 projects: OSPP2023 - Databend.
Check out Docs | Developing with Databend using Rust for Rust application development with databend-driver.
Learn to manage and query databases with ease using BendSQL, a powerful command-line tool for Databend. Check out Docs | BendSQL now!
Check out Docs | Loading from a Stage and Docs | Loading from a Bucket to learn more about loading data from stages and object storage buckets.
Introduced table-meta-inspector, a command-line tool for decoding new table metadata in Databend.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Contributors Wanted for Function Development

We are currently working on improving our functions, and we need your help!

We have identified four areas that require attention, and we would be extremely grateful for any assistance that you can provide.

If you are interested in contributing to any of these areas, please refer to the following resources to learn more about how to write scalar and aggregate functions:

We appreciate any help that you can provide, and we look forward to working with you.

Issue #11220 | Tracking: functions

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.14-nightly...v1.1.23-nightly

This Week in Databend #90

What's On In Databend

Stay connected with the latest news about Databend.

Table Meta Optimization

Databend meta files now default to V3, enhancing metadata serialization/deserialization efficiency & reducing storage space consumption. Witness a 20X size decrease & 2X performance boost!

The format begins with a header that stores the version number of the segment, followed by the encoding format (Bincode) and compression method, both of which are stored in an enumerated manner. The length of the serialized data (blocks and summary) is also included in the header. Finally, the data is serialized in sequence before being compressed and stored.

By using Bincode, the new metadata format is able to serialize data in a compact and efficient manner. This can significantly improve reading speed and reduce the size of stored files. Additionally, by using Zstd compression, the format is able to further reduce file size while maintaining fast decompression speeds.

If you are interested in learning more, please check out the resources listed below.

feat(query): table meta optimize

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Custom Linker Script for Static Linking

When it comes to compiling and linking code, static linking can be useful in certain scenarios, such as when deploying to systems with limited dependencies.

@dl00 and @bossmc suggested creating a custom linker script to address some of challenges and unexpected behaviors.

#!/bin/bash

args=()

for arg in "$@"; do
    if [[ $arg = *"Bdynamic"* ]]; then
        args+=() # we do not want this arg
    elif [[ $arg = *"crti.o"* ]]; then
        args+=("$arg" "-Bstatic")
    elif [[ $arg = *"crtn.o"* ]]; then
        args+=("-lgcc" "-lgcc_eh" "-lc" "/usr/local/lib/gcc/${MUSL_TARGET}/9.4.0/crtendS.o" "$arg")
    else
        args+=("$arg")
    fi
done

echo "RUNNING WITH ARGS: ${args[@]}"
/usr/local/bin/${MUSL_TARGET}-g++ "${args[@]}"

The script, written in bash, removes unnecessary arguments and adds necessary ones for static linking. While this solution may not be perfect, it worked and may be useful for others facing similar issues.

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

Deploying Databend on Docker
Added support for table with cluster key definition for REPLACE INTO statement.
Added window function percent_rank.
Added JSON path functions: jsonb_path_query_first,jsonb_path_query,jsonb_path_query_array.
Added SOUNDS LIKE syntax for String comparing.
Split log store and state-machine store in Meta Service.
Lazy materialize according to virtual column _row_id.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Aggregate Indices

While approximate queries in the Data Warehouse can reuse cached data to reduce IO overhead, caches have limitations in terms of capacity and timeliness, and cannot solve the issue of incremental calculation.

To address these challenges, pre-aggregation technology in OLAP is a useful tool for reducing data redundancy. Materialized views are a great way to implement pre-aggregation. Databend plan to introduce aggregate indices as a means of implementing materialized views through secondary indices.

Issue #11183 | Tracking: aggregating index feature

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@neil4dong made their first contribution in #11043. The PR adds SOUNDS LIKE syntax for String comparing.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.1.2-nightly...v1.1.14-nightly

This Week in Databend #89

What's On In Databend

Stay connected with the latest news about Databend.

Announcing Databend v1.1

Databend v1.1.0 was officially released on April 14, 2023! This release marks the first major update of Databend since version 1.0.

The new version has many new features, including:

COPY INTO now has ETL capabilities
Full Support for TPC-DS Queries
REPLACE INTO
Window Functions
Integration with Metabase, Redash and Grafana
Rust driver
AI Functions
AskBend - knowledge base Q&A system

If you are interested in learning more, please check out the resources listed below.

What's Fresh in Databend v1.1 | Blog | Databend

Databend now offers additional configurations for AI-related tasks, allowing you the flexibility to choose from various endpoints and models.

This means that it can be easily integrated with other services including Azure OpenAI API. Additionally, the corresponding model can be flexibly configured as needed to balance performance and cost.

| "query"   | "openai_api_base_url"                      | "https://api.openai.com/v1/"     | ""       |
| "query"   | "openai_api_completion_model"              | "text-embedding-ada-002"         | ""       |
| "query"   | "openai_api_embedding_model"               | "gpt-3.5-turbo"                  | ""       |

If you are interested in learning more, please check out the resources listed below.

PR | add openai api_base_url, completion_model and embedding_model to query config

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

bendsql - Databend Native Command Line Tool

We previously introduced bendsql, which is the Databend native client written in Go. Now it is being rewritten in Rust and is currently in the experimental stage.

The Rust version not only provides support for REST API but also supports Flight SQL.

bendsql> select avg(number) from numbers(10);

SELECT
  avg(number)
FROM
  numbers(10);

┌───────────────────┐
│    avg(number)    │
│ Nullable(Float64) │
├───────────────────┤
│ 4.5               │
└───────────────────┘

1 row in 0.259 sec. Processed 10 rows, 10B (38.59 rows/s, 308B/s)

We are excited about the progress of the Rust version and look forward to sharing more updates with you soon! Feel free to try it out and give us feedback. For more information, follow the resources listed below.

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

Databend leverages a RBAC model to secure your data. Read Docs | Access Control Privileges to learn about Global privileges and Object-specific privileges in Databend.
Added the array_aggregate function and support more array aggregate functions, including stddev_samp, stddev_pop, stddev, std, median, approx_count_distinct, kurtosis, and skewness.
Added the system table system.caches to describe the cache status in Databend.
Databend recently introduced z3.rs to solve constraints. Z3 is a theorem prover from Microsoft Research that is usually used to solve SMT problems.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Release proposal: Nightly v1.2

Databend v1.2 is scheduled for release on May 15th and includes the following features:

Task	Status	Comments
(Query) JSON indexing#6994	IN PROGRESS	high-priority
(Query) Fulltext index#3915	IN PROGRESS	high-priority
(Query+Storage) Merge Into#10174	PLAN	high-priority
(Query)Window function#10810	IN PROGRESS
(Query)Distributed COPY#8594	PLAN
(Query)Lambda function and high-order functions	PLAN
(Storage) Fuse engine re-clustering	PLAN	high-priority
(Storage) Fuse engine segment tree	IN PROGRESS	high-priority
(Storage) Fuse engine orphan data cleanup	IN PROGRESS	high-priority
(Storage) Compress Fuse meta data#10265	IN PROGRESS	high-priority
Embedding Model #10689	IN PROGRESS	high-priority
Search optimization SQL framework	IN PROGRESS
Distributed group by with dictionary	IN PROGRESS

Please note that this is a release plan and may be subject to change. We encourage community members to participate in the discussion and provide feedback on the plan.

Issue #11073 | Release proposal: Nightly v1.2

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.0.57-nightly...v1.1.2-nightly

This Week in Databend #88

What's On In Databend

Stay connected with the latest news about Databend.

Support Eager Aggregation

Eager aggregation helps improve the performance of queries that involve grouping and joining data. It works by partially pushing a groupby past a join, which reduces the number of input rows to the join and may result in a better overall plan.

Databend recently added support for Eager aggregation. Here is an example of how it works.

Input:
                expression
                    |
         aggregate: SUM(x), SUM(y)
                    |
                   join
                    | \
                    | (y)
                    |
                   (x)
(1) Eager Groupby-Count:
                expression
                    |
     final aggregate: SUM(eager SUM(x)), SUM(y * cnt)
                    |
                    join
                    | \
                    | (y)
                    |
                    eager group-by: eager SUM(x), eager count: cnt
(2) Eager Split:
                expression
                    |
     final aggregate: SUM(eager SUM(x) * cnt2), SUM(eager SUM(y) * cnt1)
                    |
                    join
                    | \
                    | eager group-by: eager SUM(y), eager count: cnt2
                    |
                    eager group-by: eager SUM(x), eager count: cnt1

If you are interested in learning more, please check out the resources listed below.

Support All TPC-DS Queries

Databend now supports all TPC-DS queries!

TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general-purpose decision support system.

If you are interested in learning more, please check out the resources listed below.

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

databend-driver - A driver for Databend in Rust

The Databend community has crafted a Rust driver that allows developers to connect to Databend and execute SQL queries in Rust.

Here's an example of how to use the driver:

use databend_driver::new_connection;

let dsn = "databend://root:@localhost:8000/default?sslmode=disable";
let conn = new_connection(dsn).unwrap();

let sql_create = "CREATE TABLE books (
    title VARCHAR,
    author VARCHAR,
    date Date
);";
conn.exec(sql_create).await.unwrap();
let sql_insert = "INSERT INTO books VALUES ('The Little Prince', 'Antoine de Saint-Exupéry', '1943-04-06');";
conn.exec(sql_insert).await.unwrap();

Feel free to try it out and give us feedback. For more information, follow the resources listed below.

AskBend - SQL-based Knowledge Base Search and Completion

AskBend is a Rust project that utilizes the power of Databend and OpenAI to create a SQL-based knowledge base from Markdown files.

With AskBend, you can easily search and retrieve the most relevant information to your queries using SQL. The project automatically generates document embeddings from the content, enabling you to quickly find the information you need.

How it works:

Read and parse Markdown files from a directory.
Store the content in the askbend.doc table.
Compute embeddings for the content using Databend Cloud's built-in AI capabilities.
When a users asks a question, generate the embedding using Databend Cloud's SQL-based ai_embedding_vector function.
Find the most relevant doc.content using Databend Cloud's SQL-based cosine_distance function.
Use OpenAI's completion capabilities with Databend Cloud's SQL-based ai_text_completion function.
Output the completion result in Markdown format.

If you want to learn more about AskBend or try out the existing live demo, you can refer to the resources listed below:

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

New Aggregation Functions Added: QUANTILE_DISC, KURTOSIS, SKEWNESS
Learn everything about AI functions in Databend: Docs - AI Functions

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Add Nullable Table Schema Tests to Databend

Currently, Databend table schema is not nullable by default. So almost all of tests table schemas are not nullable, we need to add some tests which table schemas are nullable to cover.

To achieve this goal, we need to add some new test cases in Databend. These test cases should include nullable table schemas to ensure that Databend can handle these cases correctly.

Issue #10969 | test: add some tests which table schemas are nullable

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@Dousir9 made their first contribution in #10884. The PR fixes the wrong cardinality estimation when the aggregation function's argument has multiple columns.
@YimingQiao made their first contribution in #10906. The PR adds function summarization of KURTOSIS and SKEWNESS and reorders the functions to make it consistent with the function order in the navigation bar.
@jsoref made their first contribution in #10914. The PR helps improve the quality of the code and documentation by fixing spelling errors.
@leiwenfang made their first contribution in #10917. The PR beautifies the covers of blog.
@ArberSephirotheca made their first contribution in #10949. The PR adds a new function called to_unix_timestamp() which converts Databend timestamp to Unix timestamp.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.0.46-nightly...v1.0.57-nightly

This Week in Databend #87

What's On In Databend

Stay connected with the latest news about Databend.

Window Function

Aggregate window functions can operate on a group of rows and return a single value for each row in the underlying query. The OVER clause specifies how to partition the rows in the result set. When used with GROUP BY, aggregate window functions do not collapse rows but instead return all rows in the result set.

-- use aggregate window function
SELECT date, AVG(amount) over (partition by date)
FROM BookSold

June 21|544.0
June 21|544.0
June 22|454.5
June 22|454.5
June 23|643.0
June 23|643.0

Databend supports all aggregate functions as aggregate window functions.

Suggest Function Name on Typos

Databend has added an intelligent feature that automatically provides the closest matching item when you enter an incorrect function name.

#> select base64(1);
ERROR 1105 (HY000) at line 1: Code: 1008, displayText = error:
  --> SQL:1:8
  |
1 | select base64(1)
  |        ^^^^^^^^^ no function matches the given name: 'base64', do you mean 'to_base64'?

PR | feat(planner): suggest function name on typo

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Dump Running Async Task Stack

Databend now supports dumping the running async task stacks. Simply visit http://<admin_api_address>/debug/async_tasks/dump to capture it in your browser.

Dump Running Async Task Stack

Calling the async_backtrace::taskdump_tree function can obtain information about the asynchronous task tree (collected by #[async_backtrace::framed]).

    let tree =
        async_backtrace::taskdump_tree(req.map(|x| x.wait_for_running_tasks).unwrap_or(false));

Tasks are divided into regular tasks and polling tasks (marked with [POLLING]). Record the stack information for each task, and output it to a string sorted by stack depth.

    for mut tasks in [tasks, polling_tasks] {
        tasks.sort_by(|l, r| Ord::cmp(&l.stack_frames.len(), &r.stack_frames.len()));

        for item in tasks.into_iter().rev() {
            for frame in item.stack_frames {
                writeln!(output, "{}", frame).unwrap();
            }

            writeln!(output).unwrap();
        }
    }

To learn more about how it works, refer to the following link:

PR | feat(query): try support dump running async task stack

New Way to Integrate with Jupyter Notebook

As described in Doc | Visualization Databend Data in Jupyter Notebook, we can use Jupyter Notebook to explore data in Databend.

However, through the trick of magic, ipython-sql provides another interesting way to make SQL queries look like they are running in SQL cells and maintain seamless integration with Python.

Install Dependencies

pip install ipython-sql databend-sqlalchemy

Work with Jupyter Notebook

In [1]: %load_ext sql

In [2]: %%sql databend://{username}:{password}@{host_port_name}/{database_name}?secure=false
   ...: SHOW DATABASES;

In [3]: result = %%sql SELECT * FROM numbers(100);

In [4]: %matplotlib inline
   ...: df = result.DataFrame()
   ...: df.plot()

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

Connecting to Databend with DBeaver, either by selecting the pre-configured MySQL driver or adding the Databend JDBC driver.
Databend provides integration with Redash to help you gain better insights into your data. Doc | Integrations - Redash
Master how to display information of columns in a given table. Doc | SHOW COLUMNS
Databend now supports generate_series and range table functions.
Databend now supports the ai_embedding_vector function, which returns a 1536-dimensional f32 vectors generated by the OpenAI Embeddings API.
Databend added support for [CREATE | DROP | SHOW] SHARE ENDPOINT DDL.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Collect Metrics from Sled

sled is an embedded database. It has a metrics feature that exposed some metrics.

Databend Meta Service uses sled as the underlying Key-Value storage. We expect to obtain more metrics about sled in order to further improve observability and help with optimization.

Issue #7233 | make use of sled metrics feature for collect sled metrics

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@rkmdCodes made their first contribution in #10269. This pull request improves the error string and makes the prompt clearer.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.0.33-nightly...v1.0.46-nightly

This Week in Databend #86

What's On In Databend

Stay connected with the latest news about Databend.

FlightSQL Handler in Progress

Flight SQL is an innovative open database protocol that caters to modern architectures. It boasts a columnar-oriented design and provides seamless support for parallel processing of data partitions.

The benefits of supporting FlightSQL include reducing serialization and deserialization during query execution, as well as easily supporting SDKs in different languages using predefined *.proto files.

We're currently engaged in developing support for the FlightSQL Handler. If you're interested, refer to the following links:

Natural Language to SQL

By integrating with the popular AI services, Databend now provide you an efficient built-in solution - the AI_TO_SQL function.

With this function, instructions written in natural language can be converted into SQL query statements aligned with table schema. With just a few modifications (or possibly none at all), it can be put into production.

SELECT * FROM ai_to_sql(
    'List the total amount spent by users from the USA who are older than 30 years, grouped by their names, along with the number of orders they made in 2022',
    '<openai-api-key>');
*************************** 1. row ***************************
     database: openai
generated_sql: SELECT name, SUM(price) AS total_spent, COUNT(order_id) AS total_orders
               FROM users
                        JOIN orders ON users.id = orders.user_id
               WHERE country = 'USA' AND age > 30 AND order_date BETWEEN '2022-01-01' AND '2022-12-31'
               GROUP BY name;

The function is now available on both Databend and Databend Cloud. To learn more about how it works, refer to the following links:

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Vector Similarity Calculation in Databend

Databend has added a new function called cosine_distance. This function accepts two input vectors, from and to, which are represented as slices of f32 values.

select cosine_distance([3.0, 45.0, 7.0, 2.0, 5.0, 20.0, 13.0, 12.0], [2.0, 54.0, 13.0, 15.0, 22.0, 34.0, 50.0, 1.0]) as sim
----
0.1264193

The Rust implementation efficiently performs calculations by utilizing the ArrayView type from the ndarray crate.

pub fn cosine_distance(from: &[f32], to: &[f32]) -> Result<f32> {
    if from.len() != to.len() {
        return Err(ErrorCode::InvalidArgument(format!(
            "Vector length not equal: {:} != {:}",
            from.len(),
            to.len(),
        )));
    }

    let a = ArrayView::from(from);
    let b = ArrayView::from(to);
    let aa_sum = (&a * &a).sum();
    let bb_sum = (&b * &b).sum();

    Ok((&a * &b).sum() / ((aa_sum).sqrt() * (bb_sum).sqrt()))
}

Do you remember how to register scalar functions in Databend? You can check Doc | How to Write a Scalar Function and PR | #10737 to verify your answer.

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

Learn how to monitor Databend using Prometheus and Grafana: Doc | Monitor - Prometheus & Grafana
Metabase Databend Driver helps you connect Databend to Metabase and dashboard your data: Doc | Integrations - Metabase
Databend now supports PIVOT, UNPIVOT, GROUP BY CUBE and GROUP BY ROLLUP query syntax. For more information, please see PR #10676 and #10601.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Enable `-Zgitoxide` to Speed up Git Dependencies Download

Enabling -Zgitoxide can speed up the download of our Git dependencies significantly, which is much faster than using Git only.

This feature integrates cargo with gitoxide, a pure Rust implementation of Git that is idiomatic, lean, fast, and safe.

Issue #10466 | CI: Enable -Zgitoxide to speed our git deps download speed

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@SkyFan2002 made their first contribution in #10656. This pull request aimed to resolve inconsistent results caused by variations in column name case while executing SQL statements with EXCLUDE.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.0.22-nightly...v1.0.33-nightly

This Week in Databend #85

📢 Read our blog Way to Go: OpenDAL successfully entered Apache Incubator to learn about the story of OpenDAL.

What's On In Databend

Stay connected with the latest news about Databend.

Data Type: MAP

The MAP data structure holds Key:Value pairs using a nested Array(Tuple(key, value)) and is useful when the data type is constant but the Key's value cannot be fully determined. The Key must be of a specified basic data type and duplicates are not allowed, while the Value can be any data type including nested arrays or tuples. A bloom filter index is created in Map makes it easier and faster to search for values in MAP.

select * from nginx_log where log['ip'] = '205.91.162.148';
+----+----------------------------------------+
| id | log                                    |
+----+----------------------------------------+
| 1  | {'ip':'205.91.162.148','url':'test-1'} |
+----+----------------------------------------+
1 row in set

If you want to learn more about the Map data type, please read the following materials:

Docs | Data Types - Map

Data Transformation During Loading Process

Do you remember the two RFCs mentioned last week? Now, Databend has added support for data transformation during the loading process into tables. Basic transformation operations can be achieved by using the COPY INTO <table> command.

CREATE TABLE my_table(id int, name string, time date);

COPY INTO my_table
FROM (SELECT t.id, t.name, to_date(t.timestamp) FROM @mystage t)
FILE_FORMAT = (type = parquet) PATTERN='.*parquet';

This feature avoids storing pre-transformed data in temporary tables and supports column reordering, column omission, and type conversion operation. In addition, partial data can be loaded from staged Parquet files or their columns can be rearranged. This feature simplifies and streamlines ETL processes, allowing uses to give more attentions on the data analysis without considering how to move their data.

If you're interested, check the following documentation:

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Run Multiple Futures Parallel

Are you interested in how to run futures in parallel? It is worth mentioning that Databend has greatly improved the scanning performance in situations with a huge number of files by utilizing this technique.

The following code, which is less than 30 lines long, will introduce you to how it all works.

/// Run multiple futures parallel
/// using a semaphore to limit the parallelism number, and a specified thread pool to run the futures.
/// It waits for all futures to complete and returns their results.
pub async fn execute_futures_in_parallel<Fut>(
    futures: impl IntoIterator<Item = Fut>,
    thread_nums: usize,
    permit_nums: usize,
    thread_name: String,
) -> Result<Vec<Fut::Output>>
where
    Fut: Future + Send + 'static,
    Fut::Output: Send + 'static,
{
    // 1. build the runtime.
    let semaphore = Semaphore::new(permit_nums);
    let runtime = Arc::new(Runtime::with_worker_threads(
        thread_nums,
        Some(thread_name),
    )?);

    // 2. spawn all the tasks to the runtime with semaphore.
    let join_handlers = runtime.try_spawn_batch(semaphore, futures).await?;

    // 3. get all the result.
    future::try_join_all(join_handlers)
        .await
        .map_err(|e| ErrorCode::Internal(format!("try join all futures failure, {}", e)))
}

If you are interested in this Rust trick, you can read this PR: feat: improve the parquet get splits to parallel.

How to Create a System Table

System tables are tables that provide information about Databend's internal state, such as databases, tables, functions, and settings.

If you're interested in creating a system table, check out our recently released documentation which introduces the implementation, registration, and testing of system tables, using the system.credits table as an example.

Here is a code snippet:

let table_info = TableInfo {
    desc: "'system'.'credits'".to_string(),
    name: "credits".to_string(),
    ident: TableIdent::new(table_id, 0),
    meta: TableMeta {
        schema,
        engine: "SystemCredits".to_string(),
        ..Default::default()
    },
..Default::default()
};

Docs | How to Create a System Table

Highlights

Here are some noteworthy items recorded here, perhaps you can find something that interests you.

MindsDB's machine learning capability has now been integrated into Databend: Bringing in-database ML to Databend
Are you interested in using WebHDFS as the storage for Databend? This blog post may be helpful for you. How to Configure WebHDFS as a Storage Backend for Databend

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Support Quantile with A List

After the merge of PR #10474, Databend began to support quantile aggregation functions, but currently only supports setting a single floating-point value as the level. If it could also support passing in a list, it may help simplify SQL writing in some scenarios.

SELECT QUANTILE([0.25, 0.5, 0.75])(number) FROM numbers(25);
+-------------------------------------+
| quantile([0.25, 0.5, 0.75])(number) |
+-------------------------------------+
|            [6, 12, 18]              |
+-------------------------------------+

Feature: quantile support list and add functions kurtosis() and skewness()

Additionally, the kurtosis(x) and skewness(x) mentioned in this issue seem a good starting point for contributing to Databend.

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.0.11-nightly...v1.0.21-nightly

This Week in Databend #84

📢 We announced the milestone release of Databend 1.0 this week. For what's new in the release, please read this post: https://www.databend.com/blog/databend-release-v1.0 .

What's On In Databend

Stay connected with the latest news about Databend.

SQL: `REPLACE INTO`

The REPLACE INTO statement is a powerful way to insert or update data in Databend. It allows you to specify a conflict key that determines whether a new row should be inserted or an existing row should be updated.

If a row with the same conflict key already exists in the table, Databend will update the row with the new data. Otherwise, the new data will be added to the table as a new row. You can use this statement to easily sync data from different sources or handle duplicate records.

#> CREATE TABLE employees(id INT, name VARCHAR, salary INT);
#> REPLACE INTO employees (id, name, salary) ON (id) VALUES (1, 'John Doe', 50000);
#> SELECT  * FROM Employees;
+------+----------+--------+
| id   | name     | salary |
+------+----------+--------+
|    1 | John Doe |  50000 |
+------+----------+--------+

If you want to learn more details about REPLACE INTO statement, please read the materials listed below.

RFC: Add Incremental Update for Copy Into

Databend is currently capable of transforming and inserting data from a stage into a table. For example, you can run a SQL statement like this:

insert into table1 from (select c1， c2 from @stage1/path/to/dir);

The COPY INTO command needs a similar feature as well to support incremental data loading from a stage.

If you're interested, check the following RFCs:

Code Corner

Discover some fascinating code snippets or projects that showcase our work or learning journey.

Rust Toolchain: 🦀 v1.70.0-nightly (03-10)

Databend has upgraded its Rust toolchain to nightly-2023-03-10. Please remember to run cargo clean after pulling the latest code.

We also fixed some clippy warnings. We now use #[default] to annotate the default value of an enum, as shown below.

#[derive(Debug, Default)]
pub enum RecordDelimiter {
    #[default]
    Crlf,
    Any(u8),
}

If you are interested in this Rust trick, you can read this RFC: derive_default_enum.

Announcing OpenDAL v0.30

OpenDAL brings some important changes in its newly released 0.30 version, including removing Object Abstraction and bindings for JavaScript and Python.

Removing Object Abstraction

The term Object was applied in various fields, so it used to be difficult to provide a concise definition of opendal::Object. We eliminated the intermediate API layer of Object so users can now directly utilize the Operator, for example:

# get metadata of a path
- op.object("path").stat().await;
+ op.stat("path").await;

Bindings for JavaScript and Python

We have released OpenDAL's JavaScript and Python bindings, thanks to @suyanhanx, @messense, @Xuanwo, and others who were involved, Now, users of these languages can access data from various services using OpenDAL. This is especially useful for data scientists and data analysts who only require OpenDAL and do not need any other SDKs.

Here is a Python example:

>>> import opendal
>>> op = opendal.Operator('memory')
>>> op.write("test", b"Hello, World!")
>>> op.read("test")
b'Hello, World!'

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Support More Mathematical Operators like PostgreSQL

More mathematical operators in SQL expressions are required to perform complex calculations instead of writing them in another language or using external tools.

Operator	Description	Example	Result
^	exponentiation	2.0 ^ 3.0	8
\|/	square root	\|/ 25.0	5
\|\|/	cube root	\|\|/ 27.0	3
!	factorial	5 !	120
@	absolute value	@ -5.0	5
&	bitwise AND	91 & 15	11
\|	bitwise OR	32 \| 3	35
#	bitwise XOR	17 # 5	20
~	bitwise NOT	~1	-2
<<	bitwise shift left	1 << 4	16
>>	bitwise shift right	8 >> 2	2

@jun0315 is working on it and has made some great progress. If you are interested in this feature, feel free to drop in and contribute.

Issue 10233: Feature: support more pg functions

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@jun0315 made their first contribution in #10347, adding the caret function that performs exponentiation. For example, 2^3 means 2 raised to the power of 3, which equals 8.
@quaxquax made their first contribution in #10465, fixing a typo, thanks!

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v1.0.1-nightly...v1.0.10-nightly

This Week in Databend #83

📢 Databend's official release of version 1.0 is just around the corner, and a preview version, v1.0.0-nightly, is already available as a candidate. Check it out at https://github.com/datafuselabs/databend/releases/tag/v1.0.0-nightly

What's On In Databend

Stay connected with the latest news about Databend.

Support for WebHDFS

HDFS is a popular distributed file system for managing big data and is one of the storage backends supported by Databend. However, incorporating HDFS into Databend required a Java environment and specific jars, which were not suitable in some circumstances.

Databend now offers support for WebHDFS as a storage backend, allowing users to interact with existing HDFS storage without requiring Java environment. WebHDFS is a REST API that exposes HDFS operations through HTTP, making it more accessible to users.

#> CREATE STAGE IF NOT EXISTS whdfs URL='webhdfs://127.0.0.1:9870/data-files/' CONNECTION=(HTTPS='false');
Query OK, 0 rows affected (0.020 sec)

#> CREATE TABLE IF NOT EXISTS books (     title VARCHAR,     author VARCHAR,     date VARCHAR );
Query OK, 0 rows affected (0.030 sec)

#> COPY INTO books FROM @whdfs FILES=('books.csv') file_format=(type=CSV field_delimiter=','  record_delimiter='\n' skip_header=0);
Query OK, 2 rows affected (0.615 sec)

#> SELECT * FROM books;
+------------------------------+---------------------+------+
| title                        | author              | date |
+------------------------------+---------------------+------+
| Transaction Processing       | Jim Gray            | 1992 |
| Readings in Database Systems | Michael Stonebraker | 2004 |
+------------------------------+---------------------+------+
2 rows in set (0.044 sec)

Check out these pull requests if you're interested in how it works and improving your analytical experience:

Support for Aggregation Spilling to Object Storage

One of the challenges of processing large data volumes is how to perform GroupBy and OrderBy operations efficiently and reliably. To address this challenge, the Databend community is working on a new feature that involves spilling intermediate results to cloud-based object storage like Amazon S3.

This feature will enable Databend to handle GroupBy and OrderBy queries with unlimited data size, without running out of memory or compromising performance. If you want to learn more about how this feature works and how it can benefit your analytics needs, please check out this pull request:

PR | feat(query): support aggregate spill to object storage

Decimal Data Types

Databend supports various data types for analytics. One of the data types that Databend has recently improved is decimal, which can store exact numeric values with a fixed precision and scale. Decimal types are useful for applications that require high accuracy, such as financial calculations.

-- Create a table with decimal data type.
create table decimal(value decimal(36, 18));

-- Insert two values.
insert into decimal values(0.152587668674722117), (0.017820781941443176);

select * from decimal;
+----------------------+
| value                |
+----------------------+
| 0.152587668674722117 |
| 0.017820781941443176 |
+----------------------+

If you're interested in how Databend handles decimal types efficiently and accurately, please read on:

Docs | Data Types - Decimal.

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Initialize `regex` at Compile Time

If you read the code in federated_helper.rs, you may notice that numerous regular expressions are employed to match queries.

use regex::bytes::RegexSet;

let regex_set = RegexSet::new(regex_rules).unwrap();
let matches = regex_set.matches(query.as_ref());

This can be optimized by initializing the regex at compile time.

Issue 10286: Feature: make regexp initialized at compile time

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

New Contributors

We always open arms to everyone and can't wait to see how you'll help our community grow and thrive.

@xinlifoobar made their first contribution in #10164
@wangjili8417 made their first contribution in #10255
@dusx1981 made their first contribution in #10024

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Full Changelog: https://github.com/datafuselabs/databend/compare/v0.9.48-nightly...v1.0.0-nightly

This Week in Databend #82

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

AST

select from stage support uri with connection options (#10066)

Catalog

Iceberg/create-catalog (#9017)

Expression

type decimal support agg func min/max (#10085)
add sum/avg for decimal types (#10059)

Pipeline

enrich core pipelines processors (#10098)

Query

create stage, select stage, copy, infer_schema support named file format (#10084)
query result cache (#10042)

Storage

table data cache (#9772)
use drop_table_by_id api in drop all (#10054)
native storage format support nested data types (#9798)

Code Refactoring 🎉

Meta

add compatible layer for upgrade (#10082)
More elegant error handling (#10112, #10114, etc.)

Cluster

support exchange sorting (#10149)

Executor

add check processor graph completed (#10166)

Planner

apply constant folder at physical plan builder (#9889)

Query

use accumulating to impl single state aggregator (#10125)

Storage

adopt OpenDAL's batch delete support (#10150)
adopt OpenDAL query based metadata cache (#10162)

Build/Testing/CI Infra Changes 🔌

release deb repository (#10080)
release with systemd units (#10145)

Bug Fixes 🔧

Expression

no longer return Variant as common super type (#9961)
allow auto cast from string and variant (#10111)

Cluster

fix limit query hang in cluster mode (#10006)

Storage

wrong column statistics when contain tuple type (#10068)
compact not work as expected with add column (#10070)
fix add column min/max stat bug (#10137)

What's On In Databend

Stay connected with the latest news about Databend.

Query Result Cache

In the past week, Databend now supports caching of query results!

             ┌─────────┐ 1  ┌─────────┐ 1
             │         ├───►│         ├───►Dummy───►Downstream
Upstream────►│Duplicate│ 2  │         │ 3
             │         ├───►│         ├───►Dummy───►Downstream
             └─────────┘    │         │
                            │ Shuffle │
             ┌─────────┐ 3  │         │ 2  ┌─────────┐
             │         ├───►│         ├───►│  Write  │
Upstream────►│Duplicate│ 4  │         │ 4  │ Result  │
             │         ├───►│         ├───►│  Cache  │
             └─────────┘    └─────────┘    └─────────┘

Learn More

Table Data Cache

Databend now supports table data cache:

disk cache: raw column(compressed) data of the data block.
in-memory cache(experimental): deserialized column objects of a data block.

For cache-friendly workloads, the performance gains are significant.

Learn More

Deb Source & Systemd Support

Databend now offers the official Deb package source and supports the use of systemd to manage the service.

For DEB822 Source Format:

sudo curl -L -o /etc/apt/sources.list.d/datafuselabs.sources https://repo.databend.rs/deb/datafuselabs.sources
sudo apt update
sudo apt install databend
sudo systemctl start databend-meta
sudo systemctl start databend-query

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Service Activation Progress Report

When starting a Query/Meta node, it is necessary to perform checks and output them explicitly to help the user diagnose faults and confirm status.

Example:

storage check succeed
meta check failed: timeout, no response. endpoints: xxxxxxxx .
status check failed: address already in use.

Issue 10193: Feature: output the necessary progress when starting a query/meta node

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	ariesdevil	b41sh	Big-Wuu	BohuTANG	cameronbraid


Chasen-Zhang	ClSlaid	dantengsky	drmingdrmer	everpcpc	johnhaxx7


lichuang	mergify[bot]	PsiACE	RinChanNOWWW	soyeric128	sundy-li


suyanhanx	TCeason	Xuanwo	xudong963	youngsofun	zhang2014


zhyass

This Week in Databend #81

What's New

Check out what we've done this week to make Databend even better for you.

Accepted RFCs 🛫

rfc: query result cache (#10014)

Features & Improvements ✨

Planner

support EXPLAIN ANALYZE statement to profile query execution (#10023)
derive new filter and push down (#10021)

Query

alter table add/drop column SQL support (#9851)
new table function infer_schema (#9936)
add privilege check for select (#9924)
improve signed numeric keys (#9978)
support to parse jwt metadata and add multiple identity issuer configuration (#9971)
support create file format (#10009)

Storage

adopt OpenDAL's native scan support (#9985)
add drop_table_by_id api (#9990)

Expression

add operation for decimal (#9926)

Functions

support array_any function (#9953)
support array_sort (#9941)

Sqllogictest

add time travel test for alter table (#9939)

Code Refactoring 🎉

Meta

move application level types such as user/role/storage-config to crate common-meta/app (#9944)
fix abuse of ErrorCode (#10056)

Query

use transform_sort_merge use heap to sort blocks (#10047)

Storage

introduction of FieldIndex and ColumnId types for clear differentiation of use (#10017)

Build/Testing/CI Infra Changes 🔌

run benchmark for clickbench result format (#10019)
run benchmark both s3 & fs (#10050)

Bug Fixes 🔧

Privilege

add privileges on system.one to PUBLIC by default (#10040)

Catalog

parts was not distributed evenly (#9951)

Planner

type assertion failed on subquery (#9937)
enable outer join to inner join optimization (#9943)
fix limit pushdown outer join (#10043)

Query

fix add column update bug (#10037)

Storage

fix sub-column of added-tuple column return default 0 bug (#9973)
new bloom filter that bind index with Column Id instead of column name (#10022)

What's On In Databend

Stay connected with the latest news about Databend.

RFC: Query Result Cache

Caching the results of queries against data that doesn't update frequently can greatly reduce response time. Once cached, the result will be returned in a much shorter time if you run the query again.

How to Write a Scalar / Aggregate Function

Did you know that you can enhance the power of Databend by creating your own scalar or aggregate functions? Fortunately, it's not a difficult task!

The following guides are intended for Rust developers and Databend users who want to create their own workflows. The guides provide step-by-step instructions on how to create and register your own functions using Rust, along with code snippets and examples of various types of functions to walk you through the process.

Profile-Guided Optimization

Profile-guided optimization (PGO) is a compiler optimization technique that collects execution data during the program runtime and allows for tailoring optimizations tailored to both hot and cold code paths.

In this blog, we'll guide you through the process of optimizing Databend binary builds using PGO. We'll use Databend's SQL logic tests as an example to illustrate the step-by-step procedure.

Please note that PGO always requires generating perf data using workloads that are statistically representative. However, there's no guarantee that performance will always improve. Decide whether to use it based on your actual needs.

Learn More

Blog | Optimizing Databend Binary Builds with Profile-guided Optimization

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

To make our documentation clearer and easier to understand, we plan to restructure our function-related documentation to follow the same format as DuckDB's documentation. This involves breaking down the task into smaller sub-tasks based on function categories, so that anyone who wants to help improve Databend's documentation can easily get involved.

Issue 10029: Tracking: re-org the functions doc

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	b41sh	Big-Wuu	BohuTANG	dantengsky	dependabot[bot]


drmingdrmer	everpcpc	flaneur2020	johnhaxx7	leiysky	lichuang


mergify[bot]	PsiACE	RinChanNOWWW	soyeric128	sundy-li	TCeason


wubx	Xuanwo	xudong963	xxchan	youngsofun	yufan022


zhang2014	ZhiHanZ	zhyass

This Week in Databend #80

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Meta

add databend-meta config grpc_api_advertise_host (#9835)

AST

select from stage with files/pattern (#9877)
parse decimal type (#9894)

Expression

add Decimal128 and Decimal256 type (#9856)

Functions

support array_indexof (#9840)
support array function array_unique, array_distinct (#9875)
support array aggregate functions (#9903)

Query

add column id in TableSchema; use column id instead of index when read and write data (#9623)
support view in system.columns (#9853)

Storage

ParquetTable support topk optimization (#9824)

Sqllogictest

leverage sqllogictest to benchmark tpch (#9887)

Code Refactoring 🎉

Meta

remove obsolete meta service api read_msg() and write_msg() (#9891)
simplify UserAPI and RoleAPI by introducing a method update_xx_with(id, f: FnOnce) (#9921)

Cluster

split exchange source to reader and deserializer (#9805)
split and eliminate the status for exchange transform and sink (#9910)

Functions

rename some array functions add array_ prefix (#9886)

Query

TableArgs preserve info of positioned and named args (#9917)

Storage

ParquetTable list file in read_partition (#9871)

Build/Testing/CI Infra Changes 🔌

support for running benchmark on PRs (#9788)

Bug Fixes 🔧

Functions

fix nullable and or domain cal (#9928)

Planner

fix slow planner when ndv error backtrace (#9876)
fix order by contains aggregation function (#9879)
prevent panic when delete with subquery (#9902)

Query

fix insert default value datatype (#9816)

What's On In Databend

Stay connected with the latest news about Databend.

Why You Should Try Sccache

Sccache is a ccache-like project started by the Mozilla team, supporting C/CPP, Rust and other languages, and storing caches locally or in a cloud storage backend. The community first added native support for the GitHub Action Cache Service to Sccache in version 0.3.3, then improved the functionality in v0.4.0-pre.6 so that the production CI can now use it.

Now, opendal, open-sourced by Datafuse Labs, acts as a storage access layer for sccache to interface with various storage services (s3/gcs/azlob etc.).

Learn More

Blog | Why You Should Try Sccache

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Try using `build-info`

To get information about git commits, build options and credits, we now use vergen and cargo-license.

build-info can collect build-information of your Rust crate. It might be possible to use it to refactor the relevant logic in common-building.

pub struct BuildInfo {
    pub timestamp: DateTime<Utc>,
    pub profile: String,
    pub optimization_level: OptimizationLevel,
    pub crate_info: CrateInfo,
    pub compiler: CompilerInfo,
    pub version_control: Option<VersionControl>,
}

Issue 9874: Refactor: Try using build-info

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	ariesdevil	b41sh	BohuTANG	dependabot[bot]	drmingdrmer


everpcpc	flaneur2020	johnhaxx7	leiysky	lichuang	mergify[bot]


PsiACE	RinChanNOWWW	soyeric128	sundy-li	TCeason	Xuanwo


xudong963	youngsofun	zhang2014

This Week in Databend #79

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

AST

add syntax about parsing presign options with content type (#9771)

Format

add TSV file format back (#9732)

Functions

support array functions prepend and append (#9844)
support array concat (#9804)

Query

add topn runtime filter in native storage format (#9738)
enable hashtable state pass from partial to final (#9809)

Storage

add pruning stats to EXPLAIN (#9724)
cache bloom index object (#9712)

Code Refactoring 🎉

'select from stage' use ParquetTable (#9801)

Meta

expose a single "kvapi" as public interface (#9791)
do not remove the last node from a cluster (#9781)

AST/Expression/Planner

unify Span and Result (#9713)

Executor

merge simple pipe and resize pipe (#9782)

Bug Fixes 🔧

Base

fix not linux and macos jemalloc fallback to std (#9786)

Config

fix table_meta_cache can't be disabled (#9767)

Meta

when import data to meta-service dir, the specified "id" has to be one of the "initial_cluster" (#9755)

Query

fix and refactor aggregator (#9748)
fix memory leak for data port (#9762)
fix panic when cast jsonb to string (#9813)

Storage

fix up max_file_size may oom (#9740)

What's On In Databend

Stay connected with the latest news about Databend.

DML Command - UPDATE

Modifies rows in a table with new values.

Note: Databend guarantees data integrity. In Databend, Insert, Update, and Delete operations are guaranteed to be atomic, which means that all data in the operation must succeed or all must fail.

Syntax

UPDATE <table_name>
SET <col_name> = <value> [ , <col_name> = <value> , ... ]
    [ FROM <table_name> ]
    [ WHERE <condition> ]

Learn More

Docs | DML Command - UPDATE

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Support Arrow Flight SQL Protocol

Currently Databend supports the MySQL protocol, and it would be great if Databend could support the Arrow Flight SQL protocol as well.

Typically a lakehouse stores data in parquet files using the MySQL protocol while Databend has to do deserialization from parquet to arrow and then back to MySQL data types. Again on the caller end users use data frames or MySQL result iterators, which also requires serialization of types. With Arrow Flight SQL all of these back and forth serialization costs can be avoided.

Issue 9832: Feature: Support Arrow Flight SQL protocol

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	ariesdevil	b41sh	BohuTANG	dantengsky	dependabot[bot]


drmingdrmer	everpcpc	flaneur2020	johnhaxx7	leiysky	mergify[bot]


PsiACE	RinChanNOWWW	soyeric128	sundy-li	TCeason	Xuanwo


youngsofun	yufan022	zhang2014

This Week in Databend #78

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

SQL

eliminate extra group by scalars (#9708)

Query

add privilege check for insert/delete/optimize (#9664)
enable empty projection (#9675)
add aggregate limit in final aggregate stage (#9716)
add optional column names to create/alter view statement (#9715)

Storage

add prewhere support in native storage format (#9600)

Code Refactoring 🎉

move io constants to common/io (#9700)
refine fuse/io/read (#9711)

Planner

rename Scalar to ScalarExpr (#9665)

Storage

refactor cache layer (#9672)
pruner.rs -> fuse_bloom_pruner.rs (#9710)
make pruner hierarchy to chain (#9714)

Build/Testing/CI Infra Changes 🔌

support setup minio storage & external s3 storage in docker image (#9676)

Bug Fixes 🔧

Expression

fix missing simple_cast (#9671)

Query

fix efficiently_memory_final_aggregator result is not stable (#9685)
fix max_result_rows only limit output results nums (#9661)
fix query hang in two level aggregator (#9694)

Storage

may get wrong datablocks if not sorted by output schema (#9470)
bloom filter is using wrong cache key (#9706)

What's On In Databend

Stay connected with the latest news about Databend.

Databend All-in-One Docker Image

Databend Docker Image now supports setting up MinIO storage and external AWS S3 storage.

Now you can easily use a Docker image for your first experiment with Databend.

Run with MinIO as backend

docker run \
    -p 8000:8000 \
    -p 9000:9000 \
    -e MINIO_ENABLED=true \
    datafuselabs/databend

Run with self managed query config

docker run \
    -p 8000:8000 \
    -e DATABEND_QUERY_CONFIG_FILE=/etc/databend/mine.toml \
    -v query_config_file:/etc/databend/mine.toml \
    datafuselabs/databend

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Vector Search

Vector search captures the meaning and context of unstructured data, and is commonly used for text or image processing, enabling the use of semantics to find similar results and obtain more valid results than traditional keyword retrieval.

Databend plans to provide users with a richer and more efficient means of querying by supporting vector search, and the introduction of Faiss Index may be an initial solution.

Issue 9699: feat: vector search (Faiss index)

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	ariesdevil	b41sh	BohuTANG	dantengsky	dependabot[bot]


everpcpc	flaneur2020	johnhaxx7	leiysky	mergify[bot]	PsiACE


RinChanNOWWW	sandflee	sundy-li	xudong963	zhang2014	zhyass

This Week in Databend #77

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Meta

use expression::TableSchema to replace obsolete datavalues::DataSchema (#9506)
iter() iterate every tree and every records in theses trees (#9621)

Expression

add other base geo functions (#9588)

Optimizer

improve cardinality estimation for join based on histogram (#9594)

Planner

improve join reorder algorithm (#9571)

Query

support insert with placeholder (#9575)
set setting support expr (#9574)
add information_schema for sharding-jdbc (#9583)
support named params for table functions (#9630)

Storage

read_parquet page index (#9563)
update interpreter and storage support (#9261)

Code Refactoring 🎉

refine on_error mode (#9473)

Meta

remove unused meta types and conversion util (#9584)

Parser

more strict parser for format_options (#9635)

Expression

rearrange common_expression and common_function (#9585)

Build/Testing/CI Infra Changes 🔌

run sqllogictests with binary (#9603)

Bug Fixes 🔧

Expression

constant folder should run repeatedly until stable (#9572)
check_date() and to_string(boolean) may panic (#9561)

Planner

fix stack overflow when applying RuleFilterPushDownJoin (#9645)

Storage

fix range filter read stat with index (#9619)

Sqllogictest

sqllogic test hangs (cluster mod + clickhouse handler) (#9615)

What's On In Databend

Stay connected with the latest news about Databend.

Upgrade Databend Query from 0.8 to 0.9

Databend-query-0.9 introduces incompatible changes in metadata, these metadata has to be manually migrated. Databend provides a program for this job: databend-meta-upgrade-09, which you can find in a release package or can be built from source.

Upgrade

databend-meta-upgrade-09 --cmd upgrade --raft-dir "<./your/raft-dir/>"

Learn More

Release Proposal: Nightly v1.0

The call for proposals for the release of v1.0 is now open.

The preliminary plan is to release in March, mainly focusing on alter table, update, and group by spill.

Learn More

Release proposal: Nightly v1.0

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Add Type Checker for Sqllogictest

We can check if each row's each element's type is correct.

databend/tests/sqllogictests/src/client/mysql_client.rs

 // Todo: add types to compare
 Ok(DBOutput::Rows {
     types,
     rows: parsed_rows,

Issue 9647: Feature: Add type checker for sqllogictest

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

We're gearing up for the v0.9 release of Databend. Stay tuned.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	ariesdevil	b41sh	BohuTANG	dantengsky	drmingdrmer


everpcpc	leiysky	mergify[bot]	PsiACE	RinChanNOWWW	soyeric128


sundy-li	TCeason	Xuanwo	xudong963	youngsofun	yufan022


zhang2014	zhyass

This Week in Databend #76

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Meta

add reader-min-msg-ver and msg-min-reader-ver in proto-conv (#9535)

Planner

support tuple.1 and get(1)(tuple) (#9493)
support display estimated rows in EXPLAIN (#9528)

Query

efficiently memory two level group by in standalone mode (#9504)

Storage

support nested type in read_parquet (#9486)
add build options table (#9502)

Code Refactoring 🎉

merge new expression (#9411)
remove and rename crates (#9481)
bump rust version (#9540)

Expression

move negative functions to binder (#9484)
use error_to_null() to eval try_cast (#9545)

Functions

replace h3ron to h3o (#9553)

Format

extract AligningStateTextBased (#9472)
richer error context (#9534)

Query

use ctx to store the function evaluation error (#9501)
refactor map access to support view read tuple inner (#9516)

Storage

bump opendal for streaming read support (#9503)
refactor bloom index to use vectorized siphash function (#9542)

Bug Fixes 🔧

HashTable

fix memory leak for unsized hash table (#9551)

Storage

fix row group stats collection (#9537)

What's On In Databend

Stay connected with the latest news about Databend.

New Year, New Expression!

We're so thrilled to tell you that Databend now fully works with New Expression after more than a half year of dedicated work. New Expression introduces a formal type system to Databend and supports type-safe downward casting , making the definition of functions easier.

New Expression is still being tuned, and a new version (v0.9) of Databend will be released once the tuning work is complete.

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

`UNNEST` Function

The UNNEST function takes an array as a parameter, and returns a table containing each element of the array in a row.

Syntax

UNNEST(ARRAY) [WITH OFFSET]

If you're interested in becoming a contributor, helping us develop the UNNEST function would be a good start.

Issue 9549: Feature: Support unnest

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

We're gearing up for the v0.9 release of Databend. Stay tuned.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	ariesdevil	b41sh	BohuTANG	ClSlaid	dantengsky


dependabot[bot]	drmingdrmer	everpcpc	flaneur2020	leiysky	mergify[bot]


PsiACE	RinChanNOWWW	soyeric128	sundy-li	TCeason	wubx


Xuanwo	xudong963	youngsofun	zhang2014

This Week in Databend #75

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Format

basic output format JSON (#9447)

Query

check connection params (#9437)
add max_query_row_nums (#9406)

Storage

support prewhere in hive (#9427)
add generic cache trait for different object reader (#9436)
add metrics for new cache (#9445)

New Expression

migrate hash func to func-v2 (#9402)

Sqllogictest

run all tests in parallel (#9400)

Code Refactoring 🎉

Storage

add to_bytes and from_bytes for CachedObject (#9439)
better table-meta and parquet reader function (#9434)
convert fuse_snapshot unit tests to sqllogic test (#9428)

Bug Fixes 🔧

Format

catch unwind when read split (#9420)

User

lazy load JWKS (#9446)

Planner

create Stage URL's path should ends with / (#9450)

What's On In Databend

Stay connected with the latest news about Databend.

Databend 2022 Recap

Let's look back and see how Databend did in 2022.

Open source: got 2,000+ stars, merged 2,400+ PRs, and solved 1,900 issues.
From data warehouse to lakehouse: Brand-new design with enhanced capabilities.
Rigorous testing: SQL Logic Tests, SQLancer, and https://perf.databend.rs.
Building the ecosystem: More customers chose, trusted, and grew with Databend, including Kuaishou and SAP.
Databend Cloud: Built on top of Databend, the next big data analytics platform.

We wish everyone a Happy New Year and look forward to engaging with you.

Learn More

Blog | Databend 2022 Recap

Databend 2023 Roadmap

As the new year approaches, Databend is also actively planning its roadmap for 2023.

We will continue to polish the Planner and work on data and query caching. Enhancing storage and query issues for PB-level data volumes is also on our list.

Try Databend and join the roadmap discussion.

Learn More

Issue | Roadmap 2023

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Profile-Guided Optimization (PGO)

The basic concept of PGO is to collect data about the typical execution of a program (e.g. which branches it is likely to take) and then use this data to inform optimizations such as inlining, machine-code layout, register allocation, etc.

rustc supports doing profile-guided optimization (PGO). We expect to be able to use it to enhance the build.

Issue 9387: Feature: Add PGO Support

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


ariesdevil	BohuTANG	dantengsky	dependabot[bot]	everpcpc	flaneur2020


hantmac	leiysky	mergify[bot]	PsiACE	sandflee	soyeric128


sundy-li	TCeason	Xuanwo	xudong963	youngsofun	zhang2014

Databend 2022 Recap

The year is coming to an end, and Databend is about to enter its third year. Before we count down the new year, it's a good idea to look back and see how Databend did in 2022.

Open Source: Receiving Increased Attention

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

databend

The open-source philosophy has guided Databend from the very beginning. The entire team works seamlessly on GitHub where the Rust community and many data pros are fully involved. In 2022, the Databend repository:

Got 2,000+ stars, totaling 5,000 .
Merged 2,400+ PRs, totaling 5,600 .
Solved 1,900 issues, totaling 3,000 .
Received 16,000 commits, totaling 23,000 .
Attracted more contributors, totaling 138 .

Development: Inspired by Real Scenarios

Databend brought many new features and improvements in 2022 to help customers with their real work scenarios.

databend arch

Brand-New Data Warehouse

As a data warehouse inspired by and benchmarking itself against Snowflake and Clickhouse, Databend fully took advantage of "Cloud Native" to bring you a new design and implementation without breaking the balance between performance and maintainability:

Added support for Stage and Data Sharing, helping users manage their data life cycle with more options.
Introduced a new planner with user-friendly error prompts and efficient optimization techniques for the execution plan.
Redesigned the type system to support type checking and type-safe downward casting.
Enhanced the new processor framework: It can now work in both Pull and Push modes.
Added experimental support for Native Format to improve performance when running on a local disk.

Databend as Lakehouse

Storing and managing massive data is key to our vision "Databend as Lakehouse" . A lot of efforts have been made in 2022 for a larger data payload and a wider range of accepted data sources:

Adopted OpenDAL in the data access layer as a unified interface.
Expanded support for structured and semi-structured data.
Added the ability to keep multiple catalogs: This makes integrations with custom catalogs such as Hive much easier.
Added the ability to query data directly from a local, staged, or remote file.

Optimal Efficiency Ratio

After a year of continuous tuning, we brought Databend to a new stage featuring elastic scheduling and separating storage from compute. We're thrilled to see a significant improvement in the efficiency ratio:

In some scenarios, Databend works as efficiently as Clickhouse.
Lowered costs by 90% compared to Elasticsearch, and by 30% compared to Clickhouse.

Testing: Put Us at Ease

Comprehensive tests help make a database management system robust. While optimizing performance, we also care about the accuracy and reproducibility of SQL results returned from Databend.

databend perf

Correctness Testing

In 2022, we replaced stateless tests with SQL Logic Tests for Databend in the first place. We also migrated a large number of mature test cases to cover as many scenarios as possible. Afterward, we started to use a Rust native test program called sqllogictest-rs instead of the previous Python one, which saved us a lot of time on CI without losing the maintainability of the tests.

Furthermore, we also planned and implemented three types of automated testing (TLP, QPS, and NoREC) supported by SQLancer. All of them have been successfully merged into the main branch with dozens of bug fixes.

Performance Testing

Performance testing is also essential for us. In 2022, we launched a website (https://perf.databend.rs/) to track daily performance changes and spot potential issues. Meanwhile, we actively evaluated Databend against Clickbench and some other benchmarks.

Ecosystem: Give and Take

The Databend ecosystem and users benefit from each other. More and more users were attracted to the ecosystem and joined the community in 2022. As they brought their own creative ideas to Databend and made them come true, the Databend ecosystem made tremendous progress and started to flourish in the field.

Positive Expansion

We build and value the Databend ecosystem. Databend is now compatible with the MySQL protocol and Clickhouse HTTP Handler, and can seamlessly integrate with the following data services or utilities:

Airbyte
DBT
Addax (Datax)
Vector
Jupyter Notebook
DBeaver

To help users develop and customize services based on Databend, we developed drivers in multiple languages, including Python and Go.

Growing with Users

Users are the basis of Databend. They help develop Databend and stir up the whole community.

In 2022, Databend added support for the Hive Catalog with the help of Kuaishou Technology. This connected Databend to the Hive ecosystem and encouraged us to consider the possibility of multiple catalogs. DMALL implemented and verified data archiving with Databend. We also appreciate SHAREit, Voyance, DigiFinex, and Weimob for their trust and support.

The Databend ecosystem includes a few projects that are loved and trusted by other products:

OpenDAL now manages the data access layer for sccache , which provides further support for Firefox CI . Other database and data analysis projects, such as GreptimeDB and deepeth/mars , also used OpenDAL for data access.
OpenRaft was used to implement a Feature Registry (database to hold feature metadata) in Azure/Feathr. SAP, Huobi, and Meituan also used it in a few internal projects.
The MySQL protocol implementation in OpenSrv has been applied to multiple database projects such as GreptimeDB and CeresDB .

In 2022, the Databend community launched the "Data Infra Club" for knowledge sharing. Our friends from PingCAP, Kuaishou Technology, DMALL, and SHAREit were invited to share their insights on big data platforms, Data Mesh, and Modern Data Stack. You can find all the video replays on Bilibili if you're interested.

Going Cloud: Sky's the Limit

Going cloud is part of Databend's business strategy where most Databend users come from the cloud.

Built on top of Databend, Databend Cloud is a big-data analytics platform of the next generation, featuring easy-to-use , low-cost , and high-performance . Two versions of Databend Cloud are now available and open for trial:

https://databend.com (for international users)
https://databend.cn (for China users only)

This Week in Databend #74

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Meta

remove stream when a watch client is dropped (#9334)

Planner

support selectivity estimation for range predicates (#9398)

Query

support copy on error (#9312)
support databend-local (#9282)
external storage support location part prefix (#9381)

Storage

rangefilter support in (#9330)
try to improve object storage io read (#9335)
support table compression (#9370)

Metrics

add more metrics for fuse compact and block write (#9399)

Sqllogictest

add no-fail-fast support (#9391)

Code Refactoring 🎉

adopt rustls entirely, removing all deps to native-tls (#9358)

Format

remove format_xxx settings (#9360)
adjust interface of FileFormatOptionsExt (#9395)

Planner

remove SyncTypeChecker (#9352)

Query

split fuse source to read data and deserialize (#9353)
avoid io copy in read parquet data (#9365)
add uncompressed buffer for parquet reader (#9379)

Storage

add read/write settings (#9359)

Bug Fixes 🔧

Format

fix align_flush with header only (#9327)

Settings

use logical CPU number as default value of num_cpus (#9396)

Processors

the data type on both sides of the union does not match (#9361)

HTTP Handler

false alarm (warning log) about query not exists (#9380)

Sqllogictest

refactor sqllogictest http client and fix expression string like (#9363)

What's On In Databend

Stay connected with the latest news about Databend.

Introducing databend-local

Inspired by clickhouse-local, databend-local allows you to perform fast processing on local files, without the need of launching a Databend cluster.

> export CONFIG_FILE=tests/local/config/databend-local.toml
> cargo run --bin=databend-local -- --sql="SELECT * FROM tbl1" --table=tbl1=/path/to/databend/docs/public/data/books.parquet

exec local query: SELECT * FROM tbl1
+------------------------------+---------------------+------+
| title                        | author              | date |
+------------------------------+---------------------+------+
| Transaction Processing       | Jim Gray            | 1992 |
| Readings in Database Systems | Michael Stonebraker | 2004 |
| Transaction Processing       | Jim Gray            | 1992 |
| Readings in Database Systems | Michael Stonebraker | 2004 |
+------------------------------+---------------------+------+
4 rows in set. Query took 0.015 seconds.

Learn More

PR | support databend-local

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Compressing Short Strings

When processing the same queries with short strings involved, Databend usually reads more data than other databases, such as Snowflake.

SELECT SearchPhrase, MIN(URL), COUNT(*) AS c FROM hits WHERE URL LIKE '%google%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;

Such queries might be more efficient if short strings (URLs, etc) are compressed.

Issue 9001: performance: compressing for short strings

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	ariesdevil	BohuTANG	dantengsky	drmingdrmer	eastfisher


everpcpc	leiysky	mergify[bot]	PsiACE	RinChanNOWWW	soyeric128


sundy-li	Xuanwo	xudong963	youngsofun	zhang2014	zhyass

This Week in Databend #73

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Multiple Catalogs

implement show tables (from|in catalog.database) (#9153)

Planner

introduce histogram in column statistics (#9310)

Query

support attaching stage for insert values (#9249)
add native format in fuse table (#9279)
add internal_enable_sandbox_tenant config and sandbox_tenant (#9277)

Sqllogictest

introduce rust native sqllogictest framework (#9150)

Code Refactoring 🎉

unify apply_file_format_options for copy & insert (#9323)

remove unused code (#9266)

meta

test watcher count (#9324)

Planner

replace TableContext in planner with PlannerContext (#9290)

Bug Fixes 🔧

Base

try fix SIGABRT when catch unwind (#9269)
replace #[thread_local] to thread_local macro (#9280)

Query

fix unknown database in query without relation to this database (#9250)
fix wrong current_role when drop the role (#9276)

What's On In Databend

Stay connected with the latest news about Databend.

Introduced a Rust Native Sqllogictest Framework

Sqllogictest verifies the results returned from a SQL database engine by comparing them with the results of other engines for the same queries.

In the past, Databend ran such tests using a program written in Python and migrated a large number of test cases from other popular databases. We implemented the program again with sqllogictest-rs in recent days.

Learn More

Experimental: Native Format

PA is a native storage format based on Apache Arrow. Similar to Arrow IPC, PA aims at optimizing the storage layer.

Databend is introducing PA as a native storage format in the hope of getting a performance boost, though it's still at an early stage of development.

create table tmp (a int) ENGINE=FUSE STORAGE_FORMAT='native';

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Checking File Existence Before Returning Presigned URL

When presigning a file, Databend currently returns a potentially valid URL based on the filename without checking if the file really exists. Thus, the 404 error might occur if the file doesn't exist at all.

Issue 8702: Before return presign url add file exist judgement

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


ariesdevil	b41sh	BohuTANG	ClSlaid	drmingdrmer	everpcpc


leiysky	mergify[bot]	PsiACE	sandflee	soyeric128	sundy-li


Xuanwo	xudong963	youngsofun	zhang2014	ZhiHanZ	zhyass

This Week in Databend #72

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Multiple Catalogs

extends show databases SQL (#9152)

Stage

support select from URI (#9247)

Streaming Load

support file_format syntax in streaming load insert sql (#9063)

Planner

push down limit to union (#9210)

Query

use analyze table instead of optimize table statistic (#9143)
fast parse insert values (#9214)

Storage

use distinct count calculated by the xor hash function (#9159)
read_parquet read meta before read data (#9154)
push down filter to parquet reader (#9199)
prune row groups before reading (#9228)

Open Sharing

add prototype open sharing and add sharing stateful tests (#9177)

Code Refactoring 🎉

simplify the global data registry logic (#9187)

Storage

refactor deletion (#8824)

Build/Testing/CI Infra Changes 🔌

release databend deb package and databend with hive (#9138, #9241, etc.)

Bug Fixes 🔧

Format

support ASCII control code hex as format field delimiter (#9160)

Planner

prewhere_column empty and predicate is not const will return empty (#9116)
don't push down topk to Merge when it's child is Aggregate (#9183)
fix nullable column validity not equal (#9220)

Query

address unit test hang on test_insert (#9242)

Storage

too many io requests for read blocks during compact (#9128)
collect orphan snapshots (#9108)

What's On In Databend

Stay connected with the latest news about Databend.

Breaking Change: Unified File Format Options

To simplify, we're rolling out a set of unified file format options as follows for the COPY INTO command, the Streaming Load API, and all the other cases where users need to describe their file formats:

[ FILE_FORMAT = ( TYPE = { CSV | TSV | NDJSON | PARQUET | XML} [ formatTypeOptions ] ) ]

Please note that the current format options starting with format_* will be deprecated.
... FORMAT CSV ... will still be accepted by the ClickHouse handler.
Support for customized formats created by CREATE FILE FORMAT ... will be added in a future release: ... FILE_FORMAT = (format_name = 'MyCustomCSV') .... .

Learn More

Open Sharing is a simple and secure data-sharing protocol designed for databend-query nodes running in a multi-cloud environment.

Simple & Free: Open Sharing is open-source and basically a RESTful API implementation.
Secure: Open Sharing verifies incoming requesters' identities and access permissions, and provides an audit log.
Multi-Cloud: Open Sharing supports a variety of public cloud platforms, including AWS, Azure, GCP, etc.

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

We're about to run stage-related tests again using the Streaming Load API to move files to a stage instead of an AWS command like this:

aws --endpoint-url ${STORAGE_S3_ENDPOINT_URL} s3 cp s3://testbucket/admin/data/ontime_200.csv s3://testbucket/admin/stage/internal/s1/ontime_200.csv >/dev/null 2>&1

This is because Databend users do not need to take care of, or do not even know the stage paths that the AWS command requires.

Issue 8528: refactor stage related tests

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


ariesdevil	b41sh	BohuTANG	Chasen-Zhang	ClSlaid	dantengsky


drmingdrmer	hantmac	lichuang	mergify[bot]	PsiACE	RinChanNOWWW


soyeric128	sundy-li	wubx	Xuanwo	xudong963	youngsofun


ZhiHanZ	zhyass	zzzdong

This Week in Databend #71

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Planner

optimize topk in cluster mode (#9092)

Query

support select * exclude [column_name | (col_name, col_name,...)] (#9009)
alter table flashback (#8967)
new table function read_parquet to read parquet files as a table (#9080)
support select * from @stage (#9123)

Storage

cache policy (#9062)
support hive nullable partition (#9064)

Code Refactoring 🎉

Memory Tracker

keep tracker state consistent (#8973)

REST API

drop ctx after query finished (#9091)

Bug Fixes 🔧

Configs

add more tests for hive config loading (#9074)

Planner

try to fix table name case sensibility (#9055)

Functions

vector_const like bug fix (#9082)

Storage

update last_snapshot_hint file when purge (#9060)

Cluster

try fix broken pipe or connect reset (#9104)

What's On In Databend

Stay connected with the latest news about Databend.

RESTORE TABLE

By the snapshot ID or timestamp you specify in the command, Databend restores the table to a prior state where the snapshot was created. To retrieve snapshot IDs and timestamps of a table, use FUSE_SNAPSHOT.

-- Restore with a snapshot ID
ALTER TABLE <table> FLASHBACK TO (SNAPSHOT => '<snapshot-id>');
-- Restore with a snapshot timestamp
ALTER TABLE <table> FLASHBACK TO (TIMESTAMP => '<timestamp>'::TIMESTAMP);

Learn More

PR | alter table flashback

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Adding Build Information to Error Report

An error report currently only contains an error code and some information about why the error occurred. When build information is available, troubleshooting will become easier.

"Code: xx. Error: error msg... (version ...)"

Issue 9117: Add Build Information to the Error Report

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	b41sh	BohuTANG	dantengsky	drmingdrmer	everpcpc


lichuang	mergify[bot]	PsiACE	RinChanNOWWW	sandflee	soyeric128


sundy-li	TCeason	Xuanwo	xudong963	youngsofun	zhang2014


ZhiHanZ

This Week in Databend #70

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Format

better checking of format options (#8981)
add basic schema infer for parquet (#9043)

Query

QualifiedName support 'db.table.' and 'table.' (#8965)
support bulk insert without expression (#8966)

Storage

add cache layer for fuse engine (#8830)
add system table system.memory_statistics (#8945)
add optimize statistic ddl support (#8891)

Code Refactoring 🎉

Base

remove common macros (#8936)

Format

TypeDeserializer get rid of FormatSetting (#8950)

Planner

refactor extract or predicate (#8951)

Processors

optimize join by merging build data block (#8961)

New Expression

allow sparse column id in chunk, redo #8789 with a new approach. (#9008)

Documentation 📔

i18n support with crowdin (#8987, #8997, etc.)

Bug Fixes 🔧

Base

try fix lost tracker (#8932)

Meta

fix share db bug, create DatabaseIdToName if need (#9006)

Mysql handler

fix mysql conns leak (#8894)

Processors

try fix update list memory leak (#9023)

Storage

read and write block in parallel when compact (#8921)

What's On In Databend

Stay connected with the latest news about Databend.

Infer Schema at a Glance

You usually need to create a table before loading data from a file stored on a stage or somewhere. Unfortunately, sometimes you might not know the file schema to create the table or are unable to input the schema due to its complexity.

Introducing the capability to infer schema from an existing file will make the work much easier. You will even be able to query data directly from a stage using a SELECT statement like select * from @my_stage.

INFER 's3://mybucket/data.csv' FILE_FORMAT = ( TYPE = CSV );
+-------------+---------+----------+
| COLUMN_NAME | TYPE    | NULLABLE |
|-------------+---------+----------|
| CONTINENT   | TEXT    | True     |
| COUNTRY     | VARIANT | True     |
+-------------+---------+----------+

We've added support for inferring the basic schema from parquet files in #9043, and we're now working on #7211 to implement select from @stage.

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Add Tls Support for Mysql Handler

opensrv-mysql v0.3.0 that was released recently includes support for TLS. It sounds like a good idea to introduce it to Databend.

let (is_ssl, init_params) = opensrv_mysql::AsyncMysqlIntermediary::init_before_ssl(
    &mut shim,
    &mut r,
    &mut w,
    &Some(tls_config.clone()),
)
.await
.unwrap();

opensrv_mysql::secure_run_with_options(shim, w, ops, tls_config, init_params).await

Issue 8983: Feature: tls support for mysql handler

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.


andylokandy	ariesdevil	b41sh	BohuTANG	dantengsky	drmingdrmer


everpcpc	flaneur2020	leiysky	lichuang	mergify[bot]	PsiACE


sandflee	soyeric128	sundy-li	TCeason	TracyZYJ	Xuanwo


xudong963	youngsofun	yufan022	zhang2014	zhyass

What's New​

Enhancing Full-Text Search​

What's New​

Supporting EXECUTE IMMEDIATE​

What's New​

Supporting CHANGES Clause​

What's New​

Supporting Multi-Table Insert​

What's New​

Adding Support for Query Matching Inverted Index​

What's New​

Understanding Tasks and Notifications in Databend​

What's New​

Supporting Query Queue​

What's New​

Supporting Multi-Statement Transactions​

What's New​

Understanding Engine Support for Open Table Formats​

What's New​

Supporting CREATE [ OR REPLACE ]​

What's New​

Supporting GEO Data Types​

What's New​

Supporting CREATE OR REPLACE DATABASE​

What's New​

Supporting Standard Streams​

What's New​

Querying Data on HuggingFace File System with Databend​

What's New​

Understanding Databend's Access Control​

What's New​

New Filter Execution Framework​

What's New​

Password Policy​

What's New​

Compatibility Added for Delta and Iceberg Table Engines​

Code Corner​

ClickHouse HTTP Handler Disabled by Default​

Replacing Bincode with Borsh​

Highlights​

What's Up Next​

Introducing Password Policy​

Changelog​

What's New​

MERGE INTO now Shows Statistics​

Code Corner​

Adding Custom Clippy Rules with clippy.toml​

Highlights​

What's Up Next​

Adding Support for More Data File Types to INFER_SCHEMA​

Changelog​

What's New​

Supporting Chain Function Calls​

Code Corner​

Load Balancing with Query ID and Nginx​

Highlights​

What's Up Next​

Supporting for Calculating JOIN ORDER Using Greedy Algorithm​

Changelog​

What's New​

Supporting Append-only Stream​

Code Corner​

Switching from parquet2 to parquet-rs​

Highlights​

What's Up Next​

Supporting Iceberg and Delta Tables with Table Engine​

Changelog​

What's New​

Managing Storage Access with CONNECTION​

Code Corner​

opensrv v0.5.0 Released​

Highlights​

What's Up Next​

Summary Support for Dry Run of VACCUUM TABLE​

Changelog​

What's New​

User Case: Databend Cloud Helped AIGC Startup Reduce Cost​

Code Corner​

databend-udf Released, Databend Cloud Officially Supports External Functions​

Highlights​

What's New

Enhancing Full-Text Search

What's New

Supporting `EXECUTE IMMEDIATE`

What's New

Supporting `CHANGES` Clause

What's New

Supporting Multi-Table Insert

What's New

Adding Support for Query Matching Inverted Index

What's New

Understanding Tasks and Notifications in Databend

What's New

Supporting Query Queue

What's New

Supporting Multi-Statement Transactions

What's New

Understanding Engine Support for Open Table Formats

What's New

Supporting `CREATE [ OR REPLACE ]`

What's New

Supporting GEO Data Types

What's New

Supporting `CREATE OR REPLACE DATABASE`

What's New

Supporting Standard Streams

What's New

Querying Data on HuggingFace File System with Databend

What's New

Understanding Databend's Access Control

What's New

New Filter Execution Framework

What's New

Password Policy

What's New

Compatibility Added for Delta and Iceberg Table Engines

Code Corner

ClickHouse HTTP Handler Disabled by Default

Replacing Bincode with Borsh

Highlights

What's Up Next

Introducing Password Policy

Changelog

What's New

MERGE INTO now Shows Statistics

Code Corner

Adding Custom Clippy Rules with clippy.toml

Highlights

What's Up Next

Adding Support for More Data File Types to `INFER_SCHEMA`

Changelog

What's New

Supporting Chain Function Calls

Code Corner

Load Balancing with Query ID and Nginx

Highlights

What's Up Next

Supporting for Calculating JOIN ORDER Using Greedy Algorithm

Changelog

What's New

Supporting Append-only Stream

Code Corner

Switching from parquet2 to parquet-rs

Highlights

What's Up Next

Supporting Iceberg and Delta Tables with Table Engine

Changelog

What's New

Managing Storage Access with CONNECTION

Code Corner

opensrv v0.5.0 Released

Highlights

What's Up Next

Summary Support for Dry Run of VACCUUM TABLE

Changelog

What's New

User Case: Databend Cloud Helped AIGC Startup Reduce Cost

Code Corner

databend-udf Released, Databend Cloud Officially Supports External Functions

Highlights