
Databend Data Ingestion Performance Test: From 3K rows/s to 2.3M rows/s - How to Choose Among 4 Methods

wubx · Jan 4, 2026

As a cloud-native lakehouse, Databend supports both on-premises and cloud deployments, offering 4 different data ingestion methods. This article explores the performance differences between these methods through testing and attempts to understand the design rationale behind them, helping you choose the most suitable ingestion method for your business scenario.

Databend's Supported Ingestion Methods

Databend provides 4 data ingestion methods suitable for different deployment scenarios:

1. INSERT (Bulk Insert with Presign)

Object storage can be accessed directly by the application, so presigned URLs can be used. Suitable for cloud scenarios, taking full advantage of object storage not charging for data ingestion traffic.

Short name: insert

2. INSERT_NO_PRESIGN

Object storage cannot be directly accessed by applications; data is forwarded through Databend nodes. Suitable for on-premises scenarios.

Short name: insert_no_presign

3. STREAMING_LOAD

Supports direct streaming ingestion of common formats like CSV, NDJSON, and Parquet. This is a widely used approach in on-premises scenarios and is well suited for real-time data ingestion.

Short name: streaming_load

4. STAGE_LOAD (Copy Into)

Batch loading based on object storage, capable of processing CSV, NDJSON, Parquet, ORC files and their compressed formats. This is the highest-performance cloud-native ingestion method.

Short name: stage_load

💡 Tip: All four methods can be implemented directly in Java using databend-jdbc.

💻 Code Examples

1. INSERT (Bulk Insert)

Use Case: Cloud scenarios where object storage is accessible to applications

Properties props = new Properties();
props.setProperty("user", "root");
props.setProperty("password", "password");

try (Connection conn = DriverManager.getConnection("jdbc:databend://host:8000/db", props)) {
    conn.setAutoCommit(false);

    String sql = "INSERT INTO bench_insert (id, name, birthday, address, ...) VALUES (?, ?, ?, ?, ...)";

    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        // Batch insert
        for (int i = 0; i < batchSize; i++) {
            ps.setLong(1, i);
            ps.setString(2, "name_" + i);
            ps.setString(3, "2024-01-01");
            ps.setString(4, "address_" + i);
            // ... Set other fields
            ps.addBatch();
        }

        ps.executeBatch(); // Execute batch insert
    }
}

Features:

  • Uses presigned URLs; data is written directly to object storage
  • Suitable for cloud scenarios, reducing network transfer costs

2. INSERT_NO_PRESIGN

Use Case: On-premises scenarios where object storage cannot be directly accessed by applications

Properties props = new Properties();
props.setProperty("user", "root");
props.setProperty("password", "password");
props.setProperty("presigned_url_disabled", "true"); // Key configuration: disable presign

try (Connection conn = DriverManager.getConnection("jdbc:databend://host:8000/db", props)) {
    conn.setAutoCommit(false);

    String sql = "INSERT INTO bench_insert (id, name, birthday, ...) VALUES (?, ?, ?, ...)";

    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        for (int i = 0; i < batchSize; i++) {
            ps.setLong(1, i);
            ps.setString(2, "name_" + i);
            // ... Set other fields
            ps.addBatch();
        }

        ps.executeBatch();
    }
}

Features:

  • Simply add presigned_url_disabled=true to the connection string; no other changes are needed
  • Data is forwarded to object storage through Databend nodes

3. STREAMING_LOAD

Use Case: Real-time ingestion in on-premises scenarios, supporting multiple formats

import com.databend.jdbc.DatabendConnection;
import com.databend.jdbc.DatabendConnectionExtension;

Properties props = new Properties();
props.setProperty("user", "root");
props.setProperty("password", "password");

try (Connection conn = DriverManager.getConnection("jdbc:databend://host:8000/db", props)) {
    DatabendConnection databendConn = conn.unwrap(DatabendConnection.class);

    // Construct CSV data
    String csvData = """
            1,batch_1,name_1,2024-01-01,address_1,...
            2,batch_1,name_2,2024-01-01,address_2,...
            3,batch_1,name_3,2024-01-01,address_3,...
            """;

    byte[] payload = csvData.getBytes(StandardCharsets.UTF_8);

    try (InputStream in = new ByteArrayInputStream(payload)) {
        String sql = "INSERT INTO bench_insert FROM @_databend_load FILE_FORMAT=(type=CSV)";

        int loaded = databendConn.loadStreamToTable(
                sql,
                in,
                payload.length,
                DatabendConnectionExtension.LoadMethod.STREAMING
        );

        System.out.println("Loaded rows: " + loaded);
    }
}

Features:

  • Uses the special stage @_databend_load for direct streaming loading
  • Supports CSV/NDJSON/Parquet/ORC formats
  • Throughput rapidly improves with larger batches (see the file-based sketch below)
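
For larger batches, the CSV payload does not have to be built in memory: the same loadStreamToTable call shown above can stream from a file. Below is a minimal sketch under that assumption — the path data.csv is a placeholder for a pre-generated CSV batch whose columns match the target table:

import com.databend.jdbc.DatabendConnection;
import com.databend.jdbc.DatabendConnectionExtension;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Properties props = new Properties();
props.setProperty("user", "root");
props.setProperty("password", "password");

Path csvFile = Paths.get("data.csv"); // placeholder: a pre-generated CSV batch on local disk

try (Connection conn = DriverManager.getConnection("jdbc:databend://host:8000/db", props)) {
    DatabendConnection databendConn = conn.unwrap(DatabendConnection.class);

    try (InputStream in = Files.newInputStream(csvFile)) {
        String sql = "INSERT INTO bench_insert FROM @_databend_load FILE_FORMAT=(type=CSV)";

        // Same call as in the example above, but the batch is streamed from disk
        // instead of being held in memory, so batch size is limited by the file, not the heap
        int loaded = databendConn.loadStreamToTable(
                sql,
                in,
                Files.size(csvFile),
                DatabendConnectionExtension.LoadMethod.STREAMING
        );

        System.out.println("Loaded rows: " + loaded);
    }
}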

4. STAGE_LOAD (Copy Into)

Use Case: Large-scale data loading with highest performance

import org.apache.opendal.AsyncOperator;

// 1. Create S3 Operator (using OpenDAL)
Map<String, String> conf = new HashMap<>();
conf.put("bucket", "my-bucket");
conf.put("endpoint", "http://s3-endpoint:9000");
conf.put("access_key_id", "access_key");
conf.put("secret_access_key", "secret_key");

try (AsyncOperator op = AsyncOperator.of("s3", conf);
     Connection conn = DriverManager.getConnection("jdbc:databend://host:8000/db", user, password)) {

    // 2. Create External Stage
    String createStage = """
            CREATE STAGE IF NOT EXISTS my_stage
            URL='s3://my-bucket/data/'
            CONNECTION=(
                endpoint_url = 'http://s3-endpoint:9000'
                access_key_id = 'access_key'
                secret_access_key = 'secret_key'
            )
            """;
    conn.createStatement().execute(createStage);

    // 3. Write data to object storage
    String csvData = "1,name1,2024-01-01,...\n2,name2,2024-01-01,...\n";
    byte[] payload = csvData.getBytes(StandardCharsets.UTF_8);
    op.write("batch_001.csv", payload).join();

    // 4. Use COPY INTO for batch loading
    String copySql = """
            COPY INTO bench_insert
            FROM @my_stage
            PATTERN='.*\\.csv'
            FILE_FORMAT=(type=CSV)
            PURGE=TRUE
            """;
    conn.createStatement().execute(copySql);
}

Features:

  • Data is written to object storage first, then batch loaded, fully leveraging object storage's parallel capabilities (see the sketch below)
  • Saves network costs in cloud scenarios (object storage does not charge for data ingress)
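
To illustrate the parallelism point, here is a minimal sketch extending the example above: several CSV files are written to the stage prefix concurrently via OpenDAL, then a single COPY INTO loads them all. The chunk count, file names, and the buildCsvChunk helper are illustrative placeholders:

import org.apache.opendal.AsyncOperator;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

Map<String, String> conf = new HashMap<>();
conf.put("bucket", "my-bucket");
conf.put("endpoint", "http://s3-endpoint:9000");
conf.put("access_key_id", "access_key");
conf.put("secret_access_key", "secret_key");
conf.put("root", "/data/"); // write under the same prefix as my_stage's URL

try (AsyncOperator op = AsyncOperator.of("s3", conf);
     Connection conn = DriverManager.getConnection("jdbc:databend://host:8000/db", user, password)) {

    // 1. Write several CSV files concurrently -- object storage handles parallel writers well
    List<CompletableFuture<?>> writes = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
        String csvChunk = buildCsvChunk(i); // hypothetical helper that renders one CSV chunk
        byte[] payload = csvChunk.getBytes(StandardCharsets.UTF_8);
        writes.add(op.write(String.format("batch_%03d.csv", i), payload));
    }
    CompletableFuture.allOf(writes.toArray(new CompletableFuture[0])).join();

    // 2. One COPY INTO picks up every matching file; Databend loads the files in parallel
    String copySql = """
            COPY INTO bench_insert
            FROM @my_stage
            PATTERN='.*\\.csv'
            FILE_FORMAT=(type=CSV)
            PURGE=TRUE
            """;
    conn.createStatement().execute(copySql);
}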

Test Plan

  • Platform: A domestic trusted computing environment
  • Tools: Java + databend-jdbc

The load testing program was auto-generated with AI assistance. For details, refer to the scripts: https://github.com/wubx/databend_ingestion/tree/main/db_ingestion

Table Schema

  • Data: Mock data with 23 fields (including VARCHAR, DATE, TIMESTAMP types)
CREATE OR REPLACE TABLE bench_insert (
    id BIGINT,
    batch VARCHAR,
    name VARCHAR,
    birthday DATE,
    address VARCHAR,
    company VARCHAR,
    job VARCHAR,
    bank VARCHAR,
    password VARCHAR,
    phone_number VARCHAR,
    user_agent VARCHAR,
    c1 VARCHAR,
    c2 VARCHAR,
    c3 VARCHAR,
    c4 VARCHAR,
    c5 VARCHAR,
    c6 VARCHAR,
    c7 VARCHAR,
    c8 VARCHAR,
    c9 VARCHAR,
    c10 VARCHAR,
    d DATE,
    t TIMESTAMP
)

Load Testing

for bt in 2000 3000 4000 5000 10000 20000 30000 40000 50000 100000 200000
do
    for type in insert insert_no_presign streaming_load stage_load
    do
        echo $bt $type
        java -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.stderr.encoding=UTF-8 -jar ./db_ingestion-1.0-SNAPSHOT-shaded.jar $type 2000000 $bt
    done
done

Performance Analysis

INSERT vs STREAMING_LOAD Performance Comparison

(Chart: INSERT vs STREAMING_LOAD throughput by batch size)

  • STREAMING_LOAD's DB throughput rises rapidly with larger batches, from 6,958 rows/s at a 2k batch to 127,643 rows/s at a 200k batch, significantly outperforming INSERT (29,283 rows/s) and INSERT_NO_PRESIGN (26,121 rows/s) at the same batch size.
  • INSERT and INSERT_NO_PRESIGN differ only slightly at small batches (3,670 vs 3,929 rows/s at batch=2k), almost converge at ≥20k batches, and are consistently outperformed by the streaming approach.

Key Findings:

  • STREAMING_LOAD DB throughput rapidly improves with larger batches:
    • 2k batch: 6,958 rows/s
    • 200k batch: 127,643 rows/s
  • INSERT and INSERT_NO_PRESIGN show similar performance:
    • Small difference at small batches (2k): 3,670 vs 3,929 rows/s
    • Nearly converge at batches of 20k and above, with presign-enabled INSERT pulling slightly ahead as batch size grows
    • At a 200k batch: INSERT ~29,283 rows/s, INSERT_NO_PRESIGN ~26,121 rows/s
  • STREAMING_LOAD overall outperforms both INSERT and INSERT_NO_PRESIGN

STAGE_LOAD (Copy Into) Performance

  • STAGE_LOAD's COPY throughput is on another level: it reaches ~800k rows/s at a 20k batch and stabilizes around 694k rows/s from 50k batches onward. Total time is more affected by stage write + COPY startup overhead, but the throughput of a single COPY is unmatched.

Key Findings:

  • COPY INTO is susceptible to network bandwidth bottlenecks
  • Total time is affected by stage writing + copy startup overhead, but single COPY throughput is unmatched

Comprehensive Comparison of Four Methods

Comparing all four methods together, stage_load (copy into) is the clear leader. With a sufficient data supply, exceeding 1M rows/s is not a problem; in an actual production environment with a 3-node cluster, COPY INTO reached 2.3M+ rows/s:

10000 rows read in 84.816 sec. Processed 200.19 million rows, 143.15 GiB (2.36 million rows/s, 1.69 GiB/s)

Quick Comparison Table

| Method                 | Throughput | Use Case               | Cloud Cost | Implementation |
|------------------------|------------|------------------------|------------|----------------|
| insert                 | ⭐⭐         | Cloud                  | Low        | Simple         |
| insert_no_presign      | ⭐⭐         | On-premises            | -          | Simple         |
| streaming_load         | ⭐⭐⭐        | On-premises real-time  | High       | Medium         |
| stage_load (copy into) | ⭐⭐⭐⭐⭐      | Large-scale batch load | Low        | Medium         |

How to Choose the Right Method

After reviewing the data above, the question everyone asks is: how should I choose? Should I pursue the highest-performing stage_load (copy into), or opt for convenience? For cloud scenarios, you must also consider how to save money (money-oriented programming is a required course for cloud development).

On-Premises Databend Data Ingestion Method Selection Guide

In on-premises environments, I usually recommend "whatever is convenient," as the first principle is to help developers deliver faster and make the business more stable. From the graphs above, you can see that all methods can easily achieve 30,000+ rows per second, and for massive data, millions of rows per second can be easily ingested, meeting business requirements.

  • Without other constraints, keep it simple, but try to batch writes to reduce Databend's compaction overhead (see the sketch after this list)
  • When object storage cannot be directly accessed by applications, choose between streaming_load and insert_no_presign. Of the two, insert_no_presign only requires adding presigned_url_disabled=true to the connection string with no other configuration changes; if its performance is insufficient, move to streaming_load.
  • When a single machine's network card becomes the bottleneck, use stage_load with OpenDAL to write data to object storage first and then batch load; multiple Databend nodes can load in parallel, and object storage itself scales horizontally, easily reaching millions of rows per second.
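
As a concrete example of the batching advice above, the sketch below buffers incoming rows and flushes a large batch either when a row-count threshold is reached or after a time interval. Row, incomingRows(), the thresholds, and the shortened column list are all illustrative placeholders, not part of databend-jdbc:

final int flushRows = 50_000;        // illustrative threshold, tune for your workload
final long flushIntervalMs = 5_000;  // also flush periodically so data does not sit too long

int buffered = 0;
long lastFlush = System.currentTimeMillis();

try (Connection conn = DriverManager.getConnection("jdbc:databend://host:8000/db", props);
     PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO bench_insert (id, name, birthday, address) VALUES (?, ?, ?, ?)")) {

    for (Row row : incomingRows()) {  // placeholder for your application's data source
        ps.setLong(1, row.id());
        ps.setString(2, row.name());
        ps.setString(3, row.birthday());
        ps.setString(4, row.address());
        ps.addBatch();
        buffered++;

        boolean bySize = buffered >= flushRows;
        boolean byTime = System.currentTimeMillis() - lastFlush >= flushIntervalMs;
        if (bySize || byTime) {
            ps.executeBatch();        // one large write means fewer, bigger blocks to compact
            buffered = 0;
            lastFlush = System.currentTimeMillis();
        }
    }

    if (buffered > 0) {
        ps.executeBatch();            // flush the remaining tail
    }
}

The same pattern applies to streaming_load: accumulate rows into a CSV or NDJSON buffer and call loadStreamToTable once the threshold is hit.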

Cloud Databend Data Ingestion Method Selection Guide

On-premises prioritizes convenience, while cloud prioritizes cost savings under performance requirements. Here, I recommend excluding insert_no_presign and streaming_load, as both may trigger cross-VPC traffic charges with higher costs. For cloud, I recommend:

  • Bulk insert
  • stage_load (copy into)

Both methods leverage object storage to temporarily store data before writing, taking advantage of object storage's feature of not charging for writes from any location, helping users reduce cloud costs. In cloud environments, users often leverage Databend's task + copy into to achieve second-level loading, implementing near real-time data ingestion.

⚠️ Cloud Pitfall Warning:

Not Recommended: INSERT_NO_PRESIGN and STREAMING_LOAD

  • Reason: these two methods may trigger cross-VPC traffic charges, resulting in higher costs

When helping cloud users build or transform their lakehouses, Databend has found that many customers pay surprisingly high traffic costs for data lake ingestion and reads, costs that can be avoided. If your cloud traffic costs are also high, you can contact Databend for an analysis.

Best Practice:

In the cloud, users often leverage Databend's task + copy into to achieve second-level data loading, implementing near real-time data ingestion.

-- Example: Create a scheduled task to load data every minute
CREATE TASK auto_ingest_task
WAREHOUSE = 'default'
SCHEDULE = 1 MINUTE
AS
COPY INTO target_table
FROM @my_stage
PATTERN = '.*\.csv'
FILE_FORMAT = (type = CSV)
PURGE = TRUE;

This approach fully leverages cloud object storage's free data ingress, and also benefits from the fact that reads from object storage within the same region incur no traffic charges.

Conclusion

This load testing demonstrates that Databend's different ingestion methods form a clear gradient between performance, cost, and implementation complexity: streaming_load achieves a better balance between real-time performance and throughput, while stage_load has absolute advantages in throughput and can reduce cloud costs through object storage. Therefore, when implementing solutions, it's recommended to first evaluate business real-time requirements, network topology, and budget constraints, then choose between INSERT, STREAMING_LOAD, and STAGE_LOAD. Through proper batch settings and automated task scheduling, you can achieve stable ingestion capabilities while ensuring optimal ROI.

From on-premises to cloud, the decision logic for data ingestion has changed: it's no longer just about performance competition, but about finding the optimal balance between performance and cost. Choosing the right method not only ensures performance but can also significantly reduce cloud costs. After reading the above content, how would you choose among Databend's 4 ingestion methods?
