Blog

Databend Monthly Report (July 2025)

avatarDatabendLabsAug 6, 2025
Databend Monthly Report (July 2025)

Hey Databend community! 🚀

July was a big month for us. We've been heads-down optimizing one thing: JSON query performance. The main story? We've turbocharged our Virtual Columns feature and the results are pretty impressive - 3x faster JSON queries with 26x less data scanning.

By the Numbers

This month: 25+ new features, 25+ bug fixes, 20+ performance optimizations, and 35+ other improvements. But honestly, the JSON performance work is what I'm most excited about.

Monthly Highlights Summary

🔥 Major Improvements

  • Virtual Columns performance breakthrough - 3x faster JSON queries, 26x less data scanning
  • Enhanced RBAC - now supports connection and sequence objects
  • Shuffle sort optimization - much better performance for large ordered datasets
  • Stream processing improvements - virtual columns now work with streaming data
  • Workload management - memory percentage quotas for better resource control

🛠 Developer Experience

  • 40+ JSON functions - comprehensive toolkit for JSON operations
  • Multiple access patterns - both Snowflake and PostgreSQL syntax work
  • Enhanced UDFs - IMMUTABLE support for better performance
  • Better debugging - improved metactl tools

Performance & Infrastructure

  • Meta-service improvements - millisecond precision, better architecture
  • Query optimizations - including grouping sets rewrites
  • Memory management - percentage-based quotas

🐛 Stability

  • 25+ bug fixes - JSON handling, decimal ops, query execution
  • Better error handling - edge cases and concurrent operations
  • Transaction reliability - temporary tables and system history

What's New and Game-Changing

🚀 Virtual Columns: Automatic JSON Indexing

So here's what we did. We took our existing Virtual Columns feature and completely optimized it for JSON workloads. Think of it as automatic JSON indexing - no manual intervention required.

Virtual Columns work like smart, automatic indexes for your JSON data. When you load JSON, we automatically:

  1. Analyze access patterns in your JSON documents
  2. Create optimized indexes for commonly-accessed paths
  3. Route queries to these fast indexes instead of execute the jsonb function every time

It's like having a database admin who never sleeps, constantly creating the perfect indexes for your JSON queries.

Example JSON:

{
"customer_id": 123,
"order": {
"items": [{ "name": "Shoes", "price": 59.99 }]
},
"user": { "name": "Alice", "email": "[email protected]" }
}

Automatic indexes created for:

  • customer_id
    → Fast integer index
  • order.items[0].price
    → Fast numeric index
  • user.name
    → Fast string index

Zero configuration. Zero maintenance. Just faster queries.

📊 Real Numbers

Here's a real example from our testing:

Before (traditional JSON parsing):

SELECT data['account_balance'], data['address']['city']
FROM user_logs;
-- 3.76 seconds, 11.90 GB processed

After (automatic Virtual Column indexes):

-- Exact same query
SELECT data['account_balance'], data['address']['city']
FROM user_logs;
-- 1.32 seconds, 461 MB processed

That's 3x faster and 26x less data. And you don't change a single line of code.

🔧 All Your JSON Syntax Works

We support the syntax you're already using:

Snowflake style:

SELECT data['user']['name'] FROM customers;       -- brackets
SELECT data:user:name FROM customers; -- colons
SELECT data['user'].profile.name FROM customers; -- mixed

PostgreSQL style:

SELECT data->'user'->'name' FROM customers;       -- arrows

All of it gets the same automatic indexing performance boost.

What's Next: One Architecture, Continuous Evolution

July's JSON optimization is actually part of something much bigger we're building. We're working toward mastering all types of data with one unified architecture:

Our multi-modal data approach:

  • Structured data - World-class columnar performance (already there)
  • Semi-structured data - Automatic Virtual Column indexing (July milestone)
  • 🚀 Unstructured data - Advanced vectorization with predicate pushdown (August preview)

🔮 August Preview: Unstructured Data Storage & Compute

While July was all about automatic indexing for semi-structured JSON, August is going to be about unstructured data. We're launching comprehensive capabilities for vector storage, compute, and query acceleration:

What's coming in August:

  • Vectorized storage - Native vector data types with optimized columnar storage
  • HNSW indexing - Hardware-accelerated similarity search with sub-second query times
  • Cross-modal SQL joins - Query across structured, semi-structured, and unstructured data in single queries

Imagine queries like this:

-- Find similar products using vector embeddings
-- AND analyze their sales performance from structured data
-- AND extract sentiment from JSON reviews
-- All in one SQL query

SELECT p.product_name,
sales.total_revenue,
reviews.data['sentiment']['score'] as sentiment,
COSINE_SIMILARITY(p.embedding, :search_vector) as similarity
FROM products p
JOIN sales_data sales ON p.id = sales.product_id -- structured
JOIN product_reviews reviews ON p.id = reviews.product_id -- semi-structured JSON
WHERE COSINE_SIMILARITY(p.embedding, :search_vector) > 0.8 -- unstructured vectors
AND reviews.data['sentiment']['score'] > 0.7
ORDER BY similarity DESC, total_revenue DESC;

This is one platform for the AI era - where structured analytics, JSON document queries, and vector similarity search all work together seamlessly. No data movement, no separate systems, just SQL that works across all your data types.

Early feedback from production users has been really encouraging. People are seeing immediate performance improvements on their JSON workloads while getting positioned for AI use cases down the road.

Check out what we're building at https://github.com/databendlabs/databend.

Thanks for following along!


The Databend Team

Share this post

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!