Hey Databend community! 🚀
July was a big month for us. We've been heads-down optimizing one thing: JSON query performance. The main story? We've turbocharged our Virtual Columns feature and the results are pretty impressive - 3x faster JSON queries with 26x less data scanning.
By the Numbers
This month: 25+ new features, 25+ bug fixes, 20+ performance optimizations, and 35+ other improvements. But honestly, the JSON performance work is what I'm most excited about.
Monthly Highlights Summary
🔥 Major Improvements
- Virtual Columns performance breakthrough - 3x faster JSON queries, 26x less data scanning
- Enhanced RBAC - now supports connection and sequence objects
- Shuffle sort optimization - much better performance for large ordered datasets
- Stream processing improvements - virtual columns now work with streaming data
- Workload management - memory percentage quotas for better resource control
🛠 Developer Experience
- 40+ JSON functions - comprehensive toolkit for JSON operations
- Multiple access patterns - both Snowflake and PostgreSQL syntax work
- Enhanced UDFs - IMMUTABLE support for better performance
- Better debugging - improved metactl tools
⚡ Performance & Infrastructure
- Meta-service improvements - millisecond precision, better architecture
- Query optimizations - including grouping sets rewrites
- Memory management - percentage-based quotas
🐛 Stability
- 25+ bug fixes - JSON handling, decimal ops, query execution
- Better error handling - edge cases and concurrent operations
- Transaction reliability - temporary tables and system history
What's New and Game-Changing
🚀 Virtual Columns: Automatic JSON Indexing
So here's what we did. We took our existing Virtual Columns feature and completely optimized it for JSON workloads. Think of it as automatic JSON indexing - no manual intervention required.
Virtual Columns work like smart, automatic indexes for your JSON data. When you load JSON, we automatically:
- Analyze access patterns in your JSON documents
- Create optimized indexes for commonly-accessed paths
- Route queries to these fast indexes instead of execute the jsonb function every time
It's like having a database admin who never sleeps, constantly creating the perfect indexes for your JSON queries.
Example JSON:
{
"customer_id": 123,
"order": {
"items": [{ "name": "Shoes", "price": 59.99 }]
},
"user": { "name": "Alice", "email": "[email protected]" }
}
Automatic indexes created for:
- → Fast integer index
customer_id
- → Fast numeric index
order.items[0].price
- → Fast string index
user.name
Zero configuration. Zero maintenance. Just faster queries.
📊 Real Numbers
Here's a real example from our testing:
Before (traditional JSON parsing):
SELECT data['account_balance'], data['address']['city']
FROM user_logs;
-- 3.76 seconds, 11.90 GB processed
After (automatic Virtual Column indexes):
-- Exact same query
SELECT data['account_balance'], data['address']['city']
FROM user_logs;
-- 1.32 seconds, 461 MB processed
That's 3x faster and 26x less data. And you don't change a single line of code.
🔧 All Your JSON Syntax Works
We support the syntax you're already using:
Snowflake style:
SELECT data['user']['name'] FROM customers; -- brackets
SELECT data:user:name FROM customers; -- colons
SELECT data['user'].profile.name FROM customers; -- mixed
PostgreSQL style:
SELECT data->'user'->'name' FROM customers; -- arrows
All of it gets the same automatic indexing performance boost.
What's Next: One Architecture, Continuous Evolution
July's JSON optimization is actually part of something much bigger we're building. We're working toward mastering all types of data with one unified architecture:
Our multi-modal data approach:
- ✅ Structured data - World-class columnar performance (already there)
- ✅ Semi-structured data - Automatic Virtual Column indexing (July milestone)
- 🚀 Unstructured data - Advanced vectorization with predicate pushdown (August preview)
🔮 August Preview: Unstructured Data Storage & Compute
While July was all about automatic indexing for semi-structured JSON, August is going to be about unstructured data. We're launching comprehensive capabilities for vector storage, compute, and query acceleration:
What's coming in August:
- Vectorized storage - Native vector data types with optimized columnar storage
- HNSW indexing - Hardware-accelerated similarity search with sub-second query times
- Cross-modal SQL joins - Query across structured, semi-structured, and unstructured data in single queries
Imagine queries like this:
-- Find similar products using vector embeddings
-- AND analyze their sales performance from structured data
-- AND extract sentiment from JSON reviews
-- All in one SQL query
SELECT p.product_name,
sales.total_revenue,
reviews.data['sentiment']['score'] as sentiment,
COSINE_SIMILARITY(p.embedding, :search_vector) as similarity
FROM products p
JOIN sales_data sales ON p.id = sales.product_id -- structured
JOIN product_reviews reviews ON p.id = reviews.product_id -- semi-structured JSON
WHERE COSINE_SIMILARITY(p.embedding, :search_vector) > 0.8 -- unstructured vectors
AND reviews.data['sentiment']['score'] > 0.7
ORDER BY similarity DESC, total_revenue DESC;
This is one platform for the AI era - where structured analytics, JSON document queries, and vector similarity search all work together seamlessly. No data movement, no separate systems, just SQL that works across all your data types.
Early feedback from production users has been really encouraging. People are seeing immediate performance improvements on their JSON workloads while getting positioned for AI use cases down the road.
Check out what we're building at https://github.com/databendlabs/databend.
Thanks for following along!
The Databend Team
Subscribe to our newsletter
Stay informed on feature releases, product roadmap, support, and cloud offerings!