Hey Databend community! 🚀
November delivered 17 nightly releases focused on richer data experiences. We shipped TimestampTz support, HTTP Arrow payloads, RBAC-aware masking policies, three rounds of runtime filter tuning, and inverted-index/VECTOR improvements for multimodal workloads. Query interfaces now support a wide range of formats—from JSON to Arrow IPC to geometry outputs—enabling analytics teams to build more versatile applications faster.
By the Numbers
27 new features, 28 bug fixes, 18 refactors, 6 CI/build improvements, and 27 quality-of-life updates landed between v1.2.834-nightly and v1.2.850-nightly.
Monthly Highlights
🔥 Major New Features
- TimestampTz & ANSI date keywords — New `TimestampTz`, `CURRENT_DATE`, `CURRENT_TIME`, and `DATE ± INTERVAL` semantics keep temporal pipelines portable across regions
- Arrow over HTTP + geometry metadata — Query interface can return Arrow IPC bodies with timezone and geometry format hints for low-latency BI consumption
- UDTF server runtime — User-defined table functions run as managed services with transport diagnostics and Python import caching
- Position-aware SQL — Standard `POSITION` columns are now queryable in SQL-compatible flows, easing migrations from legacy engines
- System streams telemetry — `system.streams` exposes `has_data` plus richer verification output to debug ingestion
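As a quick sketch of the new temporal semantics (literal values are illustrative; exact `TimestampTz` DDL may vary by release):

```sql
-- ANSI keywords and DATE ± INTERVAL arithmetic
SELECT CURRENT_DATE, CURRENT_TIME;
SELECT DATE '2025-11-01' + INTERVAL 7 DAY AS next_week;
```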
🧠 Multimodal Query & Analytics
- Inverted index TOP-N pruning + caching — ORDER BY/LIMIT queries over JSON/Variant text now skip irrelevant segments, combined with refreshed vector quantization scoring
- Decimal64 & Arrow 56 stack — Native-format tables deserialize Decimal64 columns directly and upgraded Arrow/Parquet crates power consistent schema evolution
- Bitmap utilities — `bitmap_to_array` plus new runtime filter mechanisms unlock faster set analytics and cross-modal joins
- Table statistics admin API — Programmatic stats refresh and the `policy_reference` table function provide building blocks for adaptive query planning
- Virtual column everywhere — External tables, selection queries, and LIMIT pipelines now respect virtual-column calculations
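For instance, `bitmap_to_array` round-trips a bitmap back into a plain array for downstream set logic (a minimal sketch; `to_bitmap` and `bitmap_and` are the usual companions):

```sql
-- Build a bitmap from a delimited string, then expand it back into an array
SELECT bitmap_to_array(to_bitmap('1,3,5,7')) AS members;

-- Intersect two cohorts before expanding, for fast set analytics
SELECT bitmap_to_array(bitmap_and(to_bitmap('1,3,5'), to_bitmap('3,5,9'))) AS overlap;
```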
🛡️ Governance & Trust
- Masking policy upgrades — Multi-column `USING` clauses, RBAC enforcement, conflict detection, and drop protection keep privacy rules consistent
- Policy observability — The `policy_reference` table function documents where masking/row access policies are applied for quick compliance audits
- Secure defaults — Sequence lookups, JWT bindings, and MySQL TLS handshakes received guards to prevent privilege escalation or partial sessions
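A sketch of an RBAC-aware masking policy (the policy, table, column, and role names here are hypothetical):

```sql
-- Mask driver emails for everyone except a privileged role (illustrative)
CREATE MASKING POLICY mask_email AS (val STRING)
RETURNS STRING ->
    CASE WHEN current_role() = 'PII_ADMIN' THEN val ELSE '*****' END;

-- Attach the policy to a column; drop protection now blocks removing
-- a policy that is still referenced
ALTER TABLE drivers MODIFY COLUMN email SET MASKING POLICY mask_email;
```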
⚙️ Performance & Operations
- Runtime filters (Parts 1–3) — Adaptive filters now tolerate nullable casts, capture timing, and avoid redundant casts, shaving seconds off distributed joins
- Synchronous backpressure spilling — Spilling decisions are now handled synchronously to keep memory under control without noisy logs
- Parquet dictionary toggle & timezone consistency — Fuse tables gain `enable_parquet_dictionary`; query timezone propagation was added end-to-end (HTTP headers, session state, clamp helpers)
- Meta-service resilience — Wider logging around aggregate spill, an opendal 0.54.1 bump, and refined fetch-add paths reduce hot spots in busy clusters
🐛 Reliability & Quality
- CSV ingest, pivot projections, and aggregate spill regressions were patched
- COUNT DISTINCT, ZIP import, and bitmap array edge cases now behave deterministically
- Password policy descriptors, record-batch conversion, and large JWKS/MariaDB sessions no longer throw runtime exceptions
Spotlight: Multimodal Data Analytics (CityDrive Scenario)
One Databend database now powers every retrieval pattern—relational, JSON, vector, bitmap, geospatial—without spinning up separate services. This is the core value of the Multimodal Data Analytics guides, which center on the CityDrive Intelligence dataset. The same `video_id` and `frame_id` identifiers link every modality, so relational joins, `QUERY()` full-text search, vector lookups, and geo predicates all compose in one engine.
0. Automated Data Ingest
CityDrive exports each batch as Parquet. The Lakehouse ETL guide sets up a reusable stage and task-driven COPY pipeline that continuously ingests every modality:
-- Stage connection to CityDrive’s S3 bucket
CREATE OR REPLACE CONNECTION citydrive_s3
STORAGE_TYPE='s3'
ACCESS_KEY_ID='<AWS_ACCESS_KEY_ID>'
SECRET_ACCESS_KEY='<AWS_SECRET_ACCESS_KEY>';
-- Shared stage pointing at the raw multimodal exports
CREATE OR REPLACE STAGE citydrive_stage
URL='s3://citydrive-lakehouse/raw/'
CONNECTION=(CONNECTION_NAME='citydrive_s3')
FILE_FORMAT=(TYPE='PARQUET');
-- Autonomous ETL: load clip metadata every 10 minutes
CREATE OR REPLACE TASK task_load_citydrive_videos
WAREHOUSE='default'
SCHEDULE=10 MINUTE
AS
COPY INTO citydrive_videos (
video_id,
vehicle_id,
capture_date,
route_name,
weather,
camera_source,
duration_sec
) FROM (
SELECT
video_id::STRING,
vehicle_id::STRING,
capture_date::DATE,
route_name::STRING,
weather::STRING,
camera_source::STRING,
duration_sec::INT
FROM @citydrive_stage/videos/
) FILE_FORMAT=(TYPE='PARQUET');
-- Autonomous ETL: load frame events in lockstep
CREATE OR REPLACE TASK task_load_frame_events
WAREHOUSE='default'
SCHEDULE=10 MINUTE
AS
COPY INTO frame_events (
frame_id,
video_id,
frame_index,
collected_at,
event_tag,
risk_score,
speed_kmh
) FROM (
SELECT
frame_id::STRING,
video_id::STRING,
frame_index::INT,
collected_at::TIMESTAMP,
event_tag::STRING,
risk_score::DOUBLE,
speed_kmh::DOUBLE
FROM @citydrive_stage/frame-events/
) FILE_FORMAT=(TYPE='PARQUET');
-- Activate the pipeline so the warehouse refreshes automatically
ALTER TASK task_load_citydrive_videos RESUME;
ALTER TASK task_load_frame_events RESUME;
Streams and tasks can be layered on top if you want incremental refreshes, and TimestampTz columns preserve driving-session offsets across every load.
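If you prefer change-based loads over fixed schedules, a stream on the base table captures only new rows (a sketch using the tables above; `daily_video_counts` is a hypothetical downstream table):

```sql
-- Track appends to citydrive_videos since the last consumption
CREATE OR REPLACE STREAM citydrive_videos_stream ON TABLE citydrive_videos;

-- Consume just the delta into a downstream summary table
INSERT INTO daily_video_counts
SELECT capture_date, COUNT(*)
FROM citydrive_videos_stream
GROUP BY capture_date;
```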
1. Relational SQL Analytics (Guide: SQL Analytics)
Base tables keep the canonical fleet metrics:
CREATE OR REPLACE TABLE citydrive_videos (...);
CREATE OR REPLACE TABLE frame_events (...);
WITH recent_videos AS (
SELECT * FROM citydrive_videos
WHERE capture_date >= '2025-01-01' AND capture_date < '2025-01-04'
)
SELECT v.video_id, v.route_name, COUNT(f.frame_id) AS flagged_frames
FROM recent_videos v
LEFT JOIN frame_events f USING(video_id)
GROUP BY v.video_id, v.route_name
ORDER BY flagged_frames DESC;
From here, `LATERAL FLATTEN` can unnest the JSON arrays stored in `frame_metadata_catalog`, while `ROLLUP` and `CUBE` groupings roll these counts up into dashboard-ready summaries.
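A ROLLUP sketch over the same tables, producing per-route subtotals and a grand total in one pass:

```sql
-- NULLs in route_name/weather mark the subtotal and grand-total rows
SELECT v.route_name, v.weather, COUNT(f.frame_id) AS flagged_frames
FROM citydrive_videos v
LEFT JOIN frame_events f USING(video_id)
GROUP BY ROLLUP(v.route_name, v.weather)
ORDER BY flagged_frames DESC;
```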
2. JSON & Search (Guide: JSON & Search)
Inverted indexes allow Elasticsearch-style filtering without duplicating data:
CREATE OR REPLACE TABLE frame_metadata_catalog (
doc_id STRING,
meta_json VARIANT,
captured_at TIMESTAMP,
INVERTED INDEX idx_meta_json(meta_json)
);
SELECT doc_id, captured_at
FROM frame_metadata_catalog
WHERE QUERY('meta_json.scene.weather_code:rain AND meta_json.camera.sensor_view:roof')
ORDER BY captured_at;
The `policy_reference` table function shows which masking or row-access policies guard these metadata columns, keeping compliance checks one query away.
3. Vector Search (Guide: Vector Search)
Semantic search runs beside SQL using shared identifiers:
CREATE OR REPLACE TABLE frame_embeddings (
frame_id STRING,
video_id STRING,
sensor_view STRING,
embedding VECTOR(512),
encoder_build STRING,
created_at TIMESTAMP,
VECTOR INDEX idx_frame_embeddings(embedding) DISTANCE='cosine'
);
WITH query_embedding AS (
SELECT embedding FROM frame_embeddings WHERE frame_id='FRAME-0101'
)
SELECT e.frame_id, e.video_id,
COSINE_DISTANCE(e.embedding, q.embedding) AS distance
FROM frame_embeddings e
CROSS JOIN query_embedding q
ORDER BY distance
LIMIT 3;
SIMD quantization, Decimal64-aware decoding, and vector-index telemetry reduce latency for these lookups, and runtime filtering lets you combine vector hits with relational filters without wasting scans.
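Combining a vector predicate with ordinary relational filters is just more WHERE clauses (a sketch over the tables above; the threshold and filter values are illustrative):

```sql
-- Nearest roof-camera frames to FRAME-0101, restricted to rainy routes
SELECT e.frame_id,
       COSINE_DISTANCE(e.embedding,
           (SELECT embedding FROM frame_embeddings WHERE frame_id = 'FRAME-0101')
       ) AS distance
FROM frame_embeddings e
JOIN citydrive_videos v USING(video_id)
WHERE e.sensor_view = 'roof'
  AND v.weather = 'rain'
ORDER BY distance
LIMIT 5;
```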
4. Geo Analytics (Guide: Geo Analytics)
Geo tables use the same IDs, so “where did this happen?” is just another join:
CREATE OR REPLACE TABLE frame_geo_points (
video_id STRING,
frame_id STRING,
position_wgs84 GEOMETRY,
solution_grade INT,
source_system STRING,
created_at TIMESTAMP
);
SELECT f.frame_id, ST_DISTANCE(g.position_wgs84, s.signal_position) AS meters_to_signal
FROM frame_geo_points g
JOIN signal_contact_points s USING(frame_id)
JOIN frame_events f USING(frame_id)
WHERE ST_WITHIN(g.position_wgs84,
TO_GEOMETRY('SRID=4326;POLYGON((114.05 22.54, 114.13 22.54, 114.13 22.57, 114.05 22.57, 114.05 22.54))'));
HTTP geometry metadata simplifies the process for GIS apps to render these shapes, while timezone-aware columns keep timestamps aligned with the driving corridor.
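For GIS consumers, a session-level setting controls how GEOMETRY values are serialized in results (a sketch, assuming the `geometry_output_format` session setting; the chosen format is illustrative):

```sql
-- Render GEOMETRY values as GeoJSON for this session
SET geometry_output_format = 'GeoJSON';

SELECT frame_id, position_wgs84
FROM frame_geo_points
LIMIT 10;
```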
Tie it all together with one SQL statement that filters JSON labels, finds vector neighbors, and checks distance-to-signal—no extra systems required:
WITH json_hits AS (
SELECT doc_id
FROM frame_metadata_catalog
WHERE QUERY('meta_json.media_meta.tagging.labels:pedestrian')
),
vector_hits AS (
SELECT frame_id
FROM frame_embeddings
WHERE COSINE_DISTANCE(
embedding,
(SELECT embedding FROM frame_embeddings WHERE frame_id='FRAME-0102')
) < 0.3
)
SELECT f.frame_id, v.route_name, s.distance_m
FROM frame_events f
JOIN citydrive_videos v USING(video_id)
JOIN json_hits j ON j.doc_id = f.frame_id
JOIN vector_hits vh ON vh.frame_id = f.frame_id
JOIN signal_contact_points s USING(frame_id);
This single query spans structured facts, JSON metadata, vector similarity, and geospatial distance—all executed in Databend. That’s unified multimodal retrieval.
📘 Follow the guides in order (SQL Analytics → JSON & Search → Vector Search → Geo Analytics) to recreate this exact end-to-end flow on your cluster. One warehouse, every search pattern.
Databend operates as a unified retrieval engine: classic SQL, JSON search, vector similarity, bitmap cohorts, geospatial filters, and lakehouse ETL all share the same tables, tasks, and policies. You can pull insights from CityDrive-style datasets without juggling extra services or pipelines.
Try it today: https://databend.com
The Databend Team
Redefining what's possible with data
Subscribe to our newsletter
Stay informed on feature releases, product roadmap, support, and cloud offerings!