Hey Databend community! 🚀
November delivered 17 nightly releases focused on richer data experiences. We shipped TimestampTz support, HTTP Arrow payloads, RBAC-aware masking policies, three rounds of runtime filter tuning, and inverted-index/VECTOR improvements for multimodal workloads. Query interfaces now support a wide range of formats—from JSON to Arrow IPC to geometry outputs—enabling analytics teams to build more versatile applications faster.
By the Numbers
27 new features, 28 bug fixes, 18 refactors, 6 CI/build improvements, and 27 quality-of-life updates landed between v1.2.834-nightly and v1.2.850-nightly.
Monthly Highlights
🔥 Major New Features
- TimestampTz & ANSI date keywords — New `TimestampTz`, `CURRENT_DATE`, `CURRENT_TIME`, and `DATE ± INTERVAL` semantics keep temporal pipelines portable across regions
- Arrow over HTTP + geometry metadata — Query interface can return Arrow IPC bodies with timezone and geometry format hints for low-latency BI consumption
- UDTF server runtime — User-defined table functions run as managed services with transport diagnostics and Python import caching
- Position-aware SQL — Standard `POSITION` columns are now queryable in SQL-compatible flows, easing migrations from legacy engines
- System streams telemetry — `system.streams` exposes `has_data` plus richer verification output to debug ingestion
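As a quick sketch of the new temporal semantics (literal values are illustrative; exact `TimestampTz` DDL may vary by release):

```sql
-- ANSI keywords and DATE ± INTERVAL arithmetic
SELECT CURRENT_DATE, CURRENT_TIME;
SELECT DATE '2025-11-01' + INTERVAL 7 DAY AS next_week;
```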
🧠 Multimodal Query & Analytics
- Inverted index TOP-N pruning + caching — ORDER BY/LIMIT queries over JSON/Variant text now skip irrelevant segments, combined with refreshed vector quantization scoring
- Decimal64 & Arrow 56 stack — Native-format tables deserialize Decimal64 columns directly and upgraded Arrow/Parquet crates power consistent schema evolution
- Bitmap utilities — `bitmap_to_array` plus new runtime filter mechanisms unlock faster set analytics and cross-modal joins
- Table statistics admin API — Programmatic stats refresh and the `policy_reference` table function provide building blocks for adaptive query planning
- Virtual column everywhere — External tables, selection queries, and LIMIT pipelines now respect virtual-column calculations
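For instance, `bitmap_to_array` round-trips a bitmap back into a plain array for downstream set logic (a minimal sketch; `to_bitmap` and `bitmap_and` are the usual companions):

```sql
-- Build a bitmap from a delimited string, then expand it back into an array
SELECT bitmap_to_array(to_bitmap('1,3,5,7')) AS members;

-- Intersect two cohorts before expanding, for fast set analytics
SELECT bitmap_to_array(bitmap_and(to_bitmap('1,3,5'), to_bitmap('3,5,9'))) AS overlap;
```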
🛡️ Governance & Trust
- Masking policy upgrades — Multi-column `USING` clauses, RBAC enforcement, conflict detection, and drop protection keep privacy rules consistent
- Policy observability — The `policy_reference` table function documents where masking/row access policies are applied for quick compliance audits
- Secure defaults — Sequence lookups, JWT bindings, and MySQL TLS handshakes received guards to prevent privilege escalation or partial sessions
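A sketch of an RBAC-aware masking policy (the policy, table, column, and role names here are hypothetical):

```sql
-- Mask driver emails for everyone except a privileged role (illustrative)
CREATE MASKING POLICY mask_email AS (val STRING)
RETURNS STRING ->
    CASE WHEN current_role() = 'PII_ADMIN' THEN val ELSE '*****' END;

-- Attach the policy to a column; drop protection now blocks removing
-- a policy that is still referenced
ALTER TABLE drivers MODIFY COLUMN email SET MASKING POLICY mask_email;
```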
⚙️ Performance & Operations
- Runtime filters (Parts 1–3) — Adaptive filters now tolerate nullable casts, capture timing, and avoid redundant casts, shaving seconds off distributed joins
- Synchronous backpressure spilling — Spilling decisions are now handled synchronously to keep memory under control without noisy logs
- Parquet dictionary toggle & timezone consistency — Fuse tables gain `enable_parquet_dictionary`; query timezone propagation was added end-to-end (HTTP headers, session state, clamp helpers)
- Meta-service resilience — Wider logging around aggregate spill, an opendal 0.54.1 bump, and refined fetch-add paths reduce hot spots in busy clusters
🐛 Reliability & Quality
- CSV ingest, pivot projections, and aggregate spill regressions were patched
- COUNT DISTINCT, ZIP import, and bitmap array edge cases now behave deterministically
- Password policy descriptors, record-batch conversion, and large JWKS/MariaDB sessions no longer throw runtime exceptions
Spotlight: Multimodal Data Analytics (CityDrive Scenario)
One Databend database now powers every retrieval pattern—relational, JSON, vector, bitmap, geospatial—without spinning up separate services. This is the core value of the Multimodal Data Analytics guides, which center on the CityDrive Intelligence dataset. The same `video_id` and `frame_id` identifiers link every modality, so relational joins, `QUERY()` full-text search, vector lookups, and geo predicates all compose in one engine.
0. Automated Data Ingest
CityDrive exports each batch as Parquet. The Lakehouse ETL guide sets up a reusable stage and task-driven COPY pipeline that continuously ingests every modality:
-- Stage connection to CityDrive’s S3 bucket
CREATE OR REPLACE CONNECTION citydrive_s3
STORAGE_TYPE='s3'
ACCESS_KEY_ID='<AWS_ACCESS_KEY_ID>'
SECRET_ACCESS_KEY='<AWS_SECRET_ACCESS_KEY>';
-- Shared stage pointing at the raw multimodal exports
CREATE OR REPLACE STAGE citydrive_stage
URL='s3://citydrive-lakehouse/raw/'
CONNECTION=(CONNECTION_NAME='citydrive_s3')
FILE_FORMAT=(TYPE='PARQUET');
-- Autonomous ETL: load clip metadata every 10 minutes
CREATE OR REPLACE TASK task_load_citydrive_videos
WAREHOUSE='default'
SCHEDULE=10 MINUTE
AS
COPY INTO citydrive_videos (
video_id,
vehicle_id,
capture_date,
route_name,
weather,
camera_source,
duration_sec
) FROM (
SELECT
video_id::STRING,
vehicle_id::STRING,
capture_date::DATE,
route_name::STRING,
weather::STRING,
camera_source::STRING,
duration_sec::INT
FROM @citydrive_stage/videos/
) FILE_FORMAT=(TYPE='PARQUET');
-- Autonomous ETL: load frame events in lockstep
CREATE OR REPLACE TASK task_load_frame_events
WAREHOUSE='default'
SCHEDULE=10 MINUTE
AS
COPY INTO frame_events (
frame_id,
video_id,
frame_index,
collected_at,
event_tag,
risk_score,
speed_kmh
) FROM (
SELECT
frame_id::STRING,
video_id::STRING,
frame_index::INT,
collected_at::TIMESTAMP,
event_tag::STRING,
risk_score::DOUBLE,
speed_kmh::DOUBLE
FROM @citydrive_stage/frame-events/
) FILE_FORMAT=(TYPE='PARQUET');
-- Activate the pipeline so the warehouse refreshes automatically
ALTER TASK task_load_citydrive_videos RESUME;
ALTER TASK task_load_frame_events RESUME;
Streams and tasks can be layered on top if you want incremental refreshes, and TimestampTz columns preserve driving-session offsets across every load.
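If you prefer change-based loads over fixed schedules, a stream on the base table captures only new rows (a sketch using the tables above; `daily_video_counts` is a hypothetical downstream table):

```sql
-- Track appends to citydrive_videos since the last consumption
CREATE OR REPLACE STREAM citydrive_videos_stream ON TABLE citydrive_videos;

-- Consume just the delta into a downstream summary table
INSERT INTO daily_video_counts
SELECT capture_date, COUNT(*)
FROM citydrive_videos_stream
GROUP BY capture_date;
```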
1. Relational SQL Analytics (Guide: SQL Analytics)
Base tables keep the canonical fleet metrics:
CREATE OR REPLACE TABLE citydrive_videos (...);
CREATE OR REPLACE TABLE frame_events (...);
WITH recent_videos AS (
SELECT * FROM citydrive_videos
WHERE capture_date >= '2025-01-01' AND capture_date < '2025-01-04'
)
SELECT v.video_id, v.route_name, COUNT(f.frame_id) AS flagged_frames
FROM recent_videos v
LEFT JOIN frame_events f USING(video_id)
GROUP BY v.video_id, v.route_name
ORDER BY flagged_frames DESC;
From here, `LATERAL FLATTEN` can unnest the JSON arrays stored in `frame_metadata_catalog`, while `ROLLUP` and `CUBE` groupings roll these counts up into dashboard-ready summaries.
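A ROLLUP sketch over the same tables, producing per-route subtotals and a grand total in one pass:

```sql
-- NULLs in route_name/weather mark the subtotal and grand-total rows
SELECT v.route_name, v.weather, COUNT(f.frame_id) AS flagged_frames
FROM citydrive_videos v
LEFT JOIN frame_events f USING(video_id)
GROUP BY ROLLUP(v.route_name, v.weather)
ORDER BY flagged_frames DESC;
```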
2. JSON & Search (Guide: JSON & Search)
Inverted indexes allow Elasticsearch-style filtering without duplicating data:
CREATE OR REPLACE TABLE frame_metadata_catalog (
doc_id STRING,
meta_json VARIANT,
captured_at TIMESTAMP,
INVERTED INDEX idx_meta_json(meta_json)
);
SELECT doc_id, captured_at
FROM frame_metadata_catalog
WHERE QUERY('meta_json.scene.weather_code:rain AND meta_json.camera.sensor_view:roof')
ORDER BY captured_at;
The `policy_reference` table function shows which masking or row-access policies guard these metadata columns, keeping compliance checks one query away.
3. Vector Search (Guide: Vector Search)
Semantic search runs beside SQL using shared identifiers:
CREATE OR REPLACE TABLE frame_embeddings (
frame_id STRING,
video_id STRING,
sensor_view STRING,
embedding VECTOR(512),
encoder_build STRING,
created_at TIMESTAMP,
VECTOR INDEX idx_frame_embeddings(embedding) DISTANCE='cosine'
);
WITH query_embedding AS (
SELECT embedding FROM frame_embeddings WHERE frame_id='FRAME-0101'
)
SELECT e.frame_id, e.video_id,
COSINE_DISTANCE(e.embedding, q.embedding) AS distance
FROM frame_embeddings e
CROSS JOIN query_embedding q
ORDER BY distance
LIMIT 3;
SIMD quantization, Decimal64-aware decoding, and vector-index telemetry reduce latency for these lookups, and runtime filtering lets you combine vector hits with relational filters without wasting scans.
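Combining a vector predicate with ordinary relational filters is just more WHERE clauses (a sketch over the tables above; the threshold and filter values are illustrative):

```sql
-- Nearest roof-camera frames to FRAME-0101, restricted to rainy routes
SELECT e.frame_id,
       COSINE_DISTANCE(e.embedding,
           (SELECT embedding FROM frame_embeddings WHERE frame_id = 'FRAME-0101')
       ) AS distance
FROM frame_embeddings e
JOIN citydrive_videos v USING(video_id)
WHERE e.sensor_view = 'roof'
  AND v.weather = 'rain'
ORDER BY distance
LIMIT 5;
```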
4. Geo Analytics (Guide: Geo Analytics)
Geo tables use the same IDs, so “where did this happen?” is just another join:
CREATE OR REPLACE TABLE frame_geo_points (
video_id STRING,
frame_id STRING,
position_wgs84 GEOMETRY,
solution_grade INT,
source_system STRING,
created_at TIMESTAMP
);
SELECT f.frame_id, ST_DISTANCE(g.position_wgs84, s.signal_position) AS meters_to_signal
FROM frame_geo_points g
JOIN signal_contact_points s USING(frame_id)
JOIN frame_events f USING(frame_id)
WHERE ST_WITHIN(g.position_wgs84,
TO_GEOMETRY('SRID=4326;POLYGON((114.05 22.54, 114.13 22.54, 114.13 22.57, 114.05 22.57, 114.05 22.54))'));
HTTP geometry metadata simplifies the process for GIS apps to render these shapes, while timezone-aware columns keep timestamps aligned with the driving corridor.
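For GIS consumers, a session-level setting controls how GEOMETRY values are serialized in results (a sketch, assuming the `geometry_output_format` session setting; the chosen format is illustrative):

```sql
-- Render GEOMETRY values as GeoJSON for this session
SET geometry_output_format = 'GeoJSON';

SELECT frame_id, position_wgs84
FROM frame_geo_points
LIMIT 10;
```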
Tie it all together with one SQL statement that filters JSON labels, finds vector neighbors, and checks distance-to-signal—no extra systems required:
WITH json_hits AS (
SELECT doc_id
FROM frame_metadata_catalog
WHERE QUERY('meta_json.media_meta.tagging.labels:pedestrian')
),
vector_hits AS (
SELECT frame_id
FROM frame_embeddings
WHERE COSINE_DISTANCE(
embedding,
(SELECT embedding FROM frame_embeddings WHERE frame_id='FRAME-0102')
) < 0.3
)
SELECT f.frame_id, v.route_name, s.distance_m
FROM frame_events f
JOIN citydrive_videos v USING(video_id)
JOIN json_hits j ON j.doc_id = f.frame_id
JOIN vector_hits vh ON vh.frame_id = f.frame_id
JOIN signal_contact_points s USING(frame_id);
This single query spans structured facts, JSON metadata, vector similarity, and geospatial distance—all executed in Databend. That’s unified multimodal retrieval.
📘 Follow the guides in order (SQL Analytics → JSON & Search → Vector Search → Geo Analytics) to recreate this exact end-to-end flow on your cluster. One warehouse, every search pattern.
Databend operates as a unified retrieval engine: classic SQL, JSON search, vector similarity, bitmap cohorts, geospatial filters, and lakehouse ETL all share the same tables, tasks, and policies. You can pull insights from CityDrive-style datasets without juggling extra services or pipelines.
Try it today: https://databend.com
The Databend Team
Redefining what's possible with data
Subscribe to our newsletter
Stay informed on feature releases, product roadmap, support, and cloud offerings!