Deep Dive into Databend UDFs: Implementing Your Data Solutions with Python, WASM, and Beyond
DatabendLabs · Jun 26, 2025
Introduction
"How do we extract/summarize/complete/validate our data using LLM functions?"
"Can we detect anomalies in this time-series data using our existing proprietary functions?"
"We need to analyze sentiment in these customer reviews, but there's nothing built-in for that."
"How do we connect our gaming matchmaking system to our analytics pipeline?"
If these questions sound familiar, you're not alone. As data teams scale their analytics capabilities, they inevitably hit the limitations of built-in database functions. That's where Databend's User-Defined Functions (UDFs) come in – and they're transforming how we approach custom data processing.
Why UDFs Matter for Modern Data Teams
From financial services to e-commerce to gaming, organizations are leveraging UDFs to:
- Embed domain expertise and proprietary business logic directly into data pipelines
- Process data where it lives by eliminating unnecessary data movement while maintaining security and compliance
- Connect to specialized services – enabling companies to use the same algorithms in both operations and analytics
- Extend analytics capabilities without waiting for feature requests
Databend's UDF Ecosystem: Power Meets Flexibility
What makes Databend's approach to UDFs particularly powerful is its multi-language support and deployment options:
- Lambda UDFs: Quick SQL expressions for simple transformations
- Python & JavaScript UDFs: Familiar languages for complex logic and ML
- WASM UDFs: Near-native performance for compute-intensive operations
- External UDF Servers: Scalable microservices for enterprise workloads
In the following sections, we'll explore each approach with practical examples and performance insights to help you choose the right tool for your specific needs.
Lambda UDFs: SQL-Powered Simplicity
Lambda UDFs are the simplest form of user-defined functions in Databend, allowing you to define custom expressions directly in SQL. Think of them as SQL "shorthand" that encapsulates frequently used logic into reusable components.
When to Use Lambda UDFs
Lambda UDFs excel in scenarios where you need:
- Data cleansing and standardization across multiple queries
- Simple calculations that combine several built-in functions
- Text manipulation and pattern matching beyond basic SQL capabilities
For data engineers working with messy data sources or implementing business rules, Lambda UDFs let you define logic once as a lambda expression and reuse it everywhere, ensuring consistency and reducing query complexity.
Syntax and Examples
-- Simple data cleaning function
CREATE FUNCTION clean_phone AS (s) ->
regexp_replace(s, '[^0-9]', '');
-- More complex example with multiple parameters and logic
CREATE FUNCTION calculate_discount AS (base_price, customer_tier, purchase_history) ->
CASE
WHEN customer_tier = 'premium' AND purchase_history > 10000 THEN base_price * 0.85
WHEN customer_tier = 'premium' THEN base_price * 0.90
WHEN purchase_history > 5000 THEN base_price * 0.95
ELSE base_price
END;
-- Using the functions in a query
SELECT
customer_id,
clean_phone(raw_phone_number) AS standardized_phone,
calculate_discount(item_price, customer_tier, ytd_purchases) AS discounted_price
FROM customer_orders;
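Because Lambda UDFs are plain SQL expressions, their logic is easy to mirror in a host language for unit testing. Here is a minimal Python sketch (a hypothetical test harness, not part of Databend) that replicates both functions so you can verify expected results:

```python
import re

def clean_phone(s: str) -> str:
    # Mirror of the SQL regexp_replace: strip every non-digit character.
    return re.sub(r"[^0-9]", "", s)

def calculate_discount(base_price: float, customer_tier: str, purchase_history: float) -> float:
    # Mirror of the SQL CASE expression, evaluated top to bottom.
    if customer_tier == "premium" and purchase_history > 10000:
        return base_price * 0.85
    if customer_tier == "premium":
        return base_price * 0.90
    if purchase_history > 5000:
        return base_price * 0.95
    return base_price

print(clean_phone("+1 (555) 123-4567"))  # 15551234567
print(round(calculate_discount(100.0, "premium", 12000), 2))
```

Keeping such a mirror in your test suite makes it cheap to catch regressions when the SQL definition changes.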
UDF Scripts: Enhancing Your Data with Python, JavaScript, and WASM
While Lambda UDFs are perfect for simple SQL expressions, more complex data processing often requires the full power of a programming language. Databend's UDF Script capability allows you to embed Python and JavaScript code directly into your data pipeline, or compile Rust, Go, C++, or Zig to WebAssembly, bringing sophisticated algorithms right to where your data lives.
Python UDFs: A Taste of Data Science
Python UDFs unlock the vast ecosystem of Python libraries for data science, machine learning, and AI. This is particularly valuable when you need to apply complex algorithms or connect your data natively to external services.
Sample use case: E-Commerce Order Validation
An e-commerce retailer wants to ensure that shipping addresses entered by customers are clean and valid before fulfillment. The data team automates the process of standardizing address strings and flagging obviously invalid entries (such as addresses missing a street number).
To achieve this, the data engineer writes a Python UDF in Databend (using only Python’s standard library). The UDF receives a raw address string, trims whitespace, converts it to title case, and checks if it contains at least one digit (to indicate a street number). It returns a standardized address if valid, or NULL otherwise.
CREATE FUNCTION standardize_address(STRING)
RETURNS STRING
LANGUAGE python
HANDLER = 'clean_address'
AS $$
def clean_address(address):
if not address or not isinstance(address, str):
return None
cleaned = address.strip().title()
# Check for at least one digit (street number)
if any(char.isdigit() for char in cleaned):
return cleaned
return None
$$;
-- Use the UDF to clean and validate addresses in orders
SELECT
order_id,
customer_id,
standardize_address(shipping_address) AS standardized_address
FROM orders
WHERE standardize_address(shipping_address) IS NOT NULL;
JavaScript UDFs: Lightweight Data Wrangling with Security
JavaScript UDFs in Databend provide a secure, sandboxed environment for data transformations. They are particularly well-suited for handling semi-structured data like JSON, offering a balance of flexibility, security, and performance for common data preparation tasks.
Sample use case: E-commerce Event Log Processing
An e-commerce platform captures raw user interaction events as JSON objects. Before this data can be used for analytics, the data engineering team needs to clean, enrich, and standardize it. Specifically, they need to:
- Redact PII: Remove sensitive information like IP addresses.
- Enrich Data: Add a processing timestamp to each event.
- Standardize Fields: Ensure certain fields, like country codes, adhere to a consistent format.
To handle this, they implement a JavaScript UDF in Databend.
-- This UDF processes a JSON event payload:
-- 1. Deletes the 'ip_address' field from 'user_details'.
-- 2. Adds a 'processed_at_udf' timestamp to 'metadata'.
-- 3. Converts 'country_code' in 'location_details' to uppercase.
-- 4. Adds a default 'source' if missing in 'event_properties'.
CREATE FUNCTION process_event_payload(VARIANT)
RETURNS VARIANT
LANGUAGE javascript
HANDLER = 'transform_event'
AS $$
export function transform_event(event) {
// Ensure event is a valid object
if (!event || typeof event !== 'object') {
return null;
}
// 1. Delete PII (e.g., user's IP address)
if (event.user_details && typeof event.user_details === 'object') {
delete event.user_details.ip_address;
}
// 2. Add a new field (e.g., processing timestamp)
if (typeof event.metadata !== 'object' || event.metadata === null) {
event.metadata = {}; // Initialize if not an object or is null
}
event.metadata.processed_at_udf = new Date().toISOString();
// 3. Update/Standardize a field (e.g., ensure country code is uppercase)
if (event.location_details && typeof event.location_details.country_code === 'string') {
event.location_details.country_code = event.location_details.country_code.toUpperCase();
}
// 4. Add a default value if a field is missing
if (typeof event.event_properties !== 'object' || event.event_properties === null) {
event.event_properties = {}; // Initialize if not an object or is null
}
if (typeof event.event_properties.source === 'undefined') {
event.event_properties.source = 'unknown_source_js_udf';
}
return event;
}
$$;
-- Using the JavaScript UDF to transform raw event data
WITH raw_events AS (
SELECT PARSE_JSON('{
"event_id": "evt_123",
"user_details": { "user_id": "usr_abc", "ip_address": "192.168.1.100", "email": "[email protected]" },
"location_details": { "city": "San Francisco", "country_code": "us" },
"event_properties": { "page_url": "/products/awesome-widget" }
}') AS payload
UNION ALL
SELECT PARSE_JSON('{
"event_id": "evt_456",
"user_details": { "user_id": "usr_def", "ip_address": "10.0.0.5" },
"location_details": { "city": "London", "country_code": "gb" },
"event_properties": null
}') AS payload
)
SELECT
payload:event_id::STRING AS original_event_id,
process_event_payload(payload) AS processed_payload
FROM raw_events;
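When the transformation rules live in a JavaScript UDF, it can help to keep an equivalent implementation in your test language of choice as a cross-check. A Python mirror of the same four steps (a hypothetical helper, not part of Databend):

```python
from datetime import datetime, timezone

def transform_event(event):
    # Python mirror of the JavaScript transform_event handler above.
    if not isinstance(event, dict):
        return None
    # 1. Redact PII (e.g., the user's IP address).
    if isinstance(event.get("user_details"), dict):
        event["user_details"].pop("ip_address", None)
    # 2. Add a processing timestamp.
    if not isinstance(event.get("metadata"), dict):
        event["metadata"] = {}
    event["metadata"]["processed_at_udf"] = datetime.now(timezone.utc).isoformat()
    # 3. Standardize the country code.
    loc = event.get("location_details")
    if isinstance(loc, dict) and isinstance(loc.get("country_code"), str):
        loc["country_code"] = loc["country_code"].upper()
    # 4. Default the event source if missing.
    if not isinstance(event.get("event_properties"), dict):
        event["event_properties"] = {}
    event["event_properties"].setdefault("source", "unknown_source_js_udf")
    return event

evt = {"event_id": "evt_456",
       "user_details": {"user_id": "usr_def", "ip_address": "10.0.0.5"},
       "location_details": {"city": "London", "country_code": "gb"},
       "event_properties": None}
out = transform_event(evt)
print(out["location_details"]["country_code"])  # GB
```

Feeding the same sample payloads through both implementations and comparing outputs (ignoring the timestamp) is a simple way to catch drift between the two.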
WASM UDFs: Near-Native Performance Without Sacrificing Security
For computationally intensive operations requiring high performance, WebAssembly (WASM) UDFs enable near-native execution speeds within a secure, sandboxed environment. Unlike Python and JavaScript UDFs, which interpret source code at runtime, WASM UDFs are compiled to a binary format that executes efficiently and securely directly in the engine.
First, set up your Rust environment and add the arrow-udf crate to your project's dependencies. Then implement your function:
// Rust code (compiled to WASM)
use arrow_udf::function;
#[function("fib(int) -> int")]
fn fib(n: i32) -> i32 {
let (mut a, mut b) = (0, 1);
for _ in 0..n {
let c = a + b;
a = b;
b = c;
}
a
}
Compile your Rust code to the wasm32-wasip1 target:
cargo build --release --target wasm32-wasip1
Then upload the compiled WASM file (e.g., arrow_udf_example.wasm from target/wasm32-wasip1/release/) to a Databend stage:
-- Create a stage if you don't have one
CREATE STAGE IF NOT EXISTS my_wasm_stage;
-- Upload the .wasm file to the stage (replace with your actual path and stage name)
-- Example: PUT fs:///path/to/your/project/target/wasm32-wasip1/release/arrow_udf_example.wasm @my_wasm_stage;
PUT fs:///tmp/arrow_udf_example.wasm @my_wasm_stage;
-- Create the WASM UDF
CREATE OR REPLACE FUNCTION fib_wasm (INT)
RETURNS INT
LANGUAGE wasm
HANDLER = 'fib'
AS $$@my_wasm_stage/arrow_udf_example.wasm$$;
-- Using the WASM UDF
SELECT fib_wasm(10) AS fibonacci_number;
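When verifying the WASM build, it helps to have a reference implementation on hand. The same iterative Fibonacci in a few lines of Python (for checking outputs only, not part of the UDF):

```python
def fib(n: int) -> int:
    # Reference implementation mirroring the Rust function above.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(10))  # 55, which should match SELECT fib_wasm(10)
```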
External UDF Servers: Connecting Your Data Seamlessly to Your Existing Services
If you want computed data to interact seamlessly with your business services, External UDF servers are worth serious consideration. These servers communicate with Databend via the Arrow Flight protocol, allowing you to decouple your UDF logic from the database and integrate with existing microservices or specialized computation engines.
For instance, one of our biggest customers in the gaming industry leverages External UDF servers to process real-time, sub-second computed data. This data is crucial for their game matching algorithms and feature engineering services, handling over 1,000 QPS with a P99 latency of less than 200ms. This showcases the power and scalability of External UDFs for demanding, low-latency applications.
Here's how you can set up and use an External UDF Server, using a Fibonacci example implemented in Python:
1. Implement and Run the External UDF Server (Python Example):
First, install the databend_udf package:
pip install databend-udf
Then, create a Python script (e.g., my_udf_server.py):
from databend_udf import udf, UDFServer
import logging
logging.basicConfig(level=logging.INFO)
@udf(
input_types=["INT"], # Corresponds to Databend's INT type
result_type="INT", # Corresponds to Databend's INT type
skip_null=True, # Optional: if True, null inputs won't be passed to the function
)
def fib(n: int) -> int:
a, b = 0, 1
for _ in range(n):
a, b = b, a + b
return a
if __name__ == "__main__":
# The UDF server will listen on this address
udf_server = UDFServer(
location="0.0.0.0:8815"
)
# Register the function with the server
udf_server.add_function(fib)
# Start the server
udf_server.serve()
Run this Python script to start a server listening on 0.0.0.0:8815:
python my_udf_server.py
2. Create the External Function in Databend:
Once your UDF server is running, you can register it in Databend:
-- Create the external function, pointing to the running server
CREATE OR REPLACE FUNCTION fib_external (val INT)
RETURNS INT
LANGUAGE python
HANDLER = 'fib' -- This must match the function name in your Python script
ADDRESS = 'http://0.0.0.0:8815'; -- The address of your UDF server
3. Using the External UDF:
Now you can call this function in your SQL queries just like any other UDF:
SELECT fib_external(10) AS fibonacci_number;
-- Expected output: 55
Best Practices
Most Databend users don’t care about UDF theory—they care about what works, what breaks, and what saves time. After seeing dozens of teams (and making plenty of mistakes myself), here’s the honest truth:
If you can do it in SQL, do it in SQL. Lambda UDFs are instant, transparent, and you’ll thank yourself later when you need to debug at 2am. But don’t force it—once your logic starts looking like a regex contest or you’re copying the same formula everywhere, switch gears.
Python and JavaScript UDFs are the real workhorses. Python is the default for anything that smells like business logic, validation, LLM, RAG, or data wrangling. JavaScript is the secret weapon for JSON or semi-structured messes—especially if you’re coming from a web background. Both are easy to write, but don’t expect magic performance. If you’re processing millions of rows per second, you’ll hit a wall. For most ETL, though? They’re perfect.
WASM is for the performance-obsessed. If you’re building a recommendation engine, crunching numbers for a game, or your boss keeps asking “can this be faster?”—bite the bullet, learn some Rust, and ship a WASM UDF. It’s not easy, but it’s the only way to get close to native speed inside Databend. Most people never need it. If you do, you’ll know.
External UDF Servers are for teams with real infrastructure—when you already have a service, a model, or a matching engine you trust, and you want Databend to tap into it. There’s network overhead, but you get freedom: scale compute separately, deploy in your own stack, and keep business logic where it belongs. Just remember to batch requests, or you’ll be debugging latency charts all day.
We ran a basic benchmark of the average execution time per row for each Databend UDF type using the following SQL. Keep in mind that performance is not the only factor when choosing a UDF type: measure your specific use case, and also weigh qualities like simplicity, ease of debugging, and ease of maintenance before deciding.
SELECT fib((n % 10)::Int32) FROM range(1, 1000000) t(n) IGNORE_RESULT;
Here are the results:
| UDF Type | Avg. Time per Row (µs) | Typical Use Case |
|---|---|---|
| Lambda UDF | - | Simple transforms, prototyping |
| Python UDF | 0.18 | Complex logic, AI integration |
| JavaScript UDF | 2.68 | Lightweight data processing |
| WASM UDF | 0.11 | High-performance computation |
| External UDF | 23.2 | Large-scale, distributed workloads |
Note: External UDF timing includes network overhead; actual compute time is often less.
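If you want a rough local baseline for the pure handler cost, with all engine and serialization overhead stripped away, a quick timeit sketch works. The figures it prints are illustrative only and are not the benchmark above:

```python
import timeit

def fib(n: int) -> int:
    # Same function used in the benchmark query.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

rows = 100_000
# Mimic the benchmark query's workload: fib(n % 10) over many rows.
total = timeit.timeit(lambda: [fib(n % 10) for n in range(rows)], number=1)
print(f"~{total / rows * 1e6:.3f} µs per row (pure Python, no engine overhead)")
```

Comparing this baseline with the in-engine numbers gives a feel for how much of the per-row cost is your logic versus the UDF machinery around it.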
The best teams start simple, refactor when it hurts, and always measure before optimizing. Don’t overthink it. Every UDF type has its place—what matters is using the right one for your problem, not the fanciest one in the docs.
If you’re still unsure, ask around—most Databend users are happy to share what worked (and what didn’t) in production.