Deep Dive into Databend UDFs: Implementing Your Data Solutions with Python, WASM, and Beyond
DatabendLabs · Jun 26, 2025
Introduction
"How do we extract/summarize/complete/validate our data using LLM functions?"
"Can we detect anomalies in this time-series data using our existing proprietary functions?"
"We need to analyze sentiment in these customer reviews, but there's nothing built-in for that."
"How do we connect our gaming matchmaking system to our analytics pipeline?"
If these questions sound familiar, you're not alone. As data teams scale their analytics capabilities, they inevitably hit the limitations of built-in database functions. That's where Databend's User-Defined Functions (UDFs) come in – and they're transforming how we approach custom data processing.
Why UDFs Matter for Modern Data Teams
From financial services to e-commerce to gaming, organizations are leveraging UDFs to:
- Embed domain expertise and proprietary business logic directly into data pipelines
- Process data where it lives by eliminating unnecessary data movement while maintaining security and compliance
- Connect to specialized services – enabling companies to use the same algorithms in both operations and analytics
- Extend analytics capabilities without waiting for feature requests
Databend's UDF Ecosystem: Power Meets Flexibility
What makes Databend's approach to UDFs particularly powerful is its multi-language support and deployment options:
- Lambda UDFs: Quick SQL expressions for simple transformations
- Python & JavaScript UDFs: Familiar languages for complex logic and ML
- WASM UDFs: Near-native performance for compute-intensive operations
- External UDF Servers: Scalable microservices for enterprise workloads
In the following sections, we'll explore each approach with practical examples and performance insights to help you choose the right tool for your specific needs.
Lambda UDFs: SQL-Powered Simplicity
Lambda UDFs are the simplest form of user-defined functions in Databend, allowing you to define custom expressions directly in SQL. Think of them as SQL "shorthand" that encapsulates frequently used logic into reusable components.
When to Use Lambda UDFs
Lambda UDFs excel in scenarios where you need:
- Data cleansing and standardization across multiple queries
- Simple calculations that combine several built-in functions
- Text manipulation and pattern matching beyond basic SQL capabilities
For data engineers working with messy data sources or implementing business rules, Lambda UDFs let you define logic once as a lambda expression and reuse it everywhere, ensuring consistency and reducing query complexity.
Syntax and Examples
-- Simple data cleaning function
CREATE FUNCTION clean_phone AS (s) ->
regexp_replace(s, '[^0-9]', '');
-- More complex example with multiple parameters and logic
CREATE FUNCTION calculate_discount AS (base_price, customer_tier, purchase_history) ->
CASE
WHEN customer_tier = 'premium' AND purchase_history > 10000 THEN base_price * 0.85
WHEN customer_tier = 'premium' THEN base_price * 0.90
WHEN purchase_history > 5000 THEN base_price * 0.95
ELSE base_price
END;
-- Using the functions in a query
SELECT
customer_id,
clean_phone(raw_phone_number) AS standardized_phone,
calculate_discount(item_price, customer_tier, ytd_purchases) AS discounted_price
FROM customer_orders;
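Because Lambda UDFs are plain SQL expressions, their logic is easy to mirror in a host language for unit testing. Here is a minimal Python sketch (a hypothetical test harness, not part of Databend) that replicates both functions so you can verify expected results:

```python
import re

def clean_phone(s: str) -> str:
    # Mirror of the SQL regexp_replace: strip every non-digit character.
    return re.sub(r"[^0-9]", "", s)

def calculate_discount(base_price: float, customer_tier: str, purchase_history: float) -> float:
    # Mirror of the SQL CASE expression, evaluated top to bottom.
    if customer_tier == "premium" and purchase_history > 10000:
        return base_price * 0.85
    if customer_tier == "premium":
        return base_price * 0.90
    if purchase_history > 5000:
        return base_price * 0.95
    return base_price

print(clean_phone("+1 (555) 123-4567"))  # 15551234567
print(round(calculate_discount(100.0, "premium", 12000), 2))
```

Keeping such a mirror in your test suite makes it cheap to catch regressions when the SQL definition changes.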
UDF Scripts: Enhancing Your Data with Python, JavaScript, and WASM
While Lambda UDFs are perfect for simple SQL expressions, more complex data processing often requires the full power of a programming language. Databend's UDF Script capability allows you to embed Python and JavaScript code directly into your data pipeline, or compile Rust, Go, C++, or Zig to WebAssembly, bringing sophisticated algorithms right to where your data lives.
Python UDFs: A Taste of Data Science
Python UDFs unlock the vast ecosystem of Python libraries for data science, machine learning, and AI. This is particularly valuable when you need to apply complex algorithms or connect your data natively to external services.
Sample use case: E-Commerce Order Validation
An e-commerce retailer wants to ensure that shipping addresses entered by customers are clean and valid before fulfillment. The data team automates the process of standardizing address strings and flagging obviously invalid entries (such as addresses missing a street number).
To achieve this, the data engineer writes a Python UDF in Databend (using only Python’s standard library). The UDF receives a raw address string, trims whitespace, converts it to title case, and checks if it contains at least one digit (to indicate a street number). It returns a standardized address if valid, or NULL otherwise.
CREATE FUNCTION standardize_address(STRING)
RETURNS STRING
LANGUAGE python
HANDLER = 'clean_address'
AS $$
def clean_address(address):
if not address or not isinstance(address, str):
return None
cleaned = address.strip().title()
# Check for at least one digit (street number)
if any(char.isdigit() for char in cleaned):
return cleaned
return None
$$;
-- Use the UDF to clean and validate addresses in orders
SELECT
order_id,
customer_id,
standardize_address(shipping_address) AS standardized_address
FROM orders
WHERE standardize_address(shipping_address) IS NOT NULL;
JavaScript UDFs: Lightweight Data Wrangling with Security
JavaScript UDFs in Databend provide a secure, sandboxed environment for data transformations. They are particularly well-suited for handling semi-structured data like JSON, offering a balance of flexibility, security, and performance for common data preparation tasks.
Sample use case: E-commerce Event Log Processing
An e-commerce platform captures raw user interaction events as JSON objects. Before this data can be used for analytics, the data engineering team needs to clean, enrich, and standardize it. Specifically, they need to:
- Redact PII: Remove sensitive information like IP addresses.
- Enrich Data: Add a processing timestamp to each event.
- Standardize Fields: Ensure certain fields, like country codes, adhere to a consistent format.
To handle this, they implement a JavaScript UDF in Databend.
-- This UDF processes a JSON event payload:
-- 1. Deletes the 'ip_address' field from 'user_details'.
-- 2. Adds a 'processed_at_udf' timestamp to 'metadata'.
-- 3. Converts 'country_code' in 'location_details' to uppercase.
-- 4. Adds a default 'source' if missing in 'event_properties'.
CREATE FUNCTION process_event_payload(VARIANT)
RETURNS VARIANT
LANGUAGE javascript
HANDLER = 'transform_event'
AS $$
export function transform_event(event) {
// Ensure event is a valid object
if (!event || typeof event !== 'object') {
return null;
}
// 1. Delete PII (e.g., user's IP address)
if (event.user_details && typeof event.user_details === 'object') {
delete event.user_details.ip_address;
}
// 2. Add a new field (e.g., processing timestamp)
if (typeof event.metadata !== 'object' || event.metadata === null) {
event.metadata = {}; // Initialize if not an object or is null
}
event.metadata.processed_at_udf = new Date().toISOString();
// 3. Update/Standardize a field (e.g., ensure country code is uppercase)
if (event.location_details && typeof event.location_details.country_code === 'string') {
event.location_details.country_code = event.location_details.country_code.toUpperCase();
}
// 4. Add a default value if a field is missing
if (typeof event.event_properties !== 'object' || event.event_properties === null) {
event.event_properties = {}; // Initialize if not an object or is null
}
if (typeof event.event_properties.source === 'undefined') {
event.event_properties.source = 'unknown_source_js_udf';
}
return event;
}
$$;
-- Using the JavaScript UDF to transform raw event data
WITH raw_events AS (
SELECT PARSE_JSON('{
"event_id": "evt_123",
"user_details": { "user_id": "usr_abc", "ip_address": "192.168.1.100", "email": "[email protected]" },
"location_details": { "city": "San Francisco", "country_code": "us" },
"event_properties": { "page_url": "/products/awesome-widget" }
}') AS payload
UNION ALL
SELECT PARSE_JSON('{
"event_id": "evt_456",
"user_details": { "user_id": "usr_def", "ip_address": "10.0.0.5" },
"location_details": { "city": "London", "country_code": "gb" },
"event_properties": null
}') AS payload
)
SELECT
payload:event_id::STRING AS original_event_id,
process_event_payload(payload) AS processed_payload
FROM raw_events;
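When the transformation rules live in a JavaScript UDF, it can help to keep an equivalent implementation in your test language of choice as a cross-check. A Python mirror of the same four steps (a hypothetical helper, not part of Databend):

```python
from datetime import datetime, timezone

def transform_event(event):
    # Python mirror of the JavaScript transform_event handler above.
    if not isinstance(event, dict):
        return None
    # 1. Redact PII (e.g., the user's IP address).
    if isinstance(event.get("user_details"), dict):
        event["user_details"].pop("ip_address", None)
    # 2. Add a processing timestamp.
    if not isinstance(event.get("metadata"), dict):
        event["metadata"] = {}
    event["metadata"]["processed_at_udf"] = datetime.now(timezone.utc).isoformat()
    # 3. Standardize the country code.
    loc = event.get("location_details")
    if isinstance(loc, dict) and isinstance(loc.get("country_code"), str):
        loc["country_code"] = loc["country_code"].upper()
    # 4. Default the event source if missing.
    if not isinstance(event.get("event_properties"), dict):
        event["event_properties"] = {}
    event["event_properties"].setdefault("source", "unknown_source_js_udf")
    return event

evt = {"event_id": "evt_456",
       "user_details": {"user_id": "usr_def", "ip_address": "10.0.0.5"},
       "location_details": {"city": "London", "country_code": "gb"},
       "event_properties": None}
out = transform_event(evt)
print(out["location_details"]["country_code"])  # GB
```

Feeding the same sample payloads through both implementations and comparing outputs (ignoring the timestamp) is a simple way to catch drift between the two.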
WASM UDFs: Near-Native Performance Without Sacrificing Security
For computationally intensive operations requiring high performance, WebAssembly (WASM) UDFs enable near-native execution speeds within a secure, sandboxed environment. Unlike Python and JavaScript UDFs, which interpret source code at runtime, WASM UDFs are compiled to a binary format that executes efficiently and securely directly in the engine.
First, set up your Rust environment and add the arrow-udf crate to your project's dependencies. Then implement your function:
// Rust code (compiled to WASM)
use arrow_udf::function;
#[function("fib(int) -> int")]
fn fib(n: i32) -> i32 {
let (mut a, mut b) = (0, 1);
for _ in 0..n {
let c = a + b;
a = b;
b = c;
}
a
}
Compile your Rust code to the wasm32-wasip1 target:
cargo build --release --target wasm32-wasip1
Then upload the compiled WASM file (e.g., arrow_udf_example.wasm from target/wasm32-wasip1/release/) to a Databend stage:
-- Create a stage if you don't have one
CREATE STAGE IF NOT EXISTS my_wasm_stage;
-- Upload the .wasm file to the stage (replace with your actual path and stage name)
-- Example: PUT fs:///path/to/your/project/target/wasm32-wasip1/release/arrow_udf_example.wasm @my_wasm_stage;
PUT fs:///tmp/arrow_udf_example.wasm @my_wasm_stage;
-- Create the WASM UDF
CREATE OR REPLACE FUNCTION fib_wasm (INT)
RETURNS INT
LANGUAGE wasm
HANDLER = 'fib'
AS $$@my_wasm_stage/arrow_udf_example.wasm$$;
-- Using the WASM UDF
SELECT fib_wasm(10) AS fibonacci_number;
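When verifying the WASM build, it helps to have a reference implementation on hand. The same iterative Fibonacci in a few lines of Python (for checking outputs only, not part of the UDF):

```python
def fib(n: int) -> int:
    # Reference implementation mirroring the Rust function above.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(10))  # 55, which should match SELECT fib_wasm(10)
```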
External UDF Servers: Connecting Your Data Seamlessly to Your Existing Services
If you want computed data to interact seamlessly with your business services, External UDF servers are worth serious consideration. These servers communicate with Databend via the Arrow Flight protocol, allowing you to decouple your UDF logic from the database and integrate with existing microservices or specialized computation engines.
For instance, one of our biggest customers in the gaming industry leverages External UDF servers to process real-time, sub-second computed data. This data is crucial for their game matching algorithms and feature engineering services, handling over 1,000 QPS with a P99 latency of less than 200ms. This showcases the power and scalability of External UDFs for demanding, low-latency applications.
Here's how you can set up and use an External UDF Server, using a Fibonacci example implemented in Python:
1. Implement and Run the External UDF Server (Python Example):
First, install the databend_udf package:
pip install databend-udf
Then, create a Python script (e.g., my_udf_server.py):
from databend_udf import udf, UDFServer
import logging
logging.basicConfig(level=logging.INFO)
@udf(
input_types=["INT"], # Corresponds to Databend's INT type
result_type="INT", # Corresponds to Databend's INT type
skip_null=True, # Optional: if True, null inputs won't be passed to the function
)
def fib(n: int) -> int:
a, b = 0, 1
for _ in range(n):
a, b = b, a + b
return a
if __name__ == "__main__":
# The UDF server will listen on this address
udf_server = UDFServer(
location="0.0.0.0:8815"
)
# Register the function with the server
udf_server.add_function(fib)
# Start the server
udf_server.serve()
Run this Python script to start a server listening on 0.0.0.0:8815:
python my_udf_server.py
2. Create the External Function in Databend:
Once your UDF server is running, you can register it in Databend:
-- Create the external function, pointing to the running server
CREATE OR REPLACE FUNCTION fib_external (val INT)
RETURNS INT
LANGUAGE python
HANDLER = 'fib' -- This must match the function name in your Python script
ADDRESS = 'http://0.0.0.0:8815'; -- The address of your UDF server
3. Using the External UDF:
Now you can call this function in your SQL queries just like any other UDF:
SELECT fib_external(10) AS fibonacci_number;
-- Expected output: 55
Best Practices
Most Databend users don’t care about UDF theory—they care about what works, what breaks, and what saves time. After seeing dozens of teams (and making plenty of mistakes myself), here’s the honest truth:
If you can do it in SQL, do it in SQL. Lambda UDFs are instant, transparent, and you’ll thank yourself later when you need to debug at 2am. But don’t force it—once your logic starts looking like a regex contest or you’re copying the same formula everywhere, switch gears.
Python and JavaScript UDFs are the real workhorses. Python is the default for anything that smells like business logic, validation, LLM, RAG, or data wrangling. JavaScript is the secret weapon for JSON or semi-structured messes—especially if you’re coming from a web background. Both are easy to write, but don’t expect magic performance. If you’re processing millions of rows per second, you’ll hit a wall. For most ETL, though? They’re perfect.
WASM is for the performance-obsessed. If you’re building a recommendation engine, crunching numbers for a game, or your boss keeps asking “can this be faster?”—bite the bullet, learn some Rust, and ship a WASM UDF. It’s not easy, but it’s the only way to get close to native speed inside Databend. Most people never need it. If you do, you’ll know.
External UDF Servers are for teams with real infrastructure—when you already have a service, a model, or a matching engine you trust, and you want Databend to tap into it. There’s network overhead, but you get freedom: scale compute separately, deploy in your own stack, and keep business logic where it belongs. Just remember to batch requests, or you’ll be debugging latency charts all day.
We ran a basic benchmark of the average execution time per row for each Databend UDF type using the following SQL. Keep in mind that performance is not the only factor when choosing a UDF type: measure your specific use case, and also weigh qualities like simplicity, ease of debugging, and ease of maintenance before deciding.
SELECT fib((n % 10)::Int32) FROM range(1, 1000000) t(n) IGNORE_RESULT;
Here are the results:
| UDF Type | Avg. Time per Row (µs) | Typical Use Case |
|---|---|---|
| Lambda UDF | - | Simple transforms, prototyping |
| Python UDF | 0.18 | Complex logic, AI integration |
| JavaScript UDF | 2.68 | Lightweight data processing |
| WASM UDF | 0.11 | High-performance computation |
| External UDF | 23.2 | Large-scale, distributed workloads |
Note: External UDF timing includes network overhead; actual compute time is often less.
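If you want a rough local baseline for the pure handler cost, with all engine and serialization overhead stripped away, a quick timeit sketch works. The figures it prints are illustrative only and are not the benchmark above:

```python
import timeit

def fib(n: int) -> int:
    # Same function used in the benchmark query.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

rows = 100_000
# Mimic the benchmark query's workload: fib(n % 10) over many rows.
total = timeit.timeit(lambda: [fib(n % 10) for n in range(rows)], number=1)
print(f"~{total / rows * 1e6:.3f} µs per row (pure Python, no engine overhead)")
```

Comparing this baseline with the in-engine numbers gives a feel for how much of the per-row cost is your logic versus the UDF machinery around it.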
The best teams start simple, refactor when it hurts, and always measure before optimizing. Don’t overthink it. Every UDF type has its place—what matters is using the right one for your problem, not the fanciest one in the docs.
If you’re still unsure, ask around—most Databend users are happy to share what worked (and what didn’t) in production.