
Why We Built Our Own SQL Parser From Scratch: A Rust Implementation Story

DatabendLabs · Sep 10, 2025

Introduction

In late 2023, we had a problem. A customer reported their analytics query was taking 20 seconds to run. The query itself wasn't complex - some CTEs, aggregations, nothing unusual. But when we profiled it, we found something shocking: 13.29 seconds were spent just parsing the SQL. Not executing it. Not optimizing it. Just turning text into an AST.

This was Issue #7225 in our tracker, and it forced us to confront an uncomfortable truth: our SQL parser had become the bottleneck.

We were using sqlparser-rs, a popular Rust SQL parser. It's a solid library used by many projects. But as Databend grew to handle larger queries, more SQL dialects, and demanding performance requirements, we hit its limits. The question wasn't whether to optimize - it was whether to fix sqlparser-rs or build our own. Issue #1218 tracked our parser refactor journey.

This is the story of why we chose to build our own, the challenges we faced, and how Rust's type system guided us to solutions we wouldn't have found otherwise. Along the way, we'll share working code that demonstrates the key concepts.

Part 1: Why sqlparser-rs Wasn't Working

The Breaking Point

Here's a simplified version of the query that broke us:

WITH RECURSIVE hierarchy AS (
    SELECT id, parent_id, name, 1 as level
    FROM departments WHERE parent_id IS NULL
    UNION ALL
    SELECT d.id, d.parent_id, d.name, h.level + 1
    FROM departments d
    JOIN hierarchy h ON d.parent_id = h.id
),
aggregated AS (
    SELECT h.*, COUNT(e.id) as employee_count
    FROM hierarchy h
    LEFT JOIN employees e ON h.id = e.dept_id
    GROUP BY h.id, h.parent_id, h.name, h.level
),
filtered AS (
    SELECT * FROM aggregated WHERE employee_count > 100
)
SELECT * FROM filtered WHERE level <= 5
ORDER BY level, employee_count DESC;

The actual customer query was 200+ lines with 8 CTEs. Here's what we measured:

  • Total execution time: 20 seconds
  • Parse time: 13.29 seconds (66% of total!)
  • Memory during parsing: 5x the query text size
  • Error message when user had a typo: "syntax error"

Three Dealbreakers

1. Limited Dialect Support (Issue #866)

In Issue #866, our team asked: "Is there a better SQL parser?" The core problem wasn't that sqlparser-rs was bad - it's actually a well-designed library. The problem was that we needed something different.

sqlparser-rs is designed to parse standard SQL with some dialect support. But we needed to dynamically support whatever SQL dialect our customers were using. When a customer migrates from MySQL, PostgreSQL, or Hive, they expect their existing queries to work.

The challenge with extending sqlparser-rs:

// To add a new dialect feature in sqlparser-rs:
// 1. Understand the entire parser structure
// 2. Find all places that need modification
// 3. Ensure changes don't break other dialects
// 4. Submit PR and hope it gets accepted
// 5. Wait for release
// 6. If rejected, maintain a fork

Each dialect has unique syntax:

  • MySQL uses backticks for identifiers: `column`
  • PostgreSQL has :: for casting: '123'::integer

sqlparser-rs handles some of these through a dialect system, but adding new dialect-specific features requires modifying core parsing logic. For a project like Databend that needs to support multiple protocols (MySQL wire protocol, PostgreSQL compatibility), we needed more flexibility.

The discussion in Issue #866 led us to explore alternatives like nom parser combinators, which would let us compose different parsing rules dynamically.
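
As a rough illustration of that flexibility, here is a minimal nom sketch (our own example, not Databend's grammar) in which the identifier rule is composed from dialect-specific alternatives instead of being hard-coded in one parser:

use nom::{
    branch::alt,
    bytes::complete::{tag, take_while1},
    sequence::delimited,
    IResult,
};

// MySQL-style identifier: `column`
fn backtick_ident(input: &str) -> IResult<&str, &str> {
    delimited(tag("`"), take_while1(|c| c != '`'), tag("`"))(input)
}

// ANSI/PostgreSQL-style identifier: "column"
fn quoted_ident(input: &str) -> IResult<&str, &str> {
    delimited(tag("\""), take_while1(|c| c != '"'), tag("\""))(input)
}

// The dialect decision becomes a composition choice, not a fork of the parser.
fn identifier(input: &str) -> IResult<&str, &str> {
    alt((backtick_ident, quoted_ident))(input)
}

Supporting another dialect's quoting rule becomes a matter of adding one alternative to identifier rather than patching core parsing logic.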

2. Poor Error Messages

Consider this common scenario: A user writes a complex query with multiple SELECT statements joined by UNION. They get an error:

"syntax error at or near SELECT"

But which SELECT? The query has 5 of them. They spend 30 minutes looking for the problem.

The actual issue might be a missing comma on line 87, nowhere near any SELECT keyword. Traditional parsers like sqlparser-rs often backtrack to the nearest major keyword when encountering errors, losing valuable context about where the actual problem occurred.

Let me show you what happens inside a traditional parser when it hits an error:

-- User's query (simplified)
SELECT
    user_id,
    user_name,
    email
    user_age, -- Missing comma after 'email'
    address
FROM users

The parser sees:

  1. SELECT
  2. user_id,
  3. user_name,
  4. email
  5. user_age ← Wait, expected comma or FROM, got identifier
  6. Backtrack... maybe this isn't a SELECT?
  7. Report: "syntax error at or near SELECT"

The parser was technically correct - there was a syntax error in a SELECT statement. But it threw away the most useful information: WHERE the parser was when it got confused.

Even worse, when users made typos in keywords:

SELCT * FORM users WHEER age > 18

sqlparser-rs would report: "Expected statement, found SELCT". It didn't try to guess what you meant. It didn't show where in your query the problem was. It just failed.

We needed error messages that provided more context:

Parse error at line 5, column 5
Expected ',' or 'FROM' after column name
Found: 'user_age'

4 |     email
5 |     user_age,
  |     ^^^^^^^^
  |
Hint: Did you forget a comma after 'email'?

3. Memory Usage Issues

Large SQL files were causing memory problems. When customers ran migration scripts with thousands of INSERT statements or complex queries with many identifiers, memory usage would spike significantly.

The issue came from string copying. In traditional parsing approaches, strings get copied at multiple stages:

INSERT INTO customer_transactions VALUES ('data1', 'data2', ...)

Each string goes through several copies:

  1. Original input: Kept in memory
  2. Tokenization: Each identifier and literal becomes a new String
  3. AST construction: Strings copied again into AST nodes
  4. Transformations: Additional copies during query processing

For large SQL files, this multiplication effect was significant. A query with thousands of string literals would create thousands of String allocations, each with its own heap allocation overhead.

The problem wasn't unique to sqlparser-rs - most parsers have this issue. But for a database that needs to handle large migration scripts, data warehouse queries, and bulk operations, we needed a more memory-efficient approach.

This led us to explore zero-copy parsing techniques where tokens and AST nodes would reference the original input string rather than creating copies. Rust's lifetime system makes this possible and safe.

Part 2: Better Errors with Furthest-Error-Tracking

The Problem

Here's what happens when a parser tries to understand this typo-filled query:

SELCT name FORM users WHEER age > 18

A naive parser reports the first error:

Error at position 0: Expected SELECT, found SELCT

But that's not the whole story. The parser actually tries multiple possibilities:

Position 0: Try SELECT - failed (found SELCT)
Position 0: Try INSERT - failed (found SELCT)
Position 0: Try UPDATE - failed (found SELCT)
Position 0: Try DELETE - failed (found SELCT)
Position 0: Try WITH - failed (found SELCT)

It should report all these attempts, not just SELECT. But here's where it gets interesting. What if the query was:

SELECT name FORM users WHEER age > 18

Now the parser gets further:

Position 0-5: Parse SELECT - success!
Position 7-10: Parse name - success!
Position 12: Try FROM - failed (found FORM)

The error at position 12 is more valuable than one at position 0 - it shows the parser made progress before failing.

Our Solution: Track the Furthest Error

Think about how a human reads a typo-filled query. You don't stop at the first error - you scan ahead to understand what the writer intended. Our parser needed to do the same.

The key insight: The furthest position where parsing failed is usually the most helpful error to report. Why? Because it means the parser understood everything up to that point. This gives users confidence that the earlier parts of their query are correct.

We track every parse attempt and remember the furthest position reached:

// Conceptual error tracking structure
struct ErrorTracker {
    furthest_pos: usize,   // How far did we get?
    expected: Vec<String>, // What were we looking for?
    found: Option<String>, // What did we actually find?
}

The algorithm is simple: whenever we encounter an error, we check if it's further than our previous furthest error. If it is, this becomes our new "best" error to report. If it's at the same position, we add it to the list of expectations ("Expected SELECT, INSERT, or UPDATE").
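
A minimal sketch of that bookkeeping (illustrative code, not the production parser):

impl ErrorTracker {
    fn new() -> Self {
        ErrorTracker { furthest_pos: 0, expected: Vec::new(), found: None }
    }

    // Called on every failed parse attempt during backtracking.
    fn record(&mut self, pos: usize, expected: &str, found: &str) {
        if pos > self.furthest_pos || self.expected.is_empty() {
            // Further than anything seen so far: this becomes the new "best" error.
            self.furthest_pos = pos;
            self.expected = vec![expected.to_string()];
            self.found = Some(found.to_string());
        } else if pos == self.furthest_pos {
            // Same position: merge expectations ("Expected SELECT, INSERT, or UPDATE").
            self.expected.push(expected.to_string());
        }
        // Errors at earlier positions are dropped - we already got further than that.
    }
}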

The Implementation Approach

In our demo implementation, we used RefCell for interior mutability - allowing error tracking through immutable parser methods. This is one approach to the problem. The actual Databend parser takes a different approach, passing error context through the parsing functions.

The key insight remains the same: track the furthest position where parsing failed, as this gives users the most relevant error location. Whether you use RefCell, pass error context, or use another pattern, the goal is to preserve error information during backtracking.
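
For reference, the RefCell variant from the demo looks roughly like this (a sketch with illustrative names, building on the record method above):

use std::cell::RefCell;

struct Parser<'a> {
    tokens: Vec<&'a str>,
    pos: usize,
    // Interior mutability: even `&self` helpers can record errors while backtracking.
    errors: RefCell<ErrorTracker>,
}

impl<'a> Parser<'a> {
    // Checks (without consuming) whether the current token is `keyword`,
    // recording a candidate error if it is not. Advancing happens elsewhere.
    fn at_keyword(&self, keyword: &str) -> bool {
        match self.tokens.get(self.pos) {
            Some(tok) if tok.eq_ignore_ascii_case(keyword) => true,
            other => {
                let found = other.copied().unwrap_or("end of input");
                self.errors.borrow_mut().record(self.pos, keyword, found);
                false
            }
        }
    }
}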

Adding Suggestions

Once we know what went wrong, can we suggest what the user meant? We use Jaro-Winkler distance - an algorithm that measures how similar two strings are, giving extra weight to matching prefixes (since typos often happen at the end of words).

For example:

  • "SELCT" vs "SELECT" = 0.94 similarity (very likely a typo)
  • "FORM" vs "FROM" = 0.91 similarity (probably meant FROM)
  • "WHEER" vs "WHERE" = 0.93 similarity (missing one letter)

We suggest corrections when similarity exceeds 80% - high enough to catch typos, low enough to avoid nonsense suggestions.
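
One way to wire this up - a sketch using the strsim crate, which may not be what the production parser uses - is to scan the keyword list for the closest match above the threshold:

use strsim::jaro_winkler;

const KEYWORDS: &[&str] = &["SELECT", "INSERT", "UPDATE", "DELETE", "FROM", "WHERE", "GROUP", "ORDER"];

// Returns the most similar keyword if it clears the 80% threshold.
fn suggest_keyword(input: &str) -> Option<&'static str> {
    let upper = input.to_uppercase();
    KEYWORDS
        .iter()
        .map(|kw| (*kw, jaro_winkler(&upper, kw)))
        .filter(|(_, score)| *score > 0.8)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(kw, _)| kw)
}

// suggest_keyword("SELCT") == Some("SELECT")
// suggest_keyword("FORM")  == Some("FROM")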

The Result

Now our error messages are actually helpful:

Parse error at line 1:24
Expected WHERE, found 'WHEER'
Did you mean: WHERE

1 | SELECT name FROM users WHEER age > 18
  |                        ^^^^^

Part 3: A Zero-Copy Architecture for Speed and Efficiency

The Memory Problem

sqlparser-rs (and many parsers) copy strings during tokenization:

// Simplified version of what happens
struct Token {
    text: String, // Owned, allocated string
    position: usize,
}

fn tokenize(input: String) -> Vec<Token> {
    let mut tokens = Vec::new();

    // Find a keyword
    let keyword = &input[0..6]; // "SELECT"
    tokens.push(Token {
        text: keyword.to_string(), // ALLOCATION!
        position: 0,
    });

    // ... more tokens
    tokens
}

// For 100MB SQL:
// - Input: 100MB
// - Tokens: ~100MB in copied strings
// - AST: More copied strings
// Total: 300-400MB

Fighting the Borrow Checker

Our first attempt was naive - just use references:

// Attempt 1: Won't compile!
struct Token<'a> {
    text: &'a str,
    position: usize,
}

fn tokenize(input: String) -> Vec<Token> {
    //      ^^^^^^ input moved here
    let mut tokens = Vec::new();
    tokens.push(Token {
        text: &input[0..6],
        //    ^^^^^^ Error: cannot return reference to local variable
        position: 0,
    });
    tokens
}

The compiler is teaching us: we can't return references to data that's about to be dropped.

The Solution: Let the Caller Own the Data

After fighting for hours, we understood what Rust wanted:

// From src/token.rs
pub struct Token<'a> {
    pub text: &'a str,      // Reference to original input
    pub kind: TokenKind,
    pub span: Range<usize>, // Where in the input
}

pub fn tokenize(input: &str) -> Vec<Token<'_>> {
    //              ^^^^ Borrow, don't own
    let mut tokens = Vec::new();
    let mut lexer = TokenKind::lexer(input);

    while let Some(result) = lexer.next() {
        if let Ok(kind) = result {
            let span = lexer.span();
            let text = &input[span.clone()]; // Zero-copy slice!
            tokens.push(Token { text, kind, span });
        }
    }

    tokens
}

// Usage - caller keeps input alive
let sql = std::fs::read_to_string("query.sql")?;
let tokens = tokenize(&sql); // sql must outlive tokens

The Cascade Effect

Once we had zero-copy tokens, the entire AST became zero-copy:

// From src/ast.rs
pub struct SelectStmt<'a> {
    pub columns: Vec<&'a str>, // Just slices
    pub from: Vec<&'a str>,    // No allocations
    pub where_clause: Option<Box<Expr<'a>>>,
}

pub enum Expr<'a> {
    Column(&'a str),  // Reference
    Literal(&'a str), // Reference
    Binary {
        left: Box<Expr<'a>>, // Box for recursion, but still zero-copy
        op: BinaryOp,
        right: Box<Expr<'a>>,
    },
}

Part 4: The CTE Recursion Trap: Separating Syntax from Semantics

The Stack Overflow

A customer's recursive CTE with deep nesting crashed our parser with stack overflow:

WITH RECURSIVE deep_hierarchy AS (
    SELECT 1 as level, 'root' as name
    UNION ALL
    SELECT level + 1, CONCAT('level_', level + 1)
    FROM deep_hierarchy
    WHERE level < 1000
)
SELECT * FROM deep_hierarchy;

The error:

thread 'main' has overflowed its stack

Our Wrong Approach

We spent two weeks trying to make the parser "understand" recursion. We created phantom tables, tracked recursive depth, and built complex resolution logic. The code grew to over 2,000 lines of intricate tree-walking algorithms.

Why did we do this? Because we thought the parser needed to validate that recursive references were correct. If a CTE references itself, surely the parser needs to know about that relationship?

This was fundamentally wrong thinking.

The Revelation

The breakthrough came from asking a simple question: What is the parser's job?

The parser's job is to understand syntax - the structure of the SQL. Is this a valid SELECT statement? Are the keywords in the right order? Are the parentheses balanced?

The parser's job is NOT to understand semantics - what the SQL means. Does this table exist? Is this column type-compatible? Can this CTE reference itself?

Recursive CTEs are syntactically just regular CTEs with a "RECURSIVE" keyword. The fact that they reference themselves is a semantic property that only matters during execution. The parser just needs to record the structure; the query planner figures out how to execute it.

The Simple Solution

Once we understood the parse vs. analyze distinction, the solution became simple. In Databend's AST:

pub struct With {
    pub recursive: bool, // Just a flag for the RECURSIVE keyword
    pub ctes: Vec<CTE>,  // List of CTEs
}

The parser doesn't need to track whether the query references the CTE name. It doesn't need phantom tables. It just needs to record the structure.
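
To make the contrast concrete, here is a small standalone sketch (our illustrative code with hypothetical names, not Databend's parser) of WITH-clause parsing over a pre-tokenized input. Notice that nothing checks whether a CTE body mentions its own name:

#[derive(Debug)]
struct Cte<'a> {
    name: &'a str,
    body: Vec<&'a str>, // tokens of the parenthesized query; parsed later like any SELECT
}

#[derive(Debug)]
struct WithClause<'a> {
    recursive: bool,
    ctes: Vec<Cte<'a>>,
}

// Returns None on malformed input; real code would record an error instead.
fn parse_with<'a>(tokens: &[&'a str]) -> Option<WithClause<'a>> {
    let mut pos = 0;
    if !tokens.get(pos)?.eq_ignore_ascii_case("WITH") {
        return None;
    }
    pos += 1;

    // RECURSIVE is just a flag - no phantom tables, no self-reference tracking.
    let recursive = tokens.get(pos).map_or(false, |t| t.eq_ignore_ascii_case("RECURSIVE"));
    if recursive {
        pos += 1;
    }

    let mut ctes = Vec::new();
    loop {
        let name = *tokens.get(pos)?;
        pos += 1;
        if !tokens.get(pos)?.eq_ignore_ascii_case("AS") {
            return None;
        }
        pos += 1;
        if *tokens.get(pos)? != "(" {
            return None;
        }
        pos += 1;

        // Capture the body up to the matching ')'. That's all the parser records.
        let mut depth = 1;
        let mut body = Vec::new();
        while depth > 0 {
            let tok = *tokens.get(pos)?;
            pos += 1;
            match tok {
                "(" => depth += 1,
                ")" => depth -= 1,
                _ => {}
            }
            if depth > 0 {
                body.push(tok);
            }
        }
        ctes.push(Cte { name, body });

        if tokens.get(pos).copied() == Some(",") {
            pos += 1;
        } else {
            break;
        }
    }
    Some(WithClause { recursive, ctes })
}

The WITH RECURSIVE query above parses the same way whether the hierarchy is 10 levels deep or 10,000 - depth only matters to the planner and executor.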

Later, during semantic analysis (a separate phase), the query planner will:

  1. See the recursive: true flag
  2. Set up iterative execution
  3. Handle the self-references

But that's not the parser's problem.

The Lesson

We deleted over 2,000 lines of complex, brittle code. The lesson was profound: a parser should be an expert in grammar, not a fortune-teller of intent. By enforcing a clean separation of concerns, we made the parser simpler, more robust, and easier to maintain. This principle—parse syntax, analyze semantics later—is now a cornerstone of our query engine design and helped resolve critical bugs like the binding panics in recursive CTEs (PR #17554).

Part 5: Advanced Implementation Patterns

Precedence as Data, Not Code

Most parsers handle operator precedence (multiplication before addition) by encoding it in the call stack - one function calls another in precedence order. This works but requires 15+ nearly identical functions.

Databend uses a Pratt parser with precedence as data. Instead of encoding precedence in function calls, operators are mapped to precedence values:

// Conceptual precedence mapping (similar to Databend's approach)
BinaryOperator::Or       => Precedence(5),  // Lowest precedence
BinaryOperator::And      => Precedence(10),
BinaryOperator::Equal    => Precedence(20),
BinaryOperator::Plus     => Precedence(50),
BinaryOperator::Multiply => Precedence(60), // Highest precedence

pub fn parse_expression(&mut self) -> ParseResult<Expr<'a>> {
    self.parse_expr_with_precedence(0)
}

fn parse_expr_with_precedence(&mut self, min_prec: u8) -> ParseResult<Expr<'a>> {
    let mut left = self.parse_primary()?;

    while let Some(token) = self.current() {
        if let Some((prec, is_left_assoc)) = get_precedence(token.kind) {
            if prec < min_prec {
                break;
            }

            let op = BinaryOp::from_token(token.kind).unwrap();
            self.advance();

            let next_min_prec = if is_left_assoc { prec + 1 } else { prec };
            let right = self.parse_expr_with_precedence(next_min_prec)?;

            left = Expr::Binary {
                left: Box::new(left),
                op,
                right: Box::new(right),
            };
        } else {
            break;
        }
    }

    Ok(left)
}

This "precedence climbing" algorithm handles all binary operators in one function. Adding a new operator? Just add one line to the precedence table. Compare this to the traditional approach where you'd need to find the right function in the precedence chain and modify it.
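
The table itself can be a single lookup; the get_precedence call in the snippet above might be sketched like this (variant names and values are illustrative, not Databend's exact table):

// Maps an operator token to (binding power, is_left_associative).
// Adding an operator is one new match arm.
fn get_precedence(kind: TokenKind) -> Option<(u8, bool)> {
    Some(match kind {
        TokenKind::Or => (5, true),
        TokenKind::And => (10, true),
        TokenKind::Eq => (20, true),
        TokenKind::Plus | TokenKind::Minus => (50, true),
        TokenKind::Star | TokenKind::Slash => (60, true),
        _ => return None,
    })
}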

The data-driven approach is not only simpler but also faster - fewer function calls means less stack usage.

Property Testing

Unit tests check specific cases you think of. Property tests check thousands of cases you'd never imagine.

Property testing helps find edge cases by generating thousands of random inputs. The key property we test: the parser should never crash, regardless of input. This approach helps uncover issues like:

  • Handling of unusual Unicode characters
  • Deeply nested expressions that might cause stack issues
  • Edge cases in precedence handling

Property testing also helped ensure AST display correctness (PR #15069). The property: "For any valid SQL, if we parse it, convert it back to a string, and parse again, we should get an identical AST." This ensures our parser's output is consistent and reversible.
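
A sketch of the never-crash property using the proptest crate (the demo's actual tests may be organized differently):

use proptest::prelude::*;

proptest! {
    // The parser must never panic, whatever input it receives.
    // Returning an error is fine; crashing is not.
    #[test]
    fn parser_never_panics(input in "\\PC*") {
        let _ = tokenize(&input);
    }

    // The round-trip property is written the same way: generate SQL, parse it,
    // pretty-print the AST, parse again, and assert the two ASTs are equal.
}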

Logic Fuzzing with sqllancer

Beyond testing the parser's structure, we use sqllancer to test the entire query execution pipeline. sqllancer is a powerful logic testing framework that generates syntactically correct but semantically random SQL queries.

Here's how it works:

  1. sqllancer generates a complex SQL query.
  2. We run this query on our database (using our new parser) and a reference database like PostgreSQL or MySQL.
  3. We compare the results. If they differ, we've found a bug.

This approach has been invaluable for finding subtle bugs in our query optimizer and execution engine that unit tests would miss. For example, sqllancer helped us uncover incorrect results for complex JOIN conditions with GROUP BY and window functions, which were parsed correctly but executed improperly. It ensures that our parser not only creates a correct AST but that the AST leads to correct query results, closing the loop on correctness.

Part 6: Real-World Impact


The main improvement we measured was on the complex query that triggered this work:

Query Type                        | Before                     | After                             | Improvement
Complex nested SQL (Issue #7225)  | 13.29s parsing time        | ~4s parsing time                  | 3.3x faster
Memory usage (relative to input)  | Multiple times input size  | ~1.2x input size                  | Significantly reduced
Error messages                    | Generic "syntax error"     | Position + context + suggestions  | More actionable

The improved error messages made debugging SQL much easier. Instead of generic syntax errors, users now get specific positions and suggestions, allowing them to fix issues without external help.

The Maintenance Win

The new parser is actually easier to maintain:

  • Adding a dialect is a configuration change, not a fork
  • New operators go in the precedence table (1 line)
  • Error improvements help us too during development
  • Property tests catch regressions we'd never think to test

What This Teaches About System Design

Building this parser taught us that performance problems are often architecture problems in disguise. We thought we needed a faster parser. What we actually needed was:

  • A different memory model (zero-copy)
  • A different error strategy (furthest tracking)
  • A different abstraction boundary (parse vs. analyze)

Rust didn't just let us implement these ideas safely - its constraints led us to discover them.

Try the Demo

We've extracted the core concepts into a working demo:

git clone https://github.com/zhihanz/sql-parser-demo
cd sql-parser-demo

# See error tracking in action
cargo run

The demo includes:

  • Complete error tracking with furthest position
  • Zero-copy tokenization and AST
  • Expression parsing with precedence climbing
  • CTE parsing without recursion
  • Comprehensive tests

Conclusion: Lessons for Your Parser

Our journey taught us that the most challenging problems often hide architectural truths. If you're building or improving a parser, consider these key takeaways:

  • Treat errors as a first-class feature. The most helpful error is the one that occurs furthest into the parse, as it confirms everything before it is correct.
  • Embrace a zero-copy architecture. By borrowing from the input string instead of allocating new ones, you can drastically reduce memory overhead.
  • Strictly separate syntax from semantics. A parser's only job is to validate grammar. Defer semantic analysis, like validating CTE recursion, to a later stage. This simplifies the parser and prevents bugs.

Here are the practical lessons we learned:

What We'd Do Differently

  • Start with property testing from day one
  • Build the error tracking system before the parser
  • Separate parse and analyze phases from the beginning

Building a parser is a journey of discovering what you actually need versus what you think you need. For us, Rust was the perfect companion on that journey - strict enough to prevent mistakes, flexible enough to express our ideas, and fast enough to handle production load.

