Dialect-Awareness Lineage Design
Problem Statement
As of the v1.3.x release, table-level lineage is production-ready. Column-level lineage, given that no metadata is available to sqllineage, is also as good as it can be. Yet plenty of corner cases remain unsupported, largely due to the long tail of SQL language features and the fragmentation across SQL dialects.
Here are some typical issues:
- How to check whether syntax is valid or not?
- dialect-specific syntax:
  - MSSQL assignment operator
  - Snowflake MERGE statement
  - CURRENT_TIMESTAMP: keyword or function?
  - identifier quote character: double quote or backtick? (see the sketch after this list)
- dialect-specific keywords:
  - reserved keyword vs non-reserved keyword list
  - non-reserved keyword as table name
  - non-reserved keyword as column name
- dialect-specific functions:
  - Presto UNNEST
  - Snowflake GENERATOR
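To make the identifier-quoting issue concrete, here is a minimal sketch using the dialect-aware API introduced later in this document. It assumes the v1.4 LineageRunner(sql, dialect=...) signature, and the exact exception raised for the rejected dialect is an assumption rather than a documented guarantee.

# Minimal sketch: backtick quoting is dialect-specific syntax.
# Assumes the dialect-aware LineageRunner API described later in this document.
from sqllineage.runner import LineageRunner

sql = "INSERT INTO `tgt_tab` SELECT * FROM `src_tab`"

# Backticks are MySQL-style identifier quoting, so the mysql dialect accepts them.
print(LineageRunner(sql, dialect="mysql").target_tables())

# ANSI SQL uses double quotes for identifiers, so the same statement is expected
# to be rejected under the ansi dialect (e.g. with InvalidSyntaxException)
# instead of silently producing wrong lineage.
# LineageRunner(sql, dialect="ansi").target_tables()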
Over the years, we have accumulated several monkey patches and utility functions on top of sqlparse to tweak the AST it generates, either because of incorrect parsing results (e.g. a parenthesized query following INSERT INTO table being parsed as a function) or token groupings that are not yet supported (e.g. window functions). Given the non-validating nature of sqlparse, that is the bitter pill we swallow in exchange for all the convenience it brings.
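For reference, the mis-grouping mentioned above can be observed directly in sqlparse's token tree. The snippet below is only an inspection aid; the exact grouping may differ across sqlparse versions.

# Inspect how sqlparse groups tokens for the "parenthesized query after
# INSERT INTO" case: with sqlparse's non-validating grammar, "tab1 (SELECT ...)"
# tends to come back as a single function-like group rather than a table name
# followed by a subquery.
import sqlparse

statement = sqlparse.parse("INSERT INTO tab1 (SELECT * FROM tab2)")[0]
for token in statement.tokens:
    print(type(token).__name__, repr(str(token)))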
Wishful Thinking
To move forward, we’d want more from the parser so that:
- We know better which syntax, or which dialect-specific features, we support.
- We can easily revise parsing rules to generate the AST we want when we decide to support new features.
- Users can specify the dialect when they use sqllineage, so they know what to expect, and we explicitly tell them when we don't know how to parse the SQL (InvalidSyntaxException) or how to analyze the lineage (UnsupportedStatementException).
Sample call from command line:
sqllineage -f test.sql --dialect=ansi
Sample call from Python API:
from sqllineage.runner import LineageRunner
sql = "select * from dual"
result = LineageRunner(sql, dialect="ansi")
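Building on the sample above, here is a hedged sketch of defensive usage. It assumes the two exceptions are importable from sqllineage.exceptions and that LineageRunner exposes source_tables() and target_tables(); treat the exact import paths as assumptions.

# Defensive usage sketch. The import path sqllineage.exceptions and the
# source_tables()/target_tables() accessors are assumptions for this example.
from sqllineage.exceptions import InvalidSyntaxException, UnsupportedStatementException
from sqllineage.runner import LineageRunner

sql = "INSERT INTO tgt_tab SELECT * FROM src_tab"
try:
    runner = LineageRunner(sql, dialect="ansi")
    print(runner.source_tables())  # tables read by the statement
    print(runner.target_tables())  # tables written by the statement
except InvalidSyntaxException:
    print("sqllineage cannot parse this SQL under the chosen dialect")
except UnsupportedStatementException:
    print("the SQL parses, but lineage analysis for this statement type is unsupported")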
Likewise, in the frontend UI, users get a dropdown to select the dialect they want.
Implementation Plan
The OpenMetadata community contributed an implementation built on the parser underneath sqlfluff. With #326 merged into master, we have a new dialect option. When a real dialect is passed, like mysql, oracle, hive, sparksql, bigquery, snowflake, etc., we leverage sqlfluff to analyze the query. A pseudo dialect, non-validating, is introduced for backward compatibility, falling back to sqlparse as the parser.
We run dual tests using both parsers to make sure the lineage result is exactly the same for every test case (except for a few edge cases); a sketch of this approach follows.
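Here is a hedged sketch of what such a dual test could look like. The pseudo-dialect string "non_validating" and the comparison via source_tables()/target_tables() are assumptions for illustration; the real test suite is more thorough.

# Illustrative dual test: run the same SQL through the sqlparse-backed pseudo
# dialect and a sqlfluff-backed dialect, then compare the table-level lineage.
# The dialect string "non_validating" and the comparison strategy are
# assumptions for this sketch, not a copy of the real test helpers.
from sqllineage.runner import LineageRunner

def assert_same_table_lineage(sql: str, dialect: str = "ansi") -> None:
    legacy = LineageRunner(sql, dialect="non_validating")  # sqlparse fallback
    current = LineageRunner(sql, dialect=dialect)  # sqlfluff-based parsing
    assert set(legacy.source_tables()) == set(current.source_tables())
    assert set(legacy.target_tables()) == set(current.target_tables())

assert_same_table_lineage("INSERT INTO tgt_tab SELECT * FROM src_tab")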
From a code-structure perspective, we refactored the whole code base to introduce a parser interface:
- LineageAnalyzer now accepts a single-statement SQL string (split out by LineageRunner) and returns a StatementLineageHolder as before.
- Each parser implementation sits in the sqllineage.core.parser package, extending LineageAnalyzer and the common Models, and leveraging Holders at different layers (a simplified sketch follows).
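For orientation, here is a simplified, illustrative sketch of that interface. The analyze() signature, the concrete class names, and the import location of StatementLineageHolder are assumptions chosen for illustration and may not match the real code in sqllineage.core.parser exactly.

# Simplified sketch of the parser interface described above; names and
# signatures are illustrative assumptions, not a copy of the real code.
from abc import ABC, abstractmethod

from sqllineage.core.holders import StatementLineageHolder  # assumed import path


class LineageAnalyzer(ABC):
    """Analyze a single SQL statement and produce its lineage."""

    @abstractmethod
    def analyze(self, sql: str) -> StatementLineageHolder:
        """Parse one single-statement SQL string and return its lineage holder."""


class SqlParseLineageAnalyzer(LineageAnalyzer):
    """Fallback implementation backed by the non-validating sqlparse parser."""

    def analyze(self, sql: str) -> StatementLineageHolder:
        ...  # tweak the sqlparse token tree, then extract lineage into the holder


class SqlFluffLineageAnalyzer(LineageAnalyzer):
    """Dialect-aware implementation backed by sqlfluff."""

    def analyze(self, sql: str) -> StatementLineageHolder:
        ...  # parse with the chosen sqlfluff dialect, then extract lineage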
Note
Dialect-awareness lineage is now released with v1.4.0