How Does SQLLineage Work
Basically a sql parser will parse the SQL statement(s) into AST (Abstract Syntax Tree), which according to wikipedia, is a tree representation of the abstract syntactic structure of source code (in our case, SQL code, of course). This is where SQLLineage takes over.
With AST generated, SQLLineage will traverse through this tree, apply some pre-defined rules, so as to extract the part we’re interested in. With that being said, SQLLineage is an AST application, while there’s actually more you can do with AST:
born duty of AST: the starting point for optimization. In compiler world, machine code, or optionally IR (Intermediate Representation), will be generated based on the AST, and then code optimization, resulting in an optimized machine code. In data world, it’s basically the same thing with different words, and different optimization target. AST will be converted to query execution plan for query execution optimization, using strategy like RBO(Rule Based Optimization) or CBO(Cost Based Optimization), so that database/data warehouse query engine can have an optimized physical plan for execution.
structure analysis: IDE leverages this a lot. Scenarios like duplicate code detection, code refactor. Basically this is to analyze the code structure. SQLLineage also falls into this category.
sqlparse is the underlying parser SQLLineage uses to get the AST. It gives a simple example to extract table names, through which you can get a rough idea of how SQLLineage works. At the core is when a token is Keyword and its value is “FROM”, then the next token will either be subquery or table. For subquery, we just recursively calling extract function. For table, there’s a way to get its name.
This is just an over-simplified explanation. In reality, we could easily see
Comment coming after “FROM”, or
subquery without alias (valid syntax in certain SQL dialect) mistakenly parsed as
Parenthesis. These are all
corner cases we should resolve in real world.
Strictly speaking, sqlparse is generating a parse tree instead of an abstract syntax tree. There two terms are often used interchangeably, and indeed they’re similar conceptually. They’re both tree structure in slightly different abstraction layer. In the AST, information like comments and grouping symbols (parenthesis) are not represented. Removing comment doesn’t change the code logic and parenthesis are already implicitly represented by the tree structure.
Some other simple rules in SQLLineage:
Things go after Keyword “FROM”, all kinds of “JOIN” will be source table.
Things go after Keyword “INTO”, “OVERWRITE”, “TABLE”, “VIEW” will be target table. (Though there are exceptions like drop table statement)
Things go after Keyword “With” will be CTE (Common Table Expression).
Things go after Keyword “SELECT” will be column(s).
The rest thing is just tedious work. We collect all kinds of sql, handle various edge cases and make these simple rules robust enough.
That’s it for single statement SQL lineage analysis. For multiple statements SQL, it requires some more extra work to assemble the lineage from single statements.
We choose a DAG based data structure to represent multiple statements SQL lineage. Table/View will be vertex in this graph while a edge means data in source vertex table will contribute to data in target vertex table. In column-level lineage, the vertex is a column. Every single statement lineage result will contain table/column read and table/column write information, which will later be combined into this graph. With this DAG based data structure, it is also very easy to visualize lineage.