DOs and DONTs
SQLLineage is a static SQL code lineage analysis tool, which means we will not try to execute the SQL code against any kinds of server. Instead, we will just look at the code as text, parse it and give back the result. No client/server interaction.
SQLLineage will continue to support most commonly used SQL system. Make best effort to be compatible.
SQLLineage will stay mainly as a command line tool, as well as a Python utils library.
Column-level lineage will not be 100% accurate because that would require metadata information. However, there’s no unified metadata service for all kinds of SQL systems. For the moment, in column-level lineage, column-to-table resolution is conducted in a best-effort way, meaning we only provide possible table candidates for situation like
select col from tab1 join tab2.
Likewise for Partition-level lineage. Until we find a way to not involve metadata service, we will not go for this.
100% accurate Column-level lineage is still do-able if we can provide some kind of a plugin system for user to register their metadata instead of us maintaining it. Let’s see what will happen in future versions.
Static Code Analysis Approach Explained
This is the typical flow of how SQL is parsed and executed from compiling perspective:
Lexical Analysis: transform SQL string into token stream.
Parsing: transform token stream into unresolved AST.
Semantic Analysis: annotate unresolved AST with real table/column reference using database catalog.
Optimize with Intermediate Representation: transform AST to execution plan, and optimize with predicate pushdown, column pruning, boolean simplification, limit combination, join strategy, etc.
Code Gen: This only makes sense in distributed SQL system. Generating primitive on underlying computing framework, like MapReduce Job, or RDD operation, based on the optimized execution plan.
Semantic analysis is a compiler term. In data world, it’s often referred to as catalog resolution. For some systems, unresolved AST is transformed to unsolved execution plan first. With catalog resolution, a resolved execution plan is then ready for later optimization phases.
SQLLineage is working on the abstraction layer of unresolved AST, right after parsing phase is done. The good side is that
SQLLineage is dialect-agnostic and able to function without database catalog. The bad side is of course the inaccuracy
on column-level lineage when we don’t know what’s behind
The alternative way is starting the lineage analysis on the abstraction layer of resolved AST, or execution plan. That ties lineage analysis tightly with the SQL system so it won’t function without a live connection to database. But that will give user an accurate result and the source code of database can be used to save a lot of coding effort.
To combine the good side of both approaches, in the long term, SQLLineage will introduce an optional resolution phase, followed by the current unresolved lineage result, where user can register metadata information in a programmatic way.