Column-Level Lineage Design
Key Design Principles
sqllineage will stay primarily as a static code analysis tool, so we must tolerate information missing when doing column-level lineage. Maybe somewhere in the future, we can provide some kind of plugin mechanism to register metadata as a supplement to refine the lineage result, but in no way will we depend solely on metadata.
Don’t create column-level lineage DAG to be a separate graph from table-level DAG. There should be one unified DAG. Either 1) we build a DAG to the granularity of column so with some kind of transformation, we can derive table-level DAG from it. To use an analogy of relational data, it’s like building a detail table, with the ability to aggregate up to a summary table. Or 2) we build a property graph (see JanusGraph docs for reference), with Table and Column as two kinds of Vertex, and two type of edges, one for column to table relationship mapping, one for column-level as well as table-level lineage.
Questions Before Implementation
What’s the data structure to represent Column-Level Lineage? Currently we’re using
networkxto represent Table-Level Lineage, with table as vertex and table-level lineage as edge, which is pretty straight forward. After changing to Column-Level, what’s the plan?
Answer: See design principle for two possible data structure. I would prefer property graph for this.
How do we deal with
select *? In following case, we don’t know which columns are in tab2
INSERT OVERWRITE tab1 SELECT * FROM tab2;
Answer: Add an virtual column * for each Table as a special column. So the lineage is tab2.* -> tab1.*
How do we deal with column without table/alias prefix in case of join? In following case, we don’t know whether col2 is coming from tab2 or tab3
INSERT OVERWRITE tab1 SELECT col2 FROM tab2 JOIN tab3 ON tab2.col1 = tab3.col1
Answer: dd two edges, tab2.col2 -> tab1.col2, tab3.col2 -> tab1.col2. Meanwhile, these two edges should be marked so that later in visualization, they can be drawn differently, like in dot line.
With 6308b50 splitting the logic into different handlers, we now have SourceHandler, TargetHandler and CTEHandler to handle table level lineage. They’re subclass of NextTokenBaseHandler, an abstract class to address an extract pattern when a specified token indicates we should extract something from next token.
A newly introduced ColumnHandler will also be based on NextTokenBaseHandler (column token followed by keyword SELECT) plus a end-of-(sub)query hook. Because only until end of query could we know all the source tables. If we don’t have all the source tables and their alias, we can’t assign the column to table correctly.
Steps for Full Implementation
Atomic column logic handling: alias, case when, function, expression, etc.
Subquery recognition and lineage transition from subquery to statement
Column to table assignment in case of table join
Assemble Statement Level lineage into multiple statements DAG.
Column-Level lineage is now released with v1.3.0