Filter in data lineage by Schema

bziane · September 9, 2024, 6:45am

Hello,

I have a pipeline that unions more than 300 datasets, all of which should have the same schema.
For the syncs, I set the “Ingestion of timestamps without time zones” option to long.

Sometimes we forget to set this option, and by default the timestamps are ingested as StringType instead of LongType, which causes a schema mismatch in the pipeline.

Is there any way in Data Lineage to filter by schema (ColumnName and DataType) to find out which datasets have been ingested with StringType for datetime columns?

I have tried the options below, but didn’t get the desired results:

Use multi-select in Data Lineage with FREQUENT COLUMNS in the histogram bar.
Use advanced search in Data Lineage (Columns matches regex).

[Screenshot]

This option shows every dataset whose columns match the regex pattern, along with each column’s data type, but I have to scroll through more than 200 datasets to see which columns are StringType, which is not an efficient approach.

cjoyner · September 30, 2024, 6:31pm

Hey @bziane! Thanks for the message. I especially like the way you added screenshots for context (and blurred PII). I’ve run across a project like this before and I’m happy to share how I approached it.

You’re right that Data Lineage doesn’t allow you to search/filter a broad swath of datasets in the way you’ve described. To account for this, I would suggest you structure your pipeline in such a way that raw datasets go through an initial cleaning/preparation step upon ingestion. In this case you’d ensure that your timestamp column reliably outputs in the format you’ll need in your downstream transforms or later in your Ontology layer.
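
To make the cleaning step concrete, here is a rough sketch of the normalization logic in plain Python. It is not Foundry or Spark API; in a real transform you would express the same idea as a column expression. The function name `to_epoch_millis` is hypothetical, and I’m assuming the long values are epoch milliseconds and the string values are ISO-8601 timestamps that should be read as UTC. Adjust both assumptions to match your sources.

```python
from datetime import datetime, timedelta, timezone

# Assumption: long timestamps are milliseconds since the Unix epoch (UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value):
    """Normalize a timestamp that arrived as a long or a string to epoch millis.

    Hypothetical helper for illustration only.
    """
    if value is None:
        # Keep nulls as nulls so the union does not invent data.
        return None
    if isinstance(value, int):
        # Already ingested as a long: pass through unchanged.
        return value
    text = str(value).strip()
    if text.lstrip("-").isdigit():
        # A long that was ingested as a string of digits.
        return int(text)
    # Assumption: any other string is ISO-8601, e.g. "2024-09-09 06:45:00".
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Timestamps without a time zone are treated as UTC here.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division by a timedelta avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)
```

Applying something like this to every raw dataset before the union means a sync that was set up without the long option still produces the same column type as the rest.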

cjoyner · September 30, 2024, 6:33pm

Your initial ask seems like a very interesting feature request, and it’s great that you surfaced it in this forum! We’re always looking to gather feedback from users and love hearing firsthand experiences like this.