Hey,
I’ve been trying to ingest Delta tables from an Azure Data Lake Storage source. This works fine when I use Virtual Tables. However, Virtual Tables requires a direct connection, and I’ve now been asked to ingest from a source that requires the use of a Data Connection Agent.
What is the best way to ingest the data and convert it to a dataset in Foundry?
Spark has a built-in reader for the Delta format that you can use to load the Delta files into a Spark DataFrame and then write the result as a Foundry dataset. The basic code for this is:
from transforms.api import transform, Input, Output

@transform(
    out=Output("/path/to/foundry/dataset"),
    raw=Input("/path/to/raw/delta/files"),
)
def process_delta_table(ctx, raw, out):
    # Resolve the Hadoop-compatible path of the files backing the input dataset
    hadoop_path = raw.filesystem().hadoop_path
    # Spark's Delta reader parses the _delta_log and loads the current snapshot
    df = ctx.spark_session.read.format("delta").load(hadoop_path)
    out.write_dataframe(df)
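If the raw files can be updated while your transform runs, Delta's time-travel read options let you pin the read to a specific commit rather than whatever the latest snapshot happens to be. A minimal sketch, assuming the same input/output paths as above (the version number 0 is purely illustrative):

from transforms.api import transform, Input, Output

@transform(
    out=Output("/path/to/foundry/dataset"),
    raw=Input("/path/to/raw/delta/files"),
)
def process_delta_table_pinned(ctx, raw, out):
    hadoop_path = raw.filesystem().hadoop_path
    # "versionAsOf" reads the table as of a specific commit recorded in the
    # _delta_log; "timestampAsOf" is the timestamp-based equivalent.
    df = (
        ctx.spark_session.read.format("delta")
        .option("versionAsOf", 0)  # illustrative commit number
        .load(hadoop_path)
    )
    out.write_dataframe(df)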
Thus, you can ingest your raw Delta tables (the .parquet data files plus the .json/.crc transaction-log files under _delta_log) into Foundry using any of the supported sources (for example, ABFS). Once the files land as raw files in Foundry, you can use the snippet above to convert them to a Spark DataFrame.
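If the read fails with an error along the lines of "not a Delta table", it's worth checking that the transaction log actually landed alongside the data files. A hedged sketch that just lists what the input contains (check_delta_layout and the output path are hypothetical, and show_hidden=True is assumed to be needed because _delta_log is underscore-prefixed, which Hadoop conventions treat as hidden):

from transforms.api import transform, Input, Output

@transform(
    out=Output("/path/to/foundry/listing"),
    raw=Input("/path/to/raw/delta/files"),
)
def check_delta_layout(ctx, raw, out):
    # Recursively list every file in the input, including underscore-prefixed
    # paths such as _delta_log/00000000000000000000.json
    paths = [(f.path, f.size) for f in raw.filesystem().ls(show_hidden=True)]
    df = ctx.spark_session.createDataFrame(paths, ["path", "size"])
    out.write_dataframe(df)

A healthy layout has the parquet data files at the top level and the .json/.crc commit files under _delta_log/.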