Hey,
I’ve been trying to ingest Delta tables from an Azure Data Lake Storage source. This works fine when I use Virtual Tables. However, Virtual Tables requires a direct connection, and I’ve now been asked to ingest from a source that requires the use of a Data Connection Agent.
What is the best way to ingest the data and convert it to a dataset in Foundry?
Spark has a built-in reader for the Delta format that you can use to load the Delta files into a Spark DataFrame and then write the result as a Foundry dataset. The basic code for this is:
from transforms.api import transform, Input, Output

@transform(
    out=Output("/path/to/foundry/dataset"),
    raw=Input("/path/to/raw/delta/files"),
)
def process_delta_table(ctx, raw, out):
    # Resolve the Hadoop-compatible path of the files backing the input dataset
    hadoop_path = raw.filesystem().hadoop_path
    # Spark's Delta reader parses the _delta_log and loads the current snapshot
    df = ctx.spark_session.read.format("delta").load(hadoop_path)
    out.write_dataframe(df)
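If the raw files can be updated while your transform runs, Delta's time-travel read options let you pin the read to a specific commit rather than whatever the latest snapshot happens to be. A minimal sketch, assuming the same input/output paths as above (the version number 0 is purely illustrative):

from transforms.api import transform, Input, Output

@transform(
    out=Output("/path/to/foundry/dataset"),
    raw=Input("/path/to/raw/delta/files"),
)
def process_delta_table_pinned(ctx, raw, out):
    hadoop_path = raw.filesystem().hadoop_path
    # "versionAsOf" reads the table as of a specific commit recorded in the
    # _delta_log; "timestampAsOf" is the timestamp-based equivalent.
    df = (
        ctx.spark_session.read.format("delta")
        .option("versionAsOf", 0)  # illustrative commit number
        .load(hadoop_path)
    )
    out.write_dataframe(df)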
Thus, you can ingest your raw Delta tables (the .parquet data files plus the .json/.crc transaction-log files under _delta_log) into Foundry using any of the supported sources (for example, ABFS). Once the files land as raw files in Foundry, you can use the snippet above to convert them to a Spark DataFrame.
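If the read fails with an error along the lines of "not a Delta table", it's worth checking that the transaction log actually landed alongside the data files. A hedged sketch that just lists what the input contains (check_delta_layout and the output path are hypothetical, and show_hidden=True is assumed to be needed because _delta_log is underscore-prefixed, which Hadoop conventions treat as hidden):

from transforms.api import transform, Input, Output

@transform(
    out=Output("/path/to/foundry/listing"),
    raw=Input("/path/to/raw/delta/files"),
)
def check_delta_layout(ctx, raw, out):
    # Recursively list every file in the input, including underscore-prefixed
    # paths such as _delta_log/00000000000000000000.json
    paths = [(f.path, f.size) for f in raw.filesystem().ls(show_hidden=True)]
    df = ctx.spark_session.createDataFrame(paths, ["path", "size"])
    out.write_dataframe(df)

A healthy layout has the parquet data files at the top level and the .json/.crc commit files under _delta_log/.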