I have a dataset to which I’m uploading raw Parquet files. To extract certain pieces of information, it would be useful to have each row's source file path as a column. I’m able to add this file path column with code like this in a PySpark transform:
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output, TransformContext, TransformInput, TransformOutput

@transform(
    input_data=Input("/path/to/dataset/with/parquet/files"),
    output=Output("/path/to/dataset/with/parquet/file/names")
)
def add_filename_column(ctx: TransformContext, input_data: TransformInput, output: TransformOutput):
    # Read the Parquet files into a DataFrame
    df = input_data.dataframe()
    # Add the full path of the file each row was read from as a new column
    df_with_filename = df.withColumn("source_file_path", F.input_file_name())
    # Extract just the file name (everything after the last "/") from the full path
    df_with_filename = df_with_filename.withColumn(
        "source_filename",
        F.regexp_extract(F.col("source_file_path"), r"/([^/]+)$", 1)
    )
    # Write the result to the output dataset
    output.write_dataframe(df_with_filename)
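For reference, the file name pattern above can be sanity-checked outside Spark. This is only a minimal sketch using Python's `re` module: the helper name `extract_filename` and the sample paths are hypothetical and not part of the transform. It mirrors the behaviour of `regexp_extract`, which returns an empty string when the pattern does not match.

```python
import re

# Same pattern as in the transform: capture everything after the last "/".
FILENAME_PATTERN = re.compile(r"/([^/]+)$")


def extract_filename(path: str) -> str:
    """Return the final path segment, or "" if the pattern does not match."""
    match = FILENAME_PATTERN.search(path)
    return match.group(1) if match else ""


# Hypothetical sample paths, for illustration only.
print(extract_filename("/some/folder/part-0000.parquet"))
print(extract_filename("no_slashes.parquet"))
```

Note that a path containing no "/" yields an empty string rather than the path itself, so `source_filename` would be empty for such rows.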
But is there a way to have this information added to the raw dataset automatically, as we already support for datasets containing raw CSV files?