Add raw file path as a column to a dataset containing parquet files

I have a dataset to which I’m uploading raw Parquet files. To extract certain pieces of information, it would be useful to have the file path as a column. I’m able to add this file path column with code like this in a PySpark transform:

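from pyspark.sql import functions as F
from transforms.api import transform, Input, Output, TransformContext, TransformInput, TransformOutput

@transform(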
    input_data=Input("/path/to/dataset/with/parquet/files"),
    output=Output("/path/to/dataset/with/parquet/file/names")
)
def add_filename_column(ctx: TransformContext, input_data: TransformInput, output: TransformOutput):
   
    # Read the Parquet files into a DataFrame
    df = input_data.dataframe()
    
    # Add the input_file_name() as a new column
    df_with_filename = df.withColumn("source_file_path", F.input_file_name())
    
    # Extract just the file name from the full path
    df_with_filename = df_with_filename.withColumn(
        "source_filename", 
        F.regexp_extract(F.col("source_file_path"), r"/([^/]+)$", 1)
    )
    
    # Write the result to the output dataset
    output.write_dataframe(df_with_filename)
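
For reference, the file-name extraction step can be checked outside Spark. This is only a minimal sketch in plain Python using the same pattern; the helper name `extract_filename` and the sample path are made up for illustration, and Spark evaluates the pattern with Java's regex engine, which I'm assuming behaves the same for a pattern this simple.

```python
import re

# Same pattern as in the transform above: capture everything after the last "/"
FILENAME_PATTERN = re.compile(r"/([^/]+)$")


def extract_filename(path: str) -> str:
    """Hypothetical helper mirroring F.regexp_extract(col, pattern, 1).

    Returns an empty string when the pattern does not match, which is
    also what regexp_extract does in that case.
    """
    match = FILENAME_PATTERN.search(path)
    return match.group(1) if match else ""


if __name__ == "__main__":
    # Hypothetical path, not taken from a real dataset
    print(extract_filename("/hypothetical/dir/part-00000.parquet"))
```

Note that a path with no "/" at all gives an empty string rather than the path itself, so the pattern relies on input_file_name() returning a full path.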

But is there any way to have this information added automatically to the raw dataset, as is already supported for datasets containing raw CSV files?
