Add raw file path as a column to a dataset containing parquet files

cdesouza · February 18, 2025, 5:56pm

I have a dataset to which I’m uploading raw Parquet files. To extract certain pieces of information, it would be useful to have each row’s source file path as a column. I’m able to add this file path column with a PySpark transform like this:

from pyspark.sql import functions as F
from transforms.api import (
    transform, Input, Output, TransformContext, TransformInput, TransformOutput
)


@transform(
    input_data=Input("/path/to/dataset/with/parquet/files"),
    output=Output("/path/to/dataset/with/parquet/file/names"),
)
def add_filename_column(ctx: TransformContext, input_data: TransformInput, output: TransformOutput):
    # Read the Parquet files into a DataFrame
    df = input_data.dataframe()

    # Add the full path of the file each row was read from
    df_with_filename = df.withColumn("source_file_path", F.input_file_name())

    # Extract just the file name (everything after the last "/") from the full path
    df_with_filename = df_with_filename.withColumn(
        "source_filename",
        F.regexp_extract(F.col("source_file_path"), r"/([^/]+)$", 1),
    )

    # Write the result to the output dataset
    output.write_dataframe(df_with_filename)
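
For reference, the file name pattern above can be checked without a Spark session. This is only a minimal sketch in plain Python: the helper name `extract_filename` and the example paths are hypothetical, and Python's `re` module stands in for Spark's Java regex engine, which I'm assuming behaves the same way for this simple pattern. Returning an empty string on no match mirrors what `regexp_extract` does.

```python
import re

# Same pattern as in the transform: capture everything after the last "/".
FILENAME_PATTERN = re.compile(r"/([^/]+)$")


def extract_filename(path: str) -> str:
    """Return the last path segment, or "" when the pattern does not match."""
    match = FILENAME_PATTERN.search(path)
    return match.group(1) if match else ""


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for path in ["/raw/2025/part-0000.parquet", "file.parquet", "/raw/dir/"]:
        print(repr(path), "->", repr(extract_filename(path)))
```

Note that a path with no "/" at all, or one ending in "/", gives an empty `source_filename` with this pattern.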

But is there any way to have this information added to the raw dataset automatically, as is already supported for datasets containing raw CSV files?
