Getting Input File Metadata in Pipeline Builder for non-CSV Inputs

Context: As explained at https://www.palantir.com/docs/foundry/data-integration/csv-parsing/#textdataframereader-options, for CSV datasets, it is possible to get input file metadata such as file path and imported timestamp via schema options. Additionally, in a Code Repository, it is possible to use the combination of pyspark.sql.functions.input_file_name() and a join with the dataframe returned from the files() method of a Filesystem object to retrieve this information for an input dataset of any file type.
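For reference, the join logic behind that Code Repository approach can be sketched in plain Python. This is a minimal, hypothetical illustration of the pattern only; in an actual transform you would call `pyspark.sql.functions.input_file_name()` on the data rows and join against the DataFrame returned by `FileSystem.files()`. The column names `path` and `modified` are assumptions for illustration:

```python
# Sketch of the "tag each row with its source file, then join against a
# file listing" pattern used in Code Repositories (pure Python, no Spark).

# Rows as they would look after input_file_name(): each row carries
# the full path of the file it was read from.
rows = [
    {"id": 1, "path": "spark/part-0000.parquet"},
    {"id": 2, "path": "spark/part-0001.parquet"},
]

# A file listing as FileSystem.files() might return it: one entry per
# file with its path and imported/modified timestamp (assumed columns).
file_listing = [
    {"path": "spark/part-0000.parquet", "modified": "2024-01-01T00:00:00Z"},
    {"path": "spark/part-0001.parquet", "modified": "2024-01-02T00:00:00Z"},
]

# Join on "path" to attach the file timestamp to every data row.
listing_by_path = {f["path"]: f for f in file_listing}
enriched = [
    {**row, "modified": listing_by_path[row["path"]]["modified"]}
    for row in rows
]
```

In Spark this last step would be an ordinary equi-join on the path column rather than a dictionary lookup.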

Question: Is there any way to get file metadata, such as file name and imported timestamp, for a parquet input dataset in Pipeline Builder, or is it necessary to use the Code Repository-based method described above?


Hey @sandpiper, in Pipeline Builder we actually just added the functionality to get the file path (you should see it in your environment in a few days), and we're currently working on adding the timestamp.


Hello, can anyone point to the exact transformation in Pipeline Builder that does this?

It’s called “Extract file metadata from dataset as rows”

This is only available if the files are imported as xlsx (or possibly another raw file type), but not as CSV. For CSV, a manual workaround is to open the dataset after importing it in Pipeline Builder, click Edit schema, go to Additional columns, add the file path column, and then use a split or regex transform to keep only the filename rather than the full file path.
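The filename-extraction step in that workaround boils down to taking the last path segment. A generic regex sketch (plain Python for illustration, not Pipeline Builder expression syntax, though the same pattern should carry over to a regex extract transform):

```python
import re

def filename_from_path(full_path: str) -> str:
    """Return only the final path segment, e.g. 'data.csv' from 'foo/bar/data.csv'."""
    match = re.search(r"[^/]+$", full_path)
    return match.group(0) if match else full_path

print(filename_from_path("raw/2024/01/export-data.csv"))  # export-data.csv
```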

"Extract file metadata from dataset as rows" should also be supported for CSV files. If you aren't able to see it (make sure your input is still the list of raw CSV files), then let us know and we can help debug.

Thank you @helenq, but this transformation is only available for raw files, not for structured imported datasets. Anyway, this option is good enough for now.