How can I create a dataset by parsing a file loaded as a dataset?

lucapwc · July 5, 2024, 8:48am

I have an mf4 file that can be converted into a pandas DataFrame, and then into a PySpark DataFrame, using the mdfreader library. I loaded the file as an unstructured dataset and wrote a transform that reads it from the filesystem via hadoop_path, converts it, and creates the DataFrame. Everything works in preview, but when I run the build the Hadoop path changes and the file is no longer found. Does anyone know how to solve this?

taylor · July 5, 2024, 6:08pm

I’m making some assumptions about your code here, but instead of using something like my_input.filesystem().hadoop_path, try the following:

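df = (
    my_input.filesystem()
    # files() returns a DataFrame of file metadata (one row per matching file)
    .files("**/*.mf4")
    # your function is called once per file row and returns an iterable of output rows
    .rdd.flatMap(your_function_that_processes_mf4_files)
    # turn the resulting RDD back into a DataFrame with an explicit schema
    .toDF(schema_defined_elsewhere_in_your_code)
)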
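
For the function passed to flatMap, here is a minimal sketch of one way it could be shaped, assuming your parser needs a file on local disk (as a file-name-based reader would). The idea is to read the bytes through the dataset's filesystem object on the executor instead of relying on a resolved hadoop_path. The names make_mf4_processor, open_file, parse_local_file, fake_open and fake_parse are all hypothetical, and the (path, channel, value) row shape is only an illustration; the real parsing would come from your mdfreader code.

```python
import io
import os
import shutil
import tempfile
from types import SimpleNamespace


def make_mf4_processor(open_file, parse_local_file):
    """Return a function suitable for rdd.flatMap.

    open_file(path): assumed to wrap my_input.filesystem().open(path, "rb").
    parse_local_file(local_path): hypothetical parser (for example one built
    on mdfreader) that yields (channel, value) pairs from a local file.
    """

    def process(file_status):
        # Rows produced by files() are assumed to expose the
        # dataset-relative path as `.path`.
        with tempfile.TemporaryDirectory() as tmp_dir:
            local_path = os.path.join(tmp_dir, "input.mf4")
            # Copy the bytes to local disk so a parser that only accepts
            # a file name can read them.
            with open_file(file_status.path) as src, open(local_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            for channel, value in parse_local_file(local_path):
                yield (file_status.path, channel, value)

    return process


# Stand-ins so the sketch runs outside Foundry; neither is a real API.
def fake_open(path):
    return io.BytesIO(b"speed=12.5\nrpm=3000\n")


def fake_parse(local_path):
    with open(local_path, "rb") as f:
        for line in f.read().decode().splitlines():
            name, raw = line.split("=")
            yield name, float(raw)


process = make_mf4_processor(fake_open, fake_parse)
rows = list(process(SimpleNamespace(path="logs/run1.mf4")))
```

In the transform itself you would presumably keep fs = my_input.filesystem() in a variable, pass something like lambda p: fs.open(p, "rb") as open_file, and hand the returned function to flatMap in place of your_function_that_processes_mf4_files.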