I have an MF4 file that can be converted into a pandas DataFrame, and then into a PySpark DataFrame, using the mdfreader library. I uploaded the file as an unstructured dataset and wrote a transform that reads it from the filesystem via hadoop_path, converts it, and creates the DataFrame. Everything works in preview, but when I run the build the Hadoop path changes and the file can no longer be found. Does anyone know how to solve this?
I’m making some assumptions about your code here, but instead of building a path from something like my_input.filesystem().hadoop_path, try listing the files through the filesystem API and processing each one inside a flatMap:
df = (
    my_input.filesystem()
    .files("**/*.mf4")
    .rdd.flatMap(your_function_that_processes_mf4_files)
    .toDF(schema_defined_elsewhere_in_your_code)
)
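In case it helps, here is a rough sketch of what your_function_that_processes_mf4_files could look like. It assumes that each row produced by files() has a path attribute and that the filesystem object exposes open(path, "rb"), and it copies the file to local disk because I'm assuming mdfreader wants a local file name. make_mf4_processor and parse_local_file are hypothetical names of mine, not part of any library: parse_local_file is where your existing mdfreader conversion would go, returning an iterable of row tuples that match your schema.

```python
import os
import shutil
import tempfile


def make_mf4_processor(fs, parse_local_file):
    """Build a function that can be passed to rdd.flatMap.

    fs               -- assumed to behave like my_input.filesystem(), i.e. to
                        offer open(path, "rb") returning a file-like object
    parse_local_file -- hypothetical callable you supply: takes a local file
                        path and returns an iterable of row tuples
    """

    def process(file_status):
        # Copy the dataset file to a temporary local file instead of relying
        # on a Hadoop path that may differ between preview and build.
        with fs.open(file_status.path, "rb") as src:
            with tempfile.NamedTemporaryFile(suffix=".mf4", delete=False) as dst:
                shutil.copyfileobj(src, dst)
                local_path = dst.name
        try:
            # Materialise the rows before the temporary file is removed.
            rows = list(parse_local_file(local_path))
        finally:
            os.remove(local_path)
        return rows

    return process
```

You would then pass make_mf4_processor(my_input.filesystem(), your_mdfreader_parser) to flatMap in the snippet above, where your_mdfreader_parser is whatever conversion code you already have working in preview.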