I have an MF4 file that can be converted into a pandas DataFrame (and then a PySpark DataFrame) using the mdfreader library. I loaded the file as an unstructured dataset and wrote a transformation that reads the file from the filesystem using hadoop_path, converts it, and creates the DataFrame. Everything works in preview, but when I launch the build the Hadoop path changes and the file can no longer be found. Does anyone know how to solve this?
I'm making some assumptions about your code here, but instead of using something like my_input.filesystem().hadoop_path, try the following:
df = (
    my_input.filesystem()
    .files("**/*.mf4")
    .rdd.flatMap(your_function_that_processes_mf4_files)
    .toDF(schema_defined_elsewhere_in_your_code)
)
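If it helps, here is a rough end-to-end sketch of that pattern inside a Foundry transform. Treat it as an illustration rather than working code: the dataset paths, the output schema, and the mdfreader calls (Mdf, get_channel_data, get_channel_master) are assumptions about your setup and should be checked against the mdfreader version you have installed. The key point is that files are opened through the input's filesystem() rather than via hadoop_path, and the MF4 bytes are copied to a local temp file because mdfreader wants a path on local disk.

import shutil
import tempfile

import mdfreader
from pyspark.sql import types as T
from transforms.api import transform, Input, Output

# Hypothetical output schema -- replace with the channels you actually need.
SCHEMA = T.StructType([
    T.StructField("channel", T.StringType()),
    T.StructField("timestamp", T.DoubleType()),
    T.StructField("value", T.DoubleType()),
])


@transform(
    out=Output("/Project/datasets/mf4_parsed"),   # placeholder path
    my_input=Input("/Project/datasets/mf4_raw"),  # placeholder path
)
def compute(ctx, out, my_input):
    fs = my_input.filesystem()

    def process_mf4(file_status):
        # Copy the MF4 file from the Foundry filesystem to local disk,
        # since mdfreader expects a local file path.
        with fs.open(file_status.path, "rb") as remote, \
                tempfile.NamedTemporaryFile(suffix=".mf4") as local:
            shutil.copyfileobj(remote, local)
            local.flush()
            mdf = mdfreader.Mdf(local.name)
            # Emit one row per (channel, sample); adapt to your own extraction logic.
            for channel in mdf.keys():
                values = mdf.get_channel_data(channel)
                master = mdf.get_channel_master(channel)
                times = mdf.get_channel_data(master) if master else range(len(values))
                for t, v in zip(times, values):
                    yield (channel, float(t), float(v))

    df = fs.files("**/*.mf4").rdd.flatMap(process_mf4).toDF(SCHEMA)
    out.write_dataframe(df)

Because the processing function runs inside flatMap, each matched file is handled on the executors, so this also scales if the dataset later contains many MF4 files instead of one.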