Is it possible to build a PySpark transform with an empty input/output so that it can be used with any dataset node in Pipeline Builder?
For reference, I'm using this code to infer the schema on an unstructured dataset (from a file upload), but the transform doesn't build unless I hardcode an input and output dataset. I tried writing it as a Python function instead, but I don't think the schema-inference functionality is supported there. I'm also having trouble finding documentation on whether a UDF is even possible with the type of transform I'm using.
Can you paint a picture of your end-to-end workflow?
I have an unstructured dataset from a file upload that will have new files pushed each day. I want to automatically apply a schema to it, preferably as the first step of a Pipeline Builder pipeline.
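For what it's worth, the kind of logic an "apply schema" step performs can be sketched in plain Python: sample the first rows of an uploaded file and pick the narrowest type that parses every value in each column. This is only an illustration, not the Foundry/Pipeline Builder schema-inference API; `infer_column_type` and `infer_schema` are hypothetical helper names.

```python
# Minimal sketch of schema inference over an uploaded CSV.
# Hypothetical helpers for illustration only, not a Foundry API.
import csv
import io

def infer_column_type(values):
    """Pick the narrowest type that parses every sampled value."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "string"

def infer_schema(raw_csv, sample_rows=100):
    """Return (column, type) pairs inferred from the first `sample_rows` rows."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    return [(name, infer_column_type([r[name] for r in rows]))
            for name in (reader.fieldnames or [])]

csv_text = "id,price,city\n1,9.99,Oslo\n2,4.50,Bergen\n"
print(infer_schema(csv_text))
# → [('id', 'integer'), ('price', 'double'), ('city', 'string')]
```

Sampling only the head of the file keeps the step cheap on large daily uploads, at the cost of mis-typing a column whose later values don't parse; a production inference step would also need null handling and an escape hatch to string.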
Is the schema constantly evolving?
Which system or process is pushing the files? Could that system also call the schema inference API after each upload?
Are you doing snapshot transactions or updates?