Is it possible to build a PySpark transform with an empty input/output so that it can be used with any dataset node in Pipeline Builder?
For reference, I'm using this code to infer the schema on an unstructured dataset (from a file upload), but the transform doesn't build unless I hardcode an input and output dataset. I tried writing it as a Python function instead, but I don't think the schema-inference functionality is supported there. I'm also having trouble finding documentation on whether a UDF is even possible with the type of transform I'm using.
Can you paint a picture of your end-to-end workflow?
I have an unstructured dataset from a file upload that will have new files pushed each day. I want to automatically apply a schema to it, preferably as the first step of a Pipeline Builder pipeline.
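For what it's worth, the kind of logic an "apply schema" step performs can be sketched in plain Python: sample the first rows of an uploaded file and pick the narrowest type that parses every value in each column. This is only an illustration, not the Foundry/Pipeline Builder schema-inference API; `infer_column_type` and `infer_schema` are hypothetical helper names.

```python
# Minimal sketch of schema inference over an uploaded CSV.
# Hypothetical helpers for illustration only, not a Foundry API.
import csv
import io

def infer_column_type(values):
    """Pick the narrowest type that parses every sampled value."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "string"

def infer_schema(raw_csv, sample_rows=100):
    """Return (column, type) pairs inferred from the first `sample_rows` rows."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    return [(name, infer_column_type([r[name] for r in rows]))
            for name in (reader.fieldnames or [])]

csv_text = "id,price,city\n1,9.99,Oslo\n2,4.50,Bergen\n"
print(infer_schema(csv_text))
# → [('id', 'integer'), ('price', 'double'), ('city', 'string')]
```

Sampling only the head of the file keeps the step cheap on large daily uploads, at the cost of mis-typing a column whose later values don't parse; a production inference step would also need null handling and an escape hatch to string.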
Is the schema constantly evolving?
Which system or process is pushing the files? Could that system also call the schema inference API after each upload?
Are you doing snapshot transactions or updates?