Consistent Schema across transactions

Hey, I have a dataset that writes a text file out in a snapshot transaction (with no schema). After manually providing a schema, I’m able to read the dataset as an input downstream no problem. However, every time a new snapshot transaction is written, the schema is lost and I have to add it manually again - does anyone know how to keep the schema persistent across transactions?


If the schema is static, you could use an API call to upload it after you write the file in your transform. The schema value can be copied from the Details → Schema tab of your dataset after you have manually inferred it once. You need to pass ctx into your transform and add foundry-dev-tools as a dependency from conda-forge:

# write your files without spark

# upload schema
from foundry_dev_tools import FoundryContext, JWTTokenProvider

# paste the JSON from the dataset's Details -> Schema tab here
schema = {}

# reuse the transform's auth token to build a foundry-dev-tools context
fdt_ctx = FoundryContext(
    token_provider=JWTTokenProvider(
        host="<<stack>>.palantirfoundry.com",
        jwt=ctx.auth_header.split(" ")[1],
    )
)

# attach the schema to the still-open transaction of the transform output
fdt_dataset = fdt_ctx.get_dataset(
    transform_output.rid, branch=transform_output.branch
)
_ = fdt_ctx.metadata.api_upload_dataset_schema(
    transform_output.rid,
    transaction_rid=fdt_dataset.get_open_transaction()["rid"],
    schema=schema,
    branch=transform_output.branch,
)
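
For context, here is a rough sketch of how that snippet can sit inside a transform so that ctx and the output are available. The output path, output name, and CSV contents below are made up for illustration; only the ctx / transform_output plumbing matters:

from transforms.api import transform, Output

@transform(
    transform_output=Output("/Some/Project/my_text_dataset"),  # hypothetical path
)
def compute(ctx, transform_output):
    # write your files without Spark
    with transform_output.filesystem().open("data.csv", "w") as f:
        f.write("col_a,col_b\n1,2\n")

    # then run the foundry-dev-tools upload shown above; it reuses
    # ctx.auth_header for the token and transform_output.rid / .branch
    # to attach the schema to the still-open transaction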
