Hey, I have a dataset that writes out a text file (with no schema) in a snapshot transaction. After manually providing a schema, I’m able to read the dataset as an input downstream no problemo. However, every time a new snapshot transaction is created, the schema is lost and I have to add it again manually. Does anyone know how to keep the schema persistent across transactions?
If the schema is static, you could upload it with an API call after writing the file in your transform. The value to use for schema is shown in the Details → Schema tab of your dataset once you have manually inferred the schema. You need to pass ctx into your transform and add foundry-dev-tools as a dependency from conda-forge:
# write your files without spark

# then upload the schema to the open transaction via foundry-dev-tools
from foundry_dev_tools import FoundryContext, JWTTokenProvider

# paste the value from the Details -> Schema tab of your dataset here
schema = {}

# authenticate with the transform's own token (ctx.auth_header is "Bearer <token>")
fdt_ctx = FoundryContext(
    token_provider=JWTTokenProvider(
        host="<<stack>>.palantirfoundry.com",
        jwt=ctx.auth_header.split(" ")[1],
    )
)

# look up the dataset and its currently open transaction
fdt_dataset = fdt_ctx.get_dataset(
    transform_output.rid, branch=transform_output.branch
)

# attach the schema to the open transaction
_ = fdt_ctx.metadata.api_upload_dataset_schema(
    transform_output.rid,
    transaction_rid=fdt_dataset.get_open_transaction()["rid"],
    schema=schema,
    branch=transform_output.branch,
)