I have a requirement to transform a dataset into XML files given a complex XSD. I am trying to use spark-xml for this task. However, I get the error "No module named 'com' found" even though the conda jar 'com.databricks.spark.xml' is included in the build.gradle file. If someone has any idea regarding this, could you please guide me? Also, please let me know if you have any alternate approaches to achieve this in Palantir Foundry. Thanks in advance.
Just in case, could you review how you’re supposed to add a jar in build.gradle to make sure there’s nothing missing there? https://www.palantir.com/docs/foundry/transforms-python/environment-troubleshooting#packages-which-require-both-a-conda-package-and-a-jar
Another option could be to write the files manually, but this might not be adequate for your use case depending on scale. For example:
from transforms.api import transform, Output
import xml.etree.ElementTree as ET


@transform(
    output=Output("output_dataset_path")
)
def write_xml(ctx, output):
    # Create an XML structure based on the XSD
    root = ET.Element("RootElement")  # Replace with your root element name from the XSD
    child = ET.SubElement(root, "ChildElement")  # Replace with your child element name
    child.set("attributeName", "attributeValue")  # Set attributes as defined in the XSD
    child.text = "Element Text"  # Set text content if required

    # Convert the XML structure to a string
    xml_str = ET.tostring(root, encoding='utf8', method='xml').decode()

    # Write the XML string to a file in the output filesystem
    output_fs = output.filesystem()
    with output_fs.open('output.xml', 'w') as f:
        f.write(xml_str)
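If the structure is simple enough, the same approach can be extended to build one element per row of an input dataset. Below is a minimal sketch, assuming a small input dataset with hypothetical columns id and name; it collects rows to the driver, so it is only suitable for modest row counts:

from transforms.api import transform, Input, Output
import xml.etree.ElementTree as ET


@transform(
    source=Input("input_dataset_path"),    # placeholder path
    output=Output("output_dataset_path"),  # placeholder path
)
def write_xml_from_rows(ctx, source, output):
    # Collect rows to the driver; fine for small datasets only.
    rows = source.dataframe().collect()

    root = ET.Element("RootElement")  # Replace with your root element from the XSD
    for row in rows:
        record = ET.SubElement(root, "Record")       # hypothetical row element
        record.set("id", str(row["id"]))             # hypothetical column
        ET.SubElement(record, "Name").text = row["name"]  # hypothetical column

    xml_str = ET.tostring(root, encoding='utf8', method='xml').decode()
    with output.filesystem().open('output.xml', 'w') as f:
        f.write(xml_str)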
I confirmed in my own environment that it is indeed possible to use spark-xml in Python transforms to output XML files to a dataset, so as suggested above, you should double-check your build.gradle (and make sure that you are editing transforms-python/build.gradle, not the top-level build.gradle). For reference, the below is the dependencies block at the bottom of transforms-python/build.gradle in my environment:
dependencies {
    condaJars 'com.databricks:spark-xml_2.13:0.18.0'
}
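One thing to watch: the _2.13 suffix in the artifact name is the Scala binary version, and it needs to match the Scala version of the Spark build in your environment (e.g. spark-xml_2.12 for a Scala 2.12 Spark). A mismatched suffix is a common reason the jar's classes fail to resolve.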
And both of the following code samples worked without issue:
output.write_dataframe(df, output_format="com.databricks.spark.xml")
output.write_dataframe(df, output_format="xml")
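For completeness, here is a minimal sketch of a full transform using the second form. The dataset paths are placeholders, and the rootTag/rowTag option values are assumptions you would replace with the element names from your XSD:

from transforms.api import transform, Input, Output


@transform(
    source=Input("input_dataset_path"),    # placeholder path
    output=Output("output_dataset_path"),  # placeholder path
)
def to_xml(ctx, source, output):
    df = source.dataframe()
    # rootTag and rowTag are spark-xml writer options that control the
    # enclosing element name and the per-row element name, respectively.
    output.write_dataframe(
        df,
        output_format="xml",
        options={"rootTag": "RootElement", "rowTag": "ChildElement"},
    )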