I have a requirement to transform a dataset into XML files given a complex XSD. I am trying to use spark-xml for this task. However, I get the error "No module named 'com' found" even though the conda jar 'com.databricks.spark.xml' is included in the build.gradle file. If someone has any idea regarding this, could you please guide me? Also, please let me know if you have any alternate approaches to achieve this in Palantir Foundry. Thanks in advance.
Just in case, could you review how you’re supposed to add a jar in build.gradle to make sure there’s nothing missing there? https://www.palantir.com/docs/foundry/transforms-python/environment-troubleshooting#packages-which-require-both-a-conda-package-and-a-jar
Another option could be to write the files manually, but this might not be adequate for your use case depending on scale. For example:
from transforms.api import transform, Output
import xml.etree.ElementTree as ET


@transform(
    output=Output("output_dataset_path")
)
def write_xml(ctx, output):
    # Create an XML structure based on the XSD
    root = ET.Element("RootElement")  # Replace with your root element name from the XSD
    child = ET.SubElement(root, "ChildElement")  # Replace with your child element name
    child.set("attributeName", "attributeValue")  # Set attributes as defined in the XSD
    child.text = "Element Text"  # Set text content if required

    # Convert the XML structure to a string
    xml_str = ET.tostring(root, encoding='utf8', method='xml').decode()

    # Write the XML string to a file in the output filesystem
    output_fs = output.filesystem()
    with output_fs.open('output.xml', 'w') as f:
        f.write(xml_str)
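If the structure is simple enough, the same approach can be extended to build one element per row of an input dataset. Below is a minimal sketch, assuming a small input dataset with hypothetical columns id and name; it collects rows to the driver, so it is only suitable for modest row counts:

from transforms.api import transform, Input, Output
import xml.etree.ElementTree as ET


@transform(
    source=Input("input_dataset_path"),    # placeholder path
    output=Output("output_dataset_path"),  # placeholder path
)
def write_xml_from_rows(ctx, source, output):
    # Collect rows to the driver; fine for small datasets only.
    rows = source.dataframe().collect()

    root = ET.Element("RootElement")  # Replace with your root element from the XSD
    for row in rows:
        record = ET.SubElement(root, "Record")       # hypothetical row element
        record.set("id", str(row["id"]))             # hypothetical column
        ET.SubElement(record, "Name").text = row["name"]  # hypothetical column

    xml_str = ET.tostring(root, encoding='utf8', method='xml').decode()
    with output.filesystem().open('output.xml', 'w') as f:
        f.write(xml_str)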
I confirmed in my own environment that it is indeed possible to use spark-xml in Python transforms to output XML files to a dataset, so as suggested above, you should double-check your build.gradle (and make sure that you are editing transforms-python/build.gradle, not the top-level build.gradle). For reference, the below is the dependencies block at the bottom of transforms-python/build.gradle in my environment:
dependencies {
    condaJars 'com.databricks:spark-xml_2.13:0.18.0'
}
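One thing to watch: the _2.13 suffix in the artifact name is the Scala binary version, and it needs to match the Scala version of the Spark build in your environment (e.g. spark-xml_2.12 for a Scala 2.12 Spark). A mismatched suffix is a common reason the jar's classes fail to resolve.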
And both of the following code samples worked without issue:
output.write_dataframe(df, output_format="com.databricks.spark.xml")
output.write_dataframe(df, output_format="xml")
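For completeness, here is a minimal sketch of a full transform using the second form. The dataset paths are placeholders, and the rootTag/rowTag option values are assumptions you would replace with the element names from your XSD:

from transforms.api import transform, Input, Output


@transform(
    source=Input("input_dataset_path"),    # placeholder path
    output=Output("output_dataset_path"),  # placeholder path
)
def to_xml(ctx, source, output):
    df = source.dataframe()
    # rootTag and rowTag are spark-xml writer options that control the
    # enclosing element name and the per-row element name, respectively.
    output.write_dataframe(
        df,
        output_format="xml",
        options={"rootTag": "RootElement", "rowTag": "ChildElement"},
    )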