I would like to output a file with the xlsx or xls extension in Code Repository.
I can output files with the csv extension.
However, I would like to output in an Excel file.
Is this possible?
Hi,
You can write any type of file to a dataset by using the filesystem API on the output of your transform
https://www.palantir.com/docs/foundry/transforms-python/unstructured-files/#writing-files
I have already tried that method.
I also tried the code answered by Palantir AIP, but without success.
from transforms.api import transform, Input, Output
import pandas as pd
import openpyxl
@transform(
output=Output("/path/to/output/xlsx"),
my_input=Input("/path/to/input")
)
def my_compute_function(output, my_input):
# Read the input data into a Pandas DataFrame
df = my_input.dataframe()
# Perform any necessary transformations
# For example, filtering or aggregating data
df_filtered = df[df['column_name'] == 'desired_value']
# Write the DataFrame to an xlsx file
output_path = "/path/to/output/file.xlsx"
df_filtered.to_excel(output_path, index=False, engine='openpyxl')
# Save the xlsx file to the output dataset
output.write_dataframe(pd.read_excel(output_path))
That code sample from AIP Assist is almost correct - there are just a few small mistakes in it. I confirmed that the following, slightly modified snippet works:
import shutil
import tempfile
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
@transform(
output=Output("/path/to/output/xlsx"),
flights=Input("/path/to/input/flights/dataset"),
)
def my_compute_function(output, flights):
# Read the input data into a Pandas DataFrame
df = flights.dataframe()
# Perform any necessary transformations
# For example, filtering or aggregating data
df_filtered = df.filter(F.col("origin") == F.lit("BDL"))
# Write the DataFrame to an xlsx file
with tempfile.NamedTemporaryFile(delete=False) as temp:
temp.close()
df_filtered.toPandas().to_excel(temp.name, index=False, engine="openpyxl")
# Save the xlsx file to the output dataset
with output.filesystem().open("flights.xlsx", "wb") as output_fp:
with open(temp.name, "rb") as input_fp:
shutil.copyfileobj(input_fp, output_fp)
Note that you will need to add both pandas
and openpyxl
as dependencies for this code to work; see the documentation for details on how to manage library dependencies.
2 Likes
This code worked successfully.
Thank you for letting me know.