I would like to output a file with the xlsx or xls extension in Code Repository.
I can output files with the csv extension.
However, I would like to output in an Excel file.
Is this possible?
Hi,
You can write any type of file to a dataset by using the filesystem API on the output of your transform
https://www.palantir.com/docs/foundry/transforms-python/unstructured-files/#writing-files
I have already tried that method.
I also tried the code answered by Palantir AIP, but without success.
from transforms.api import transform, Input, Output
import pandas as pd
import openpyxl
@transform(
output=Output("/path/to/output/xlsx"),
my_input=Input("/path/to/input")
)
def my_compute_function(output, my_input):
# Read the input data into a Pandas DataFrame
df = my_input.dataframe()
# Perform any necessary transformations
# For example, filtering or aggregating data
df_filtered = df[df['column_name'] == 'desired_value']
# Write the DataFrame to an xlsx file
output_path = "/path/to/output/file.xlsx"
df_filtered.to_excel(output_path, index=False, engine='openpyxl')
# Save the xlsx file to the output dataset
output.write_dataframe(pd.read_excel(output_path))
That code sample from AIP Assist is almost correct - there are just a few small mistakes in it. I confirmed that the following, slightly modified snippet works:
import shutil
import tempfile
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
@transform(
output=Output("/path/to/output/xlsx"),
flights=Input("/path/to/input/flights/dataset"),
)
def my_compute_function(output, flights):
# Read the input data into a Pandas DataFrame
df = flights.dataframe()
# Perform any necessary transformations
# For example, filtering or aggregating data
df_filtered = df.filter(F.col("origin") == F.lit("BDL"))
# Write the DataFrame to an xlsx file
with tempfile.NamedTemporaryFile(delete=False) as temp:
temp.close()
df_filtered.toPandas().to_excel(temp.name, index=False, engine="openpyxl")
# Save the xlsx file to the output dataset
with output.filesystem().open("flights.xlsx", "wb") as output_fp:
with open(temp.name, "rb") as input_fp:
shutil.copyfileobj(input_fp, output_fp)
Note that you will need to add both pandas
and openpyxl
as dependencies for this code to work; see the documentation for details on how to manage library dependencies.
This code worked successfully.
Thank you for letting me know.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.