How to output xlsx (or xls) file extension in Code-Repositories

Chiba · November 29, 2024, 11:17am

I would like to output a file with the xlsx or xls extension in Code Repository.
I can output files with the csv extension.
However, I would like to output in an Excel file.
Is this possible?

cpottiez · November 29, 2024, 12:55pm

Hi,
You can write any type of file to a dataset by using the filesystem API on the output of your transform
https://www.palantir.com/docs/foundry/transforms-python/unstructured-files/#writing-files

Chiba · December 5, 2024, 8:57am

I have already tried that method.
I also tried the code answered by Palantir AIP, but without success.

from transforms.api import transform, Input, Output
import pandas as pd
import openpyxl

@transform(
    output=Output("/path/to/output/xlsx"),
    my_input=Input("/path/to/input")
)
def my_compute_function(output, my_input):
    # Read the input data into a Pandas DataFrame
    df = my_input.dataframe()
    # Perform any necessary transformations
    # For example, filtering or aggregating data
    df_filtered = df[df['column_name'] == 'desired_value']
    # Write the DataFrame to an xlsx file
    output_path = "/path/to/output/file.xlsx"
    df_filtered.to_excel(output_path, index=False, engine='openpyxl')
    # Save the xlsx file to the output dataset
    output.write_dataframe(pd.read_excel(output_path))

sandpiper · December 5, 2024, 12:50pm

That code sample from AIP Assist is almost correct - there are just a few small mistakes in it. I confirmed that the following, slightly modified snippet works:

import shutil
import tempfile

from pyspark.sql import functions as F
from transforms.api import transform, Input, Output


@transform(
    output=Output("/path/to/output/xlsx"),
    flights=Input("/path/to/input/flights/dataset"),
)
def my_compute_function(output, flights):
    # Read the input data into a Pandas DataFrame
    df = flights.dataframe()
    # Perform any necessary transformations
    # For example, filtering or aggregating data
    df_filtered = df.filter(F.col("origin") == F.lit("BDL"))
    # Write the DataFrame to an xlsx file
    with tempfile.NamedTemporaryFile(delete=False) as temp:
        temp.close()
        df_filtered.toPandas().to_excel(temp.name, index=False, engine="openpyxl")
        # Save the xlsx file to the output dataset
        with output.filesystem().open("flights.xlsx", "wb") as output_fp:
            with open(temp.name, "rb") as input_fp:
                shutil.copyfileobj(input_fp, output_fp)

Note that you will need to add both pandas and openpyxl as dependencies for this code to work; see the documentation for details on how to manage library dependencies.

Chiba · December 6, 2024, 9:26am

This code worked successfully.
Thank you for letting me know.

system · December 20, 2024, 9:27am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.