I set up an Advent of Code project for my org. Here is what I learnt

This is a copy of my blog post at https://sites.google.com/view/raybellwaves/blog/advent-of-code-on-palantir

For this year's Advent of Code (AOC) I went OTT (over the top) setting up my dev environment! AOC is always a good chance to learn something new. This year I'm using Palantir Foundry. The puzzles are small enough that you can solve them on your laptop (or even on mechanical devices!), but I want to learn about automation on Palantir and ways to organize/simplify a project for many users.

The goal is: how can I provide a dev environment for myself and my colleagues with minimal steps?

I automated as much as I could in a couple of hours. I didn't want to be the embodiment of https://xkcd.com/1319/ so there are still some things left on the table, and maybe I'll come back to this after chatting with colleagues and Palantir engineers.

Palantir has two types of "Datasets": a tabular file or a folder of generic files. AOC puzzle inputs are text files, so I opted for a folder of files into which the raw text files are uploaded. The files need a naming convention to make life easier for everyone involved (I wonder if you could add checks on files added to the folder, such as requiring a .txt suffix and a name matching the convention; a sketch of such a check is below).

For AOC, each day there is a sample input for you to work on and try to get the expected answer. These files will be named “sample_day_DD_part_N.txt”. User files should be named “firstname_lastname_day_DD_input.txt”. These files are then dragged into the UI.
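A sketch of what such a check could look like (this isn't a built-in Foundry feature, just a hypothetical helper you could run in a transform or before uploading) is a regex against the two naming conventions:

import re

# Hypothetical helper: validate that a file follows one of the two naming
# conventions described above
VALID_NAME = re.compile(r"^(sample_day_\d{2}_part_\d+|[a-z]+_[a-z]+_day_\d{2}_input)\.txt$")


def is_valid_puzzle_file(file_name: str) -> bool:
    return VALID_NAME.match(file_name) is not None


assert is_valid_puzzle_file("sample_day_01_part_1.txt")
assert is_valid_puzzle_file("ray_bell_day_01_input.txt")
assert not is_valid_puzzle_file("notes.csv")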

In theory there is no need to process these files, as you can work with the text files in any language. However, to easily access them across all Palantir tools we are going to store them as tables, with each row being one line of the text file as a string.

I don't need full-blown Spark for this, so I'm just using Polars. The code to save a text file as a dataframe is:

import polars as pl

with open("file.txt", "rb") as f:
    lines = f.readlines()
df = pl.DataFrame({"input": [line.strip().decode() for line in lines]})

We want to apply this to all files in the folder so we’ll use a transform. We also need to use a transform generator to run these in parallel:

import polars as pl
from transforms.api import Input, Output, lightweight, transform


def transform_generator(file_names: list):
    transforms = []
    for file_name in file_names:

        @lightweight
        @transform(
            my_input=Input("ri.foundry.main.dataset.XXX"),
            my_output=Output(f"/FOLDER/{file_name}"),
        )
        def text_file_to_table(my_input, my_output, file_name=file_name):
            # Bind file_name as a default argument so each generated transform
            # keeps its own file name
            try:
                _file = next(my_input.filesystem().ls(glob=f"{file_name}.txt"), None)
                with my_input.filesystem().open(_file.path, "rb") as f:
                    lines = f.readlines()
                my_output.write_table(
                    pl.DataFrame({"input": [line.strip().decode() for line in lines]})
                )
            except Exception as e:
                print(f"Error processing {file_name}: {str(e)}")

        transforms.append(text_file_to_table)
    return transforms


file_names = ["...", "..."]
TRANSFORMS = transform_generator(file_names)

(it would be nice to have this build triggered automatically when a file is uploaded)

Next we will access these datasets in a Jupyter notebook. Palantir keeps things as separate as possible by default for security and auditability. To access Datasets in a Jupyter notebook you have to click an import button. When you do this, it creates a file called .foundry/aliases.yml with the RIDs (Resource Identifiers). Ain't nobody got time to click import 25 times, so there is a way to automate this (it would be nice if there was an "import all datasets in a project" button).

First you need to create a dictionary which contains the datasets and their RIDs. I stumbled a bit here:

  • I was getting permission errors using the Foundry Platform SDK.
  • Jupyter notebooks have some environment variables such as TOKEN and HOSTNAME, but I couldn't use these to authenticate. I'm not sure if I need to set up an egress policy for this.

I worked out a solution using foundry-dev-tools:

from foundry_dev_tools import FoundryContext

ctx = FoundryContext()
# folder_ri is the RID of the Compass folder that contains the datasets
children = list(ctx.compass.get_child_objects_of_folder(folder_ri))
data = {f["name"]: {"rid": f["rid"]} for f in children}

(it would be nice to automate this on startup in a Jupyter notebook).

This dictionary contains all the processed datasets, so it needs to be filtered down to just the current user's inputs, the sample data and the raw dataset.

import os

user = os.environ["GIT_AUTHOR_NAME"].lower().replace(" ", "_")
filter_terms = [user, "sample", "raw_puzzle_input_text_files"]
filtered_data = {k: v for k, v in data.items() if any(term in k.lower() for term in filter_terms)}

Now the filtered entries can be saved to the expected YAML file:

import yaml

with open('.foundry/aliases.yml', 'w') as f:
    yaml.dump(filtered_data, f, default_flow_style=False)
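The resulting .foundry/aliases.yml then looks something like this (the dataset names follow the conventions above; the RIDs are placeholders):

ray_bell_day_01_input:
  rid: ri.foundry.main.dataset.XXX
raw_puzzle_input_text_files:
  rid: ri.foundry.main.dataset.YYY
sample_day_01_part_1:
  rid: ri.foundry.main.dataset.ZZZ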

This logic is saved in a Python file (it would be nice to run this on startup for any new Jupyter workspace in a project).
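Putting the pieces together, import_datasets.py is roughly the concatenation of the snippets above (the folder RID below is a placeholder you would swap for your project folder's RID):

import os

import yaml
from foundry_dev_tools import FoundryContext

# Placeholder: RID of the Compass folder that holds the processed datasets
folder_ri = "ri.compass.main.folder.XXX"

ctx = FoundryContext()
children = list(ctx.compass.get_child_objects_of_folder(folder_ri))
data = {f["name"]: {"rid": f["rid"]} for f in children}

# Keep only the current user's inputs, the sample data and the raw dataset
user = os.environ["GIT_AUTHOR_NAME"].lower().replace(" ", "_")
filter_terms = [user, "sample", "raw_puzzle_input_text_files"]
filtered_data = {k: v for k, v in data.items() if any(term in k.lower() for term in filter_terms)}

with open(".foundry/aliases.yml", "w") as f:
    yaml.dump(filtered_data, f, default_flow_style=False)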

A user can open a notebook and run

%run /home/user/repo/import_datasets.py

They can then work with the data as:

from foundry.transforms import Dataset
# Read the processed data in a format of your choice
file = "sample_day_01_part_1"
arrow_table = Dataset.get(file).read_table(format="arrow")
pandas_df = Dataset.get(file).read_table(format="pandas")
try:
    polars_df = Dataset.get(file).read_table(format="polars")
except ModuleNotFoundError:
    polars_df = None

# Create your own logic to parse the text files
local_file = (
    Dataset("raw_puzzle_input_text_files")
    .files()
    .filter(lambda f: f.path == f"{file}.txt")
    .download()
)

with open(local_file[f"{file}.txt"], "r") as f:
    lines = f.readlines()

Hi @raybellwaves, thank you for the feedback.

There is a button to “import all datasets in a project”: we support multi-select when importing datasets. So if all your files are in the same Compass folder, you can simply select them all in the “Import dataset” dialog to register and import them all at once, instead of using foundry-dev-tools.
Relying on the Compass folder or dataset name is not recommended, because dataset names and Compass folders are mutable so your code would stop working if someone renamed a folder or a dataset.

I understand the value in your use case, but in general it’s often a bad design to read from a very large number of datasets, because each dataset is its own unit of security - so reading from thousands of datasets each containing a single file would negatively impact performance. Instead, for production workflows, we recommend uploading all non-structured files that will be processed simultaneously to a single dataset, and to unify tabular data representing the same type of data into a small number of large datasets.

Finally, there is a slightly simpler syntax to read a specific file from a dataset:

local_file = Dataset("raw_puzzle_input_text_files").files().get(f"{file}.txt").download()

with open(local_file, "r") as f:
    lines = f.readlines()

Thanks for the feedback.

Hoping to learn best practices.

Yes, I get what you are saying. For performance reasons it's better to have all these files in one tabular dataset and have users filter what they need, rather than lots of individual files.

This was quite easy to code:

import polars as pl


# This function would be registered with @lightweight and
# @transform(my_input=Input(...), my_output=Output(...)) as in the generator above
def text_files_to_table(
    my_input: LightweightInput,
    my_output: LightweightOutput,
):
    files = my_input.filesystem().ls()
    # Start with an empty frame and append one block of rows per text file
    df = pl.DataFrame(schema={"input": str, "file_name": str})
    for file_name in files:
        try:
            with my_input.filesystem().open(file_name.path, "rb") as f:
                lines = f.readlines()
            _df = pl.DataFrame({"input": [line.strip().decode() for line in lines]}).with_columns(
                pl.lit(file_name.path).alias("file_name")
            )
            df = df.extend(_df)
        except Exception as e:
            print(f"Error processing {file_name}: {str(e)}")
    my_output.write_table(df)

but I'm not sure if LightweightOutput has the ability to write partitioned tables, hence I just concatenated everything here. The dataset is small enough that it should be fine.
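With everything in one table, pulling out a single day's input in a notebook is then just a filter on file_name. A sketch reusing the read_table call from earlier ("processed_puzzle_inputs" is a placeholder for whatever the unified output dataset is called):

import polars as pl
from foundry.transforms import Dataset

# "processed_puzzle_inputs" is a placeholder name for the unified dataset
df = Dataset.get("processed_puzzle_inputs").read_table(format="polars")
day_01 = df.filter(pl.col("file_name") == "sample_day_01_part_1.txt")
lines = day_01["input"].to_list()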

Thanks for the tip re. get().