This is a copy of my blog post at https://sites.google.com/view/raybellwaves/blog/advent-of-code-on-palantir
For this year’s Advent of Code (AOC) I went OTT (over the top) setting up my dev environment! AOC is always a good chance to learn something new. This year I’m using Palantir Foundry. The puzzles are small enough that you can solve them on your laptop (or even mechanical devices!) but I want to learn about automation on Palantir and ways to organize/simplify a project for many users.
The goal is: how can I provide a dev environment for myself and my colleagues with minimal steps?
I automated as much as I could in a couple of hours. I didn’t want to be the embodiment of https://xkcd.com/1319/ so there are still some things left on the table and maybe I’ll come back to this after chatting with colleagues and Palantir Engineers.
Palantir has two types of “Datasets”: a tabular file or a folder of generic files. AOC puzzle inputs are text files, so I opted to use a folder of files where the raw text files are uploaded. The files need a naming convention to make life easier for everyone involved (I wonder if you could add checks on the data added to the folder, such as requiring a .txt suffix and a name matching the convention; a rough sketch of such a check follows below).
For AOC, each day there is a sample input for you to work on and try to get the expected answer. These files will be named “sample_day_DD_part_N.txt”. User files should be named “firstname_lastname_day_DD_input.txt”. These files are then dragged into the UI.
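As a sketch of the kind of check hinted at above (this is plain Python with hypothetical names, not any Foundry API), file names could be validated against the convention before they are processed:

import re

# Accept only .txt files that follow one of the two naming conventions above
VALID_NAME = re.compile(
    r"^(sample_day_\d{2}_part_\d|[a-z]+_[a-z]+_day_\d{2}_input)\.txt$"
)

def is_valid_puzzle_file(file_name: str) -> bool:
    return VALID_NAME.match(file_name) is not None

assert is_valid_puzzle_file("sample_day_01_part_1.txt")
assert is_valid_puzzle_file("firstname_lastname_day_01_input.txt")
assert not is_valid_puzzle_file("day1.csv")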
In theory there is no need to process these files as you can work with the text files in any language. However, to easily access these files across all Palantir tools we are going to store them as a table, with each row being one line of the text file stored as a VARCHAR.
I don’t need full-blown Spark for this so I’m just using Polars. The code to save the text file as a dataframe is:
with open("file.txt", "rb") as f:
lines = f.readlines()
df = pl.DataFrame({"input": [line.strip().decode() for line in lines]})
We want to apply this to all files in the folder so we’ll use a transform. We also need to use a transform generator to run these in parallel:
from transforms.api import Input, Output, lightweight, transform
import polars as pl

def transform_generator(file_names: list):
    transforms = []
    for file_name in file_names:

        @lightweight
        @transform(
            my_input=Input("ri.foundry.main.dataset.XXX"),
            my_output=Output(f"/FOLDER/{file_name}"),
        )
        def text_file_to_table(my_input, my_output, file_name=file_name):
            # file_name is bound as a default argument so each generated
            # transform captures its own file name from the loop.
            try:
                _file = next(my_input.filesystem().ls(glob=f"{file_name}.txt"), None)
                with my_input.filesystem().open(_file.path, "rb") as f:
                    lines = f.readlines()
                my_output.write_table(
                    pl.DataFrame({"input": [line.strip().decode() for line in lines]})
                )
            except Exception as e:
                print(f"Error processing {file_name}: {str(e)}")

        transforms.append(text_file_to_table)
    return transforms
file_names = ["...", "..."]
TRANSFORMS = transform_generator(file_names)
(it would be nice to have this build trigger when a file is uploaded)
Next we will access these files in a Jupyter notebook. Palantir keeps things as separate as possible by default for security and auditability. To access Datasets in a Jupyter notebook you have to click an import button. When you do this it creates a file called .foundry/aliases.yml with the RIDs (Resource Identifiers). Ain’t nobody got time to click import 25 times, so there is a way to automate this (it would be nice if there was an “import all datasets in a project” button).
First you need to create a dictionary which contains the datasets and their RIDs. I stumbled a bit here:
- I was getting permission errors using the Foundry Platform SDK.
- Jupyter notebooks have some environment variables such as TOKEN and HOSTNAME but I couldn’t use these to authenticate. I’m not sure if I need to set up an egress policy for this.
I worked out a solution using foundry-dev-tools:
from foundry_dev_tools import FoundryContext
ctx = FoundryContext()
folder_rid = "ri.compass.main.folder.XXX"  # RID of the project folder that holds the datasets
children = list(ctx.compass.get_child_objects_of_folder(folder_rid))
data = {f["name"]: {"rid": f["rid"]} for f in children}
(It would be nice to automate this on startup in a Jupyter notebook.)
This dictionary contains all the processed files and therefore needs to be filtered down to just the current user’s files, the sample data and the raw dataset.
import os

user = os.environ["GIT_AUTHOR_NAME"].lower().replace(" ", "_")
filter_terms = [user, "sample", "raw_puzzle_input_text_files"]
filtered_data = {k: v for k, v in data.items() if any(term in k.lower() for term in filter_terms)}
Now the aliases can be saved to the expected YAML file:
import yaml

with open('.foundry/aliases.yml', 'w') as f:
    yaml.dump(filtered_data, f, default_flow_style=False)
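For illustration, the resulting .foundry/aliases.yml simply maps each dataset name to its RID; with the dictionary built above it would look roughly like this (dataset names and RIDs here are placeholders):

firstname_lastname_day_01_input:
  rid: ri.foundry.main.dataset.XXX
sample_day_01_part_1:
  rid: ri.foundry.main.dataset.YYY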
This logic is saved in a Python file (it would be nice to run this on startup for any new Jupyter workspace in a project).
A user can open a notebook and run
%run /home/user/repo/import_datasets.py
They can then work with the data as:
from foundry.transforms import Dataset
# Read the processed data in a format of your choice
file = "sample_day_01_part_1"
arrow_table = Dataset.get(file).read_table(format="arrow")
pandas_df = Dataset.get(file).read_table(format="pandas")
try:
    polars_df = Dataset.get(file).read_table(format="polars")
except ModuleNotFoundError:
    polars_df = None

# Create your own logic to parse the text files
local_file = (
    Dataset("raw_puzzle_input_text_files")
    .files()
    .filter(lambda f: f.path == f"{file}.txt")
    .download()
)
with open(local_file[f"{file}.txt"], "r") as f:
    lines = f.readlines()
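As a small illustration of “your own logic”, the downloaded lines can then be cleaned up and converted into whatever structure the day’s puzzle needs (purely an example, not tied to a specific puzzle):

# Strip trailing newlines and drop any blank lines
parsed = [line.strip() for line in lines if line.strip()]
# For a puzzle whose input is one number per line, you might then do:
# numbers = [int(line) for line in parsed]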