We are working on geospatial data pipelines and trying to get Apache Sedona set up within code repositories. We’re having trouble actually being able to use Sedona in the transforms, though. I’ve checked out this doc which introduces the @geospatial decorator (https://www.palantir.com/docs/foundry/geospatial/vector_data_in_transforms) but it looks like this is being deprecated. Additionally, I tried to locate the geospatial-tools library within the Foundry code repository, but it’s not showing up when I search (so I’m actually unable to use the @geospatial decorator at all).
Would appreciate any advice here on methods/best practices for using Sedona within transforms. I was able to enable the “GEOSPARK” Spark profile, but it looks like that only configures the Spark cluster and doesn’t actually make Sedona usable in transforms.
It seems another user had this issue a few years ago: https://stackoverflow.com/questions/73254417/can-not-add-geospatial-tools-dependency-to-palantir-foundry-code-repository
Ideally, I’d like to be able to interact directly with Sedona at the code level rather than in Pipeline Builder. We have some Sedona logic that we’ve verified runs on a local Spark cluster, but moving it into Foundry has proven to be quite a challenge.
Sorry to hear that you’ve been having trouble getting this set up! Here’s a brief guide that I can confirm works now on a freshly bootstrapped transforms-python repository.
Add the following section at the bottom of the transforms-python/build.gradle file. Note that the exact versions of these libraries available in your environment may differ (you can search in the Libraries panel of a transforms-java repository to see what is available). Also note that you may need to manually add external-jar to your repository’s backing artifact repositories (in my case, it was added automatically after an initial check run failed).
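As a sketch of what that section might look like (the exact Maven coordinates and versions here are assumptions — match them to what the Libraries panel shows for your environment, and keep the `3.x_2.1y` suffix aligned with your Spark and Scala versions):

```groovy
// transforms-python/build.gradle (excerpt, illustrative versions)
dependencies {
    // Sedona's shaded Spark jar, plus the GeoTools wrapper it needs for CRS transforms.
    // The "3.4_2.12" suffix must match your Spark and Scala versions.
    condaJars "org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.1"
    condaJars "org.datasyslab:geotools-wrapper:1.5.1-28.2"
}
```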
Verify that things work with the following sample code. It seems that SedonaRegistrator doesn’t work properly in Code Repositories preview, but it worked fine for me in VSCode Preview and during an actual build.
from pyspark.sql import functions as F
from sedona.register import SedonaRegistrator
from sedona.sql import st_constructors as stc
from transforms.api import Output, transform_df
@transform_df(
    Output("..."),
)
def compute(ctx):
    SedonaRegistrator.registerAll(ctx.spark_session)
    df = ctx.spark_session.sql("SELECT array(0.0, 1.0, 2.0) AS values")
    min_value = F.array_min("values")
    max_value = F.array_max("values")
    return df.select(stc.ST_Point(min_value, max_value).alias("point"))
I used a slightly different approach: creating my own decorator in utils.py. I’ve had good results with it, and it works in Code Repositories previews.
If you use SedonaContext.create(), it automatically handles all function registration, serializer configuration, and data source registration for you.
from typing import Any, Callable

from pyspark.sql import SparkSession
from sedona.spark import SedonaContext
from transforms.api import Transform


def geospatial(spark_session: SparkSession = None):
    """
    Transform decorator that registers Sedona functions and data sources for this Spark session.

    :param spark_session: (SparkSession) [optional] required for use in Code Workbooks
    :return: either a valid Transform object (Authoring) or the transform function (Code Workbooks)
    """
    def _geospatial(transform: Any):
        if isinstance(transform, Transform):
            # Authoring - the spark session can be derived from ctx at compute time
            def register_compute(compute: Callable):
                def compute_wrapper(ctx, *args, **kwargs):
                    # SedonaContext.create() enhances the existing session in place:
                    # it registers Sedona's functions, serializers, and data sources
                    # on the session rather than returning a separate one
                    SedonaContext.create(ctx.spark_session)
                    # The original spark session now has Sedona capabilities
                    return compute(ctx, *args, **kwargs)
                return compute_wrapper
            transform.compute = register_compute(transform.compute)
            return transform
        # Code Workbooks - register on the explicitly supplied session and
        # return the decorated function unchanged
        if spark_session is not None:
            SedonaContext.create(spark_session)
        return transform
    return _geospatial
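The trick the decorator relies on (replacing a Transform’s compute attribute with a wrapper that runs setup first) can be sketched in plain Python, independent of Sedona or Foundry. Everything here is illustrative: the Transform class is a stand-in for transforms.api.Transform, and setup stands in for SedonaContext.create().

```python
from typing import Callable


class Transform:
    """Stand-in for transforms.api.Transform: just holds a compute callable."""
    def __init__(self, compute: Callable):
        self.compute = compute


def with_setup(setup: Callable):
    """Decorator factory: run `setup(ctx)` before the transform's compute."""
    def _decorate(transform: Transform) -> Transform:
        original = transform.compute

        def compute_wrapper(ctx, *args, **kwargs):
            setup(ctx)  # e.g. SedonaContext.create(ctx.spark_session)
            return original(ctx, *args, **kwargs)

        transform.compute = compute_wrapper
        return transform
    return _decorate


calls = []
t = Transform(lambda ctx: calls.append(("compute", ctx)))
t = with_setup(lambda ctx: calls.append(("setup", ctx)))(t)
t.compute("my_ctx")
# calls is now [("setup", "my_ctx"), ("compute", "my_ctx")]
```

The key design point is that the wrapper is attached at import time but the setup only runs when Foundry actually invokes compute, which is when a real ctx (and its Spark session) exists.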
You still need to specify the external-jar backing repository and jars in build.gradle as described above, and add the apache-sedona Python package as a dependency.
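Assuming the standard transforms-python layout, the Python dependency ends up in the conda recipe’s run requirements (adding it via the Libraries panel should edit this file for you; the exact path may differ in your repository):

```yaml
# transforms-python/conda_recipe/meta.yaml (excerpt)
requirements:
  run:
    - apache-sedona
```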
Then you can use the decorator like this:
from sedona.sql import st_constructors as stc
from sedona.sql import st_functions as stf
from sedona.sql import st_aggregates as sta
from sedona.sql import st_predicates as stp
from sedona.stats.clustering.dbscan import dbscan
from transforms.api import Input, Output, transform
from myproject.datasets.utils import geospatial
@geospatial()
@transform(
    out=Output(""),
    source=Input(""),
)
def compute(ctx, out, source):
    df = source.dataframe()
    # Sedona SQL functions are registered on ctx.spark_session by the decorator;
    # e.g. build a geometry column (the lon/lat column names are illustrative):
    df = df.withColumn("geom", stc.ST_Point("longitude", "latitude"))
    out.write_dataframe(df)