Using Apache Sedona in Code Repositories

We are working on geospatial data pipelines and trying to get Apache Sedona set up in Code Repositories, but we're having trouble actually using Sedona in our transforms. I've checked out this doc, which introduces the @geospatial decorator (https://www.palantir.com/docs/foundry/geospatial/vector_data_in_transforms), but it looks like this is being deprecated. Additionally, I tried to locate the geospatial-tools library within the Foundry code repository, but it doesn't show up when I search, so I'm unable to use the @geospatial decorator at all.

Would appreciate any advice on methods/best practices for using Sedona within transforms. I was able to apply the "GEOSPARK" Spark profile, but it only seems to configure the Spark cluster; it doesn't actually make Sedona's functions available in the transform.

It seems another user had this issue a few years ago: https://stackoverflow.com/questions/73254417/can-not-add-geospatial-tools-dependency-to-palantir-foundry-code-repository

Ideally, I'd like to interact with Sedona directly at the code level rather than in Pipeline Builder. We have some Sedona logic that we've verified runs on a local Spark cluster, but moving it into Foundry has proven to be quite a challenge.
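For reference, this is roughly the kind of setup that works for us locally (a simplified sketch; it assumes the Sedona jars are already on the local classpath):

from sedona.spark import SedonaContext

# Build a local Spark session preconfigured with Sedona's extensions and
# Kryo serializer, then register Sedona's SQL functions on it
config = SedonaContext.builder().master("local[*]").getOrCreate()
sedona = SedonaContext.create(config)

# Any Sedona SQL function is now available, e.g. constructing a point geometry
sedona.sql("SELECT ST_AsText(ST_Point(1.0, 2.0)) AS wkt").show()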

Thanks so much in advance for the help.

Sorry to hear that you’ve been having trouble getting this set up! Here’s a brief guide that I can confirm works now on a freshly bootstrapped transforms-python repository.

  1. Add the following section at the bottom of the transforms-python/build.gradle file. Note that the exact versions of these libraries available in your environment may differ (you can search in the Libraries panel of a transforms-java repository to see what is available). Also note that you may need to manually add external-jar to your repository’s backing artifact repositories (in my case, it was added automatically after an initial check run failed).
dependencies {
    condaJars "org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.6.0"
    condaJars "org.datasyslab:geotools-wrapper:1.6.0-28.2"
}
  2. Add the apache-sedona conda dependency (e.g. via the Libraries panel, which records it in the repository's conda recipe).

  3. Verify that things work with the following sample code. It seems that SedonaRegistrator doesn't work properly in Code Repositories preview, but it worked fine for me in VSCode Preview and during an actual build.

from pyspark.sql import functions as F
from sedona.register import SedonaRegistrator
from sedona.sql import st_constructors as stc
from transforms.api import Output, transform_df


@transform_df(
    Output("..."),
)
def compute(ctx):
    # Register Sedona's SQL functions, types, and serializers on this session
    SedonaRegistrator.registerAll(ctx.spark_session)
    df = ctx.spark_session.sql("SELECT array(0.0, 1.0, 2.0) AS values")

    min_value = F.array_min("values")
    max_value = F.array_max("values")

    # Use a Sedona constructor to prove the registration worked end to end
    return df.select(stc.ST_Point(min_value, max_value).alias("point"))
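
As an even quicker smoke test (a sketch; any registered Sedona SQL function will do), you can run a raw SQL expression inside the transform; if it resolves, Sedona is wired up:

# ST_Point/ST_AsText only resolve once Sedona's functions are registered
ctx.spark_session.sql("SELECT ST_AsText(ST_Point(0.0, 1.0)) AS wkt").show()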

I used a slightly different approach: creating my own decorator in utils.py. I've had good results with it, and it works in Code Repositories previews.

If you use SedonaContext.create(), it automatically handles all function registration, serializer configuration, data source registration, and the needed imports.
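Outside of any decorator, the minimal direct call looks like this (a sketch, reusing the Output("...") placeholder style from the post above):

from sedona.spark import SedonaContext
from transforms.api import Output, transform_df


@transform_df(
    Output("..."),
)
def compute(ctx):
    # create() upgrades the existing session in place and also returns it
    sedona = SedonaContext.create(ctx.spark_session)
    return sedona.sql("SELECT ST_Point(0.0, 1.0) AS geom")

Wrapped into a decorator, utils.py looks like this: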

from typing import Any, Callable, Optional

from pyspark.sql import SparkSession
from sedona.spark import SedonaContext
from transforms.api import Transform


def geospatial(spark_session: Optional[SparkSession] = None):
    """
    Transform decorator that registers Sedona functions and data sources for this Spark session
    :param spark_session: (Spark Session) [optional] required for use in Code Workbooks
    :return: either a valid Transform object (Authoring) or the transform function (Code Workbooks)
    """

    def _geospatial(transform: Any):
        if isinstance(transform, Transform):
            # Authoring - spark session can be derived from ctx
            def register_compute(compute: Callable):
                def compute_wrapper(ctx, *args, **kwargs):
                    # SedonaContext.create() enhances the existing session in-place:
                    # it modifies the session it is given rather than building a new one
                    SedonaContext.create(ctx.spark_session)

                    # The original spark session now has Sedona capabilities
                    return compute(ctx, *args, **kwargs)

                return compute_wrapper

            transform.compute = register_compute(transform.compute)
        elif spark_session is not None:
            # Code Workbooks - no ctx available, so register against the
            # explicitly provided session at decoration time
            SedonaContext.create(spark_session)
        return transform

    return _geospatial
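
With the spark_session branch in place, usage in Code Workbooks would look something like this (a sketch; spark is the session Code Workbooks exposes, and the lon/lat columns are hypothetical):

@geospatial(spark_session=spark)
def my_transform(source_df):
    # Sedona's functions are registered, so expressions like ST_Point resolve
    return source_df.selectExpr("ST_Point(lon, lat) AS geom")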

You still need to specify the external JARs in build.gradle (the 3.5_2.12 classifier must match your repository's Spark and Scala versions, and the geotools-wrapper version must pair with your Sedona version):

dependencies {
    condaJars "org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.7.2"
    condaJars "org.datasyslab:geotools-wrapper:1.7.2-28.5"
}

And add the apache-sedona conda dependency as mentioned above.

Then you can use the decorator like this:

from sedona.sql import st_constructors as stc
from sedona.sql import st_functions as stf
from sedona.sql import st_aggregates as sta
from sedona.sql import st_predicates as stp
from sedona.stats.clustering.dbscan import dbscan
from transforms.api import Input, Output, transform

from myproject.datasets.utils import geospatial


@geospatial()
@transform(
    out=Output(""),
    source=Input(""),
)
def compute(ctx, out, source):
    # Illustrative body: hypothetical lon/lat columns become a point geometry
    out.write_dataframe(
        source.dataframe().withColumn("geom", stc.ST_Point("lon", "lat"))
    )