Using GraphFrames for Python Transforms in Code Repos

GraphFrames is a PySpark wrapper around GraphX, a Scala-based distributed graph processing library, and lets you run and scale graph algorithms on Spark DataFrames.

Setup

The following steps were working as of the last test on November 12, 2025.

1. Install GraphFrames into your Code Repository

Your meta.yaml should list graphframes in its requirements / run section.

requirements:
  # Tools required to build the package. These packages are run on the build system and include
  # things such as revision control systems (Git, SVN) make tools (GNU make, Autotool, CMake) and
  # compilers (real cross, pseudo-cross, or native when not cross-compiling), and any source pre-processors.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#build
  build:
    - python
    - setuptools
  # Packages required to run the package. These are the dependencies that are installed automatically
  # whenever the package is installed.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#run
  run:
    - python
    - transforms {{ PYTHON_TRANSFORMS_VERSION }}
    - transforms-expectations
    - transforms-verbs
    - graphframes

2. Add the required dependency

When is a door not a door? When it’s a [conda] jar.

Add the following condaJars dependency to your ~/transforms-python/build.gradle file as a new line:
dependencies { condaJars 'graphframes:graphframes:0.8.0-spark3.0-s_2.12' }

Note that it is worth looking at the GraphFrames releases and matching the jar coordinates to the release you install. The coordinates above work for v0.8.0 built against Spark 3.0 and Scala 2.12.

buildscript {
    repositories {
        maven {
            credentials {
                username ''
                password transformsBearerToken
            }
            authentication {
                basic(BasicAuthentication)
            }
            url project.transformsMavenProxyRepoUri
        }
    }

    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python-defaults'
dependencies { condaJars 'graphframes:graphframes:0.8.0-spark3.0-s_2.12' }

3. Rebuild your workspace

This is required so that the environment picks up the new dependency.


Here’s how you would use it

1. Construct a graph

from graphframes import GraphFrame
graph = GraphFrame(vertices, edges)

Here, vertices and edges are PySpark DataFrames:
→ vertices should contain an id column
→ edges should (at least) contain a src and a dst column
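As a shape reference, here is a minimal, hypothetical dataset expressed as plain Python rows. Only the column names id, src, and dst are what GraphFrames expects; the other columns and all the values are made up for illustration. In a transform you would pass these rows to spark.createDataFrame before building the GraphFrame.

```python
# Hypothetical example data; only the "id", "src", and "dst" column
# names are required by GraphFrames -- everything else is illustrative.
vertices_rows = [
    {"id": "a", "name": "Alice"},
    {"id": "b", "name": "Bob"},
    {"id": "c", "name": "Carol"},
]
edges_rows = [
    {"src": "a", "dst": "b", "relationship": "follows"},
    {"src": "b", "dst": "c", "relationship": "follows"},
]

# Sanity check: every edge endpoint should refer to a known vertex id,
# otherwise GraphFrames algorithms can silently drop those edges.
vertex_ids = {row["id"] for row in vertices_rows}
assert all(e["src"] in vertex_ids and e["dst"] in vertex_ids for e in edges_rows)

# Inside a transform you would then do something like:
# vertices = ctx.spark_session.createDataFrame(vertices_rows)
# edges = ctx.spark_session.createDataFrame(edges_rows)
# graph = GraphFrame(vertices, edges)
```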

2. And this is important → Add a checkpoint directory

This directory is used by the library to store intermediate results, and it must be accessible to the node that runs the computation in the cluster when the build runs.

import os
import tempfile
...
def transform(ctx, ...):
    ...
    checkpoint_dir = tempfile.gettempdir()
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    ctx.spark_session.sparkContext.setCheckpointDir(checkpoint_dir)
    ...

3. Use the graphframes library!

For example, if you need connected components, you can do something like:

from pyspark.sql import functions as F

result = graph.connectedComponents(algorithm="graphx")
component_reps = result.groupBy("component").agg(F.min(F.col("id")).alias("component_representative"))

Passing algorithm="graphx" was crucial in getting this to work: it tells connectedComponents which engine to use for the computation.
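To make the groupBy/min step concrete, here is a pure-Python sketch (no Spark) of what the component-representative aggregation computes. The rows are hypothetical output in the shape connectedComponents returns: each vertex id labeled with a component.

```python
from collections import defaultdict

# Hypothetical rows in the shape connectedComponents produces:
# each vertex id labeled with the component it belongs to.
result_rows = [
    {"id": "a", "component": 1},
    {"id": "b", "component": 1},
    {"id": "c", "component": 2},
]

# Pure-Python equivalent of groupBy("component").agg(F.min("id")):
# the representative of each component is its smallest vertex id.
by_component = defaultdict(list)
for row in result_rows:
    by_component[row["component"]].append(row["id"])

component_reps = {comp: min(ids) for comp, ids in by_component.items()}
print(component_reps)  # {1: 'a', 2: 'c'}
```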
