Hi all — thanks for the great discussion here!
I wanted to share a small snippet that might help others get started with geocoding in Foundry, especially if you want to test things locally first before turning them into a pipeline or a multi-threaded transformation. This approach uses the Nominatim API (OpenStreetMap) and can be a handy base to build on.
Here’s a Python example I’ve used to geocode U.S.-based addresses from a CSV file:
```python
import os
import time

import pandas as pd
import requests
from tqdm import tqdm

tqdm.pandas()


def geocode(address, city, state):
    # Build the query, skipping any empty components
    parts = [str(p).strip() for p in (address, city, state) if p and str(p).strip()]
    if not parts:
        return [None, None]
    search_address = ", ".join(parts) + ", USA"

    base_url = "https://nominatim.openstreetmap.org/search"
    params = {
        "q": search_address,
        "format": "json",
        "limit": 1,
        "addressdetails": 1,
        "countrycodes": "us",
        "email": "your_email",  # Replace with your contact email (per Nominatim's usage policy)
    }
    headers = {
        "User-Agent": "OSMGeoCode",  # Identify your application
    }
    try:
        response = requests.get(base_url, params=params, headers=headers, timeout=10)
        if response.status_code == 200:
            data = response.json()
            if data:
                lat = float(data[0]["lat"])
                lon = float(data[0]["lon"])
                time.sleep(1)  # Respect Nominatim's rate limit (1 request/sec)
                return [lat, lon]
        time.sleep(1)
        return [None, None]
    except Exception:
        time.sleep(1)
        return [None, None]
```
How to use:
```python
INPUT_FOLDER = "input"
OUTPUT_FOLDER = "output"

df = pd.read_csv(os.path.join(INPUT_FOLDER, "addresses.csv"))

# Assumes the first three columns are street address, city, and state
features_df = df.progress_apply(
    lambda row: geocode(row.iloc[0], row.iloc[1], row.iloc[2]),
    axis=1,
    result_type="expand",
)
features_df.columns = ["lat", "long"]

result_df = pd.concat([df, features_df], axis=1)
os.makedirs(OUTPUT_FOLDER, exist_ok=True)
result_df.to_csv(os.path.join(OUTPUT_FOLDER, "geocoded_addresses.csv"), index=False)
```
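For reference, the script reads the first three columns of the CSV by position, so it expects an input file shaped roughly like this (the header names below are just an example, not required):

```
street,city,state
1600 Pennsylvania Ave NW,Washington,DC
1 Infinite Loop,Cupertino,CA
```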
A few notes:
- This is designed for simple use cases, mostly small-to-medium datasets.
- It respects Nominatim’s usage policy, including rate limits (1 request/sec).
- This can be adapted into a Foundry Code Workbook or converted into a multi-threaded Spark transformation, depending on scale.
Next steps:
- Porting this into a Foundry transformation (happy to collaborate if others are on a similar path).
- Adding multi-threaded capability for use with larger datasets (while still respecting API limits).
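On that last point, here is a minimal sketch of how a shared rate limit could be honored across multiple worker threads. Everything here is a hypothetical illustration (the `RateLimiter` class and the function names are mine, not from any library): a lock hands out evenly spaced start times, so the pool as a whole never exceeds one request per interval.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most one call to start per `interval` seconds, across all threads."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_time = 0.0  # monotonic timestamp of the next free slot

    def wait(self):
        # Reserve the next slot under the lock, then sleep outside it
        # so other threads can queue up their own reservations.
        with self._lock:
            now = time.monotonic()
            wait_for = max(0.0, self._next_time - now)
            self._next_time = max(now, self._next_time) + self.interval
        if wait_for > 0:
            time.sleep(wait_for)


limiter = RateLimiter(interval=1.0)


def geocode_limited(row, geocode_fn):
    # geocode_fn stands in for the geocode() function above
    limiter.wait()
    return geocode_fn(*row)


def geocode_all(rows, geocode_fn, max_workers=4):
    # rows: iterable of (address, city, state) tuples
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: geocode_limited(r, geocode_fn), rows))
```

One honest caveat: with Nominatim's hard 1 request/sec limit, threads don't make the total run any faster than the sequential loop; this pattern only pays off against a self-hosted Nominatim instance or a geocoding service with a higher allowance, where you can raise the interval and worker count accordingly.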
Hope this helps as a starting point — would love to hear how others are solving this too!