Hi all — thanks for the great discussion here!
I wanted to share a small snippet that might help others get started with geocoding in Foundry, especially if you want to test things locally first before turning them into a pipeline or a multi-threaded transformation. This approach uses the Nominatim API (OpenStreetMap) and can be a handy base to build on.
Here’s a Python example I’ve used to geocode U.S.-based addresses from a CSV file:
```python
import os
import time

import pandas as pd
import requests
from tqdm import tqdm

tqdm.pandas()


def geocode(address, city, state):
    # Build the query, skipping any empty components
    parts = [str(p).strip() for p in (address, city, state) if p and str(p).strip()]
    if not parts:
        return [None, None]
    search_address = ", ".join(parts) + ", USA"

    base_url = "https://nominatim.openstreetmap.org/search"
    params = {
        "q": search_address,
        "format": "json",
        "limit": 1,
        "addressdetails": 1,
        "countrycodes": "us",
        "email": "your_email",  # Replace with your contact email (per Nominatim's usage policy)
    }
    headers = {
        "User-Agent": "OSMGeoCode",  # Identify your application
    }
    try:
        response = requests.get(base_url, params=params, headers=headers, timeout=10)
        if response.status_code == 200:
            data = response.json()
            if data:
                lat = float(data[0]["lat"])
                lon = float(data[0]["lon"])
                time.sleep(1)  # Respect Nominatim's rate limit (1 request/sec)
                return [lat, lon]
        time.sleep(1)
        return [None, None]
    except Exception:
        time.sleep(1)
        return [None, None]
```
How to use:
```python
INPUT_FOLDER = "input"
OUTPUT_FOLDER = "output"

df = pd.read_csv(os.path.join(INPUT_FOLDER, "addresses.csv"))

# Assumes the first three columns are street address, city, and state
features_df = df.progress_apply(
    lambda row: geocode(row.iloc[0], row.iloc[1], row.iloc[2]),
    axis=1,
    result_type="expand",
)
features_df.columns = ["lat", "long"]

result_df = pd.concat([df, features_df], axis=1)
os.makedirs(OUTPUT_FOLDER, exist_ok=True)
result_df.to_csv(os.path.join(OUTPUT_FOLDER, "geocoded_addresses.csv"), index=False)
```
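For reference, the script reads the first three columns of the CSV by position, so it expects an input file shaped roughly like this (the header names below are just an example, not required):

```
street,city,state
1600 Pennsylvania Ave NW,Washington,DC
1 Infinite Loop,Cupertino,CA
```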
A few notes:
- This is designed for simple use cases, mostly small-to-medium datasets.
- It respects Nominatim’s usage policy, including rate limits (1 request/sec).
- This can be adapted into a Foundry Code Workbook or converted into a multi-threaded Spark transformation, depending on scale.
Next steps:
- Porting this into a Foundry transformation (happy to collaborate if others are on a similar path).
- Adding multi-threaded capability for use with larger datasets (while still respecting API limits).
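On that last point, here is a minimal sketch of how a shared rate limit could be honored across multiple worker threads. Everything here is a hypothetical illustration (the `RateLimiter` class and the function names are mine, not from any library): a lock hands out evenly spaced start times, so the pool as a whole never exceeds one request per interval.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most one call to start per `interval` seconds, across all threads."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_time = 0.0  # monotonic timestamp of the next free slot

    def wait(self):
        # Reserve the next slot under the lock, then sleep outside it
        # so other threads can queue up their own reservations.
        with self._lock:
            now = time.monotonic()
            wait_for = max(0.0, self._next_time - now)
            self._next_time = max(now, self._next_time) + self.interval
        if wait_for > 0:
            time.sleep(wait_for)


limiter = RateLimiter(interval=1.0)


def geocode_limited(row, geocode_fn):
    # geocode_fn stands in for the geocode() function above
    limiter.wait()
    return geocode_fn(*row)


def geocode_all(rows, geocode_fn, max_workers=4):
    # rows: iterable of (address, city, state) tuples
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: geocode_limited(r, geocode_fn), rows))
```

One honest caveat: with Nominatim's hard 1 request/sec limit, threads don't make the total run any faster than the sequential loop; this pattern only pays off against a self-hosted Nominatim instance or a geocoding service with a higher allowance, where you can raise the interval and worker count accordingly.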
Hope this helps as a starting point — would love to hear how others are solving this too!