My preferred method here would be to use Spark. Does the API use pagination? Do you have a way to determine the total number of pages or rows?
If so, you could do something like:
- Create a Spark DF with one row per page number (use `F.sequence` and `explode`; don't generate it on the driver), then call `df.groupBy(['page']).apply(my_pandas_udf)`.
- The pandas UDF calls the API to get the rows for that page and returns them (sketched below).
You can do this with a standard (non-pandas) UDF, but that won't let you batch as easily (you may find that you can do more than one page per executor).
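To make the pandas UDF route concrete, here's a minimal sketch using `applyInPandas` (the current spelling of `groupBy().apply()` with a grouped-map pandas UDF). The endpoint URL, page count, and response fields are placeholders for whatever your API actually returns, and it assumes the executors can reach the API directly:

```python
import pandas as pd
import requests
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

N_PAGES = 500                                  # assumed: page count obtained from the API beforehand
API_URL = "https://example.com/api/items"      # hypothetical endpoint

# One row per page, built with sequence/explode so the rows are generated
# on the executors rather than collected on the driver.
pages = (
    spark.range(1)
    .select(F.explode(F.sequence(F.lit(1), F.lit(N_PAGES))).alias("page"))
)

result_schema = StructType([
    StructField("page", IntegerType()),
    StructField("id", StringType()),           # assumed response fields
    StructField("value", StringType()),
])

def fetch_page(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group holds a single page number; call the API and return that page's rows.
    page = int(pdf["page"].iloc[0])
    resp = requests.get(API_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    rows = resp.json()["items"]                # assumed response shape
    out = pd.DataFrame(rows, columns=["id", "value"])
    out.insert(0, "page", page)
    return out

df = pages.groupBy("page").applyInPandas(fetch_page, schema=result_schema)
```

Because each group is one page, Spark spreads the HTTP calls across executors; if you do want batching, bucket several pages into one group key instead of grouping on the raw page number.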
If you want to keep your current methodology, just build DFs instead of writing to CSVs and then use `union_many` from the dataframes verbs to union them into a single tabular output (see the second sketch below). Creating the DFs will let Spark persist them to disk, so this should effectively do the same as your current method of writing CSVs.
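A minimal sketch of that route inside a Foundry transform, assuming `union_many` is importable from the dataframes verbs module; the endpoint, page count, and schema are placeholders standing in for your existing per-page logic:

```python
import requests
from transforms.api import transform_df, Output
from transforms.verbs.dataframes import union_many   # assumed import path

N_PAGES = 500                                 # assumed: known page count
API_URL = "https://example.com/api/items"     # hypothetical endpoint


def fetch_page_df(spark, page):
    # Replace "write one CSV per page" with "build one DF per page".
    resp = requests.get(API_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    rows = [(r["id"], r["value"]) for r in resp.json()["items"]]  # assumed shape
    return spark.createDataFrame(rows, schema="id string, value string")


@transform_df(Output("/path/to/output_dataset"))
def compute(ctx):
    page_dfs = [fetch_page_df(ctx.spark_session, p) for p in range(1, N_PAGES + 1)]
    return union_many(*page_dfs)
```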
NB that your idea of using `DRIVER_MEMORY_EXTRA_LARGE` will not be efficient. You're proposing to do everything single-node, so the additional memory won't really be used. You want the `DRIVER_MEMORY_OVERHEAD...` profiles instead, to give memory to the Python process rather than the Spark driver process.
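For reference, profiles are attached with the `@configure` decorator on the transform; the exact `DRIVER_MEMORY_OVERHEAD_*` name below is an assumption, so use whichever size is imported into your project:

```python
from transforms.api import configure, transform_df, Output


@configure(profile=["DRIVER_MEMORY_OVERHEAD_LARGE"])  # assumed profile name
@transform_df(Output("/path/to/output_dataset"))
def compute(ctx):
    # ... the per-page API calls and union from the sketch above ...
    return ctx.spark_session.range(1)  # placeholder body
```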