My preferred method here would be to use Spark. Does the API use pagination? Do you have a way to determine the total number of pages or rows?
If so, you could do something like:
- Create a Spark DF with one row per page number (use `F.sequence` and `explode`; don't generate it on the driver), then call `df.groupBy(['page']).apply(my_pandas_udf)`.
- The pandas UDF calls the API to get the rows for that page and returns them (sketched below).
You can do this with a standard (non-pandas) UDF, but that won't let you batch as easily (you may find that you can do more than one page per executor).
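To make the pandas UDF route concrete, here's a minimal sketch using `applyInPandas` (the current spelling of `groupBy().apply()` with a grouped-map pandas UDF). The endpoint URL, page count, and response fields are placeholders for whatever your API actually returns, and it assumes the executors can reach the API directly:

```python
import pandas as pd
import requests
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

N_PAGES = 500                                  # assumed: page count obtained from the API beforehand
API_URL = "https://example.com/api/items"      # hypothetical endpoint

# One row per page, built with sequence/explode so the rows are generated
# on the executors rather than collected on the driver.
pages = (
    spark.range(1)
    .select(F.explode(F.sequence(F.lit(1), F.lit(N_PAGES))).alias("page"))
)

result_schema = StructType([
    StructField("page", IntegerType()),
    StructField("id", StringType()),           # assumed response fields
    StructField("value", StringType()),
])

def fetch_page(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group holds a single page number; call the API and return that page's rows.
    page = int(pdf["page"].iloc[0])
    resp = requests.get(API_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    rows = resp.json()["items"]                # assumed response shape
    out = pd.DataFrame(rows, columns=["id", "value"])
    out.insert(0, "page", page)
    return out

df = pages.groupBy("page").applyInPandas(fetch_page, schema=result_schema)
```

Because each group is one page, Spark spreads the HTTP calls across executors; if you do want batching, bucket several pages into one group key instead of grouping on the raw page number.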
If you want to keep your current methodology, just build DFs instead of writing to CSVs and then use `union_many` from the dataframes verbs to union them into a single tabular output (see the second sketch below). Creating the DFs will let Spark persist them to disk, so this should effectively do the same as your current method of writing CSVs.
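A minimal sketch of that route inside a Foundry transform, assuming `union_many` is importable from the dataframes verbs module; the endpoint, page count, and schema are placeholders standing in for your existing per-page logic:

```python
import requests
from transforms.api import transform_df, Output
from transforms.verbs.dataframes import union_many   # assumed import path

N_PAGES = 500                                 # assumed: known page count
API_URL = "https://example.com/api/items"     # hypothetical endpoint


def fetch_page_df(spark, page):
    # Replace "write one CSV per page" with "build one DF per page".
    resp = requests.get(API_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    rows = [(r["id"], r["value"]) for r in resp.json()["items"]]  # assumed shape
    return spark.createDataFrame(rows, schema="id string, value string")


@transform_df(Output("/path/to/output_dataset"))
def compute(ctx):
    page_dfs = [fetch_page_df(ctx.spark_session, p) for p in range(1, N_PAGES + 1)]
    return union_many(*page_dfs)
```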
NB that your idea of using `DRIVER_MEMORY_EXTRA_LARGE` will not be efficient. You're proposing to do everything single-node, so the additional memory won't really be used. You want the `DRIVER_MEMORY_OVERHEAD...` profiles instead, to give memory to the Python process rather than the Spark driver process.
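For reference, profiles are attached with the `@configure` decorator on the transform; the exact `DRIVER_MEMORY_OVERHEAD_*` name below is an assumption, so use whichever size is imported into your project:

```python
from transforms.api import configure, transform_df, Output


@configure(profile=["DRIVER_MEMORY_OVERHEAD_LARGE"])  # assumed profile name
@transform_df(Output("/path/to/output_dataset"))
def compute(ctx):
    # ... the per-page API calls and union from the sketch above ...
    return ctx.spark_session.range(1)  # placeholder body
```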