I (finally) got my code to work - now with the modification of making API calls with pagination to retrieve data from my dataset!
However, it still takes a fairly long time to retrieve and process rows, which may or may not be due to the string/text cleaning I had to add in order to get clean JSON output. Ideally, I’d like to get closer to retrieving hundreds of thousands of rows as opposed to just thousands.
I was wondering if there are any docs or examples on using concurrency in PySpark, especially since we’re limited to 30 concurrent API calls on the free tier. Or, if there are any other suggestions for doing this more efficiently, since the build process also takes a few minutes. (AIPAssist has been really helpful in generating this code, but it would be nice to better understand how it actually works!)
Let me know if more details would be helpful, but I thought I would ask in general first.
Hey, I’d be curious to know what you mean by “making API calls with pagination to retrieve data from my dataset!” With Pipeline Builder, the parallelization of processing a dataset is done for you.
The same goes for reading a dataset via the native Python/Java transforms libraries in code repositories.
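Just for reference, reading a dataset in a code repository with the Python transforms library looks roughly like this (the dataset paths are placeholders), and Spark takes care of distributing the read and any processing across executors:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/datasets/cleaned_output"),      # placeholder output path
    source_df=Input("/Project/datasets/raw_input"),  # placeholder input dataset
)
def compute(source_df):
    # source_df is a Spark DataFrame; Spark parallelizes reading it and
    # any transformations you apply across the cluster for you.
    return source_df
```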
Let me clarify! I am quite literally making a single giant GET request to retrieve my data (since I don’t think there’s a way to download it all and import it as a dataset unfortunately - not sure if this is the best way to do this, but it’s what made sense in the moment). So this step isn’t even processing/transforming data, just retrieving it in JSON format and adding it in like this: https://learn.palantir.com/speedrun-data-connection/1864458
Right now I’m using a code repository to do this, but I’m wondering if there’s a way to have it make multiple requests at once to speed things up (it takes around 20 min for 100k rows, which seems quite long).
Is there any chance that you could share your code (with sensitive details redacted as necessary)? There might be multiple non-standard aspects of your approach that make it difficult to give appropriate advice without seeing the whole picture.
I’m not sure if you’ve tried this already, but to speed up data retrieval when making API calls from PySpark, you could leverage concurrency with Python’s asyncio to handle multiple requests in parallel while respecting the 30-call limit - something like the sketch below.
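Very roughly, the pattern could look like this (just a sketch - the endpoint, page parameters, and page count are placeholders, and it assumes the API is paginated by page number):

```python
import asyncio

import requests

API_URL = "https://example.com/api/records"  # placeholder endpoint
PAGE_SIZE = 1000                             # assumed page size
MAX_CONCURRENT = 30                          # stay within the free-tier limit


async def fetch_page(semaphore, page):
    # asyncio.to_thread runs the blocking requests call in a worker thread,
    # so up to MAX_CONCURRENT pages can be in flight at once.
    async with semaphore:
        resp = await asyncio.to_thread(
            requests.get,
            API_URL,
            params={"page": page, "pageSize": PAGE_SIZE},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()


async def fetch_all(num_pages):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(
        *(fetch_page(semaphore, page) for page in range(num_pages))
    )


# e.g. pages = asyncio.run(fetch_all(100))
```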
Also in PySpark, you could distribute the requests themselves across partitions - for example, parallelize the list of page numbers and have each partition fetch its pages (via mapPartitions or a UDF), so the executors make calls in parallel; rough sketch below.
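Here’s a rough sketch of the partition idea - the endpoint, page parameter names, and the "results" key are all assumptions about your API, so adjust to your payload shape:

```python
import json

import requests
from pyspark.sql import SparkSession

API_URL = "https://example.com/api/records"  # placeholder endpoint
PAGE_SIZE = 1000                             # assumed page size


def fetch_partition(pages):
    # Runs on each executor: one blocking request per page in this partition.
    for page in pages:
        resp = requests.get(
            API_URL,
            params={"page": page, "pageSize": PAGE_SIZE},
            timeout=30,
        )
        resp.raise_for_status()
        for record in resp.json()["results"]:  # assumed payload key
            yield json.dumps(record)


spark = SparkSession.builder.getOrCreate()
num_pages = 100        # however many pages your dataset spans
num_partitions = 30    # keep at or below the concurrent-call limit

pages_rdd = spark.sparkContext.parallelize(range(num_pages), num_partitions)
json_rdd = pages_rdd.mapPartitions(fetch_partition)
df = spark.read.json(json_rdd)  # infers the schema from the JSON records
```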
Finally, batching requests and adding retry logic can help you avoid rate limits while keeping throughput up (see the sketch after this) - hope this helps!
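For the retries, a small wrapper with exponential backoff around each GET is usually enough (again just a sketch - which status codes to retry and how long to back off depend on the API):

```python
import time

import requests


def get_with_retries(url, params, max_retries=5, timeout=30):
    """Retry transient failures and 429 rate-limit responses with exponential backoff."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, params=params, timeout=timeout)
            if resp.status_code == 429:  # rate limited: back off and try again
                time.sleep(2 ** attempt)
                continue
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Exhausted {max_retries} retries for {url}")
```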