I (finally) got my code to work - now with the modification of making API calls with pagination to retrieve data from my dataset!
However, it still takes a fairly long time to retrieve and process rows, which may or may not be due to the string/text cleaning I had to add in order to get clean JSON output. Ideally, I’d like to get closer to retrieving hundreds of thousands of rows as opposed to just thousands.
I was wondering if there are any docs or examples on using concurrency in PySpark, especially since we’re limited to 30 concurrent API calls on the free tier. Or, if there are any other suggestions for doing this more efficiently, since the build process also takes a few minutes. (AIPAssist has been really helpful in generating this code, but it would be nice to better understand how it actually works!)
Let me know if more details would be helpful, but I thought I would ask in general first.
Hey, I’d be curious to know what you mean by “making API calls with pagination to retrieve data from my dataset!” With Pipeline Builder, the parallelization of processing a dataset is done for you.
The same goes for reading a dataset via the native Python/Java transforms libraries in code repositories.
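Just for reference, reading a dataset in a code repository with the Python transforms library looks roughly like this (the dataset paths are placeholders), and Spark takes care of distributing the read and any processing across executors:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/datasets/cleaned_output"),      # placeholder output path
    source_df=Input("/Project/datasets/raw_input"),  # placeholder input dataset
)
def compute(source_df):
    # source_df is a Spark DataFrame; Spark parallelizes reading it and
    # any transformations you apply across the cluster for you.
    return source_df
```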
Let me clarify! I am quite literally making a single giant GET request to retrieve my data (since I don’t think there’s a way to download it all and import it as a dataset unfortunately - not sure if this is the best way to do this, but it’s what made sense in the moment). So this step isn’t even processing/transforming data, just retrieving it in JSON format and adding it in like this: https://learn.palantir.com/speedrun-data-connection/1864458
Right now I’m using a code repository to do this, but I’m wondering if there’s a way to have it make multiple requests at once to speed things up (it takes around 20 min for 100k rows, which seems quite long).
Is there any chance that you could share your code (with sensitive details redacted as necessary)? There might be multiple non-standard aspects of your approach that make it difficult to give appropriate advice without seeing the whole picture.
I’m not sure if you’ve tried this already, but to speed up data retrieval when making API calls from PySpark, you could leverage concurrency with Python’s asyncio to handle multiple requests in parallel while respecting the 30-call limit - something like the sketch below.
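Very roughly, the pattern could look like this (just a sketch - the endpoint, page parameters, and page count are placeholders, and it assumes the API is paginated by page number):

```python
import asyncio

import requests

API_URL = "https://example.com/api/records"  # placeholder endpoint
PAGE_SIZE = 1000                             # assumed page size
MAX_CONCURRENT = 30                          # stay within the free-tier limit


async def fetch_page(semaphore, page):
    # asyncio.to_thread runs the blocking requests call in a worker thread,
    # so up to MAX_CONCURRENT pages can be in flight at once.
    async with semaphore:
        resp = await asyncio.to_thread(
            requests.get,
            API_URL,
            params={"page": page, "pageSize": PAGE_SIZE},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()


async def fetch_all(num_pages):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(
        *(fetch_page(semaphore, page) for page in range(num_pages))
    )


# e.g. pages = asyncio.run(fetch_all(100))
```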
Also in PySpark, you could distribute the requests themselves across partitions - for example, parallelize the list of page numbers and have each partition fetch its pages (via mapPartitions or a UDF), so the executors make calls in parallel; rough sketch below.
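Here’s a rough sketch of the partition idea - the endpoint, page parameter names, and the "results" key are all assumptions about your API, so adjust to your payload shape:

```python
import json

import requests
from pyspark.sql import SparkSession

API_URL = "https://example.com/api/records"  # placeholder endpoint
PAGE_SIZE = 1000                             # assumed page size


def fetch_partition(pages):
    # Runs on each executor: one blocking request per page in this partition.
    for page in pages:
        resp = requests.get(
            API_URL,
            params={"page": page, "pageSize": PAGE_SIZE},
            timeout=30,
        )
        resp.raise_for_status()
        for record in resp.json()["results"]:  # assumed payload key
            yield json.dumps(record)


spark = SparkSession.builder.getOrCreate()
num_pages = 100        # however many pages your dataset spans
num_partitions = 30    # keep at or below the concurrent-call limit

pages_rdd = spark.sparkContext.parallelize(range(num_pages), num_partitions)
json_rdd = pages_rdd.mapPartitions(fetch_partition)
df = spark.read.json(json_rdd)  # infers the schema from the JSON records
```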
Finally, batching requests and adding retry logic can help you avoid rate limits while keeping throughput up (see the sketch after this) - hope this helps!
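For the retries, a small wrapper with exponential backoff around each GET is usually enough (again just a sketch - which status codes to retry and how long to back off depend on the API):

```python
import time

import requests


def get_with_retries(url, params, max_retries=5, timeout=30):
    """Retry transient failures and 429 rate-limit responses with exponential backoff."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, params=params, timeout=timeout)
            if resp.status_code == 429:  # rate limited: back off and try again
                time.sleep(2 ** attempt)
                continue
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Exhausted {max_retries} retries for {url}")
```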