Using Python's requests library to download files from an external website?

Currently, I’m using

import urllib.request

url = 'http://export.arxiv.org/api/query?search_query=all:' + keywordsToString + '&start=0&max_results=2'
data = urllib.request.urlopen(url)

to download PDFs from ArXiv, which I intend to save to a Media Set. I've got this working on my local computer, but it seems to time out on Foundry. Is there a different way I should be doing this?

Are you currently using external transforms in a Python transform Code Repository?

Some docs here for external transforms, along with code examples:
External transforms

I was running into much the same issue, but with a different media set download. I think it is due to Foundry's network restrictions combined with the way external APIs handle long-running requests. Here are some things I tried:

1/ Handle Large Files Asynchronously

If the request involves downloading large files (e.g., PDFs), Foundry pipelines can handle this more efficiently if you turn the downloads into separate batch processes rather than doing everything in one long-running request.

You could extract metadata from the ArXiv API first and store the PDF URLs in a dataset, then use a separate process to download the PDFs if necessary - see the sketch below.
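For example, here is a minimal sketch of the metadata-first step. It assumes keywords is your search string and relies on the Atom feed format the ArXiv API returns, where each entry carries a link element with title="pdf":

import requests
import xml.etree.ElementTree as ET

# Step 1: fetch metadata only - no PDFs yet
url = 'http://export.arxiv.org/api/query'
params = {'search_query': f'all:{keywords}', 'start': 0, 'max_results': 2}
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()

# Step 2: collect the PDF URL from each Atom entry
ns = {'atom': 'http://www.w3.org/2005/Atom'}
pdf_urls = [
    link.get('href')
    for entry in ET.fromstring(response.text).findall('atom:entry', ns)
    for link in entry.findall('atom:link', ns)
    if link.get('title') == 'pdf'
]

# Step 3: write pdf_urls to a dataset; a separate downstream transform
# can then download each PDF into the media set as its own batch job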

2/ Improve Timeout Settings

If you're tied to using Python scripts within Foundry, the requests library is more robust than urllib for handling timeouts and retries.

Example:

import requests

url = 'http://export.arxiv.org/api/query'
params = {'search_query': f'all:{keywords}', 'start': 0, 'max_results': 2}
# Fail fast with an explicit 10-second timeout instead of hanging indefinitely
response = requests.get(url, params=params, timeout=10)
# Raise an exception on 4xx/5xx responses rather than failing silently
response.raise_for_status()

This way you set the timeout window explicitly, so the request fails fast with a clear error instead of hanging indefinitely, and you can raise the value if the API is just slow.
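And if you want retries on top of the timeout, requests supports mounting an HTTPAdapter with urllib3's Retry policy onto a Session. A minimal sketch (url and params are the variables from the example above; the retry count and backoff factor are just illustrative values):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures up to 3 times with exponential backoff
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get(url, params=params, timeout=10)
response.raise_for_status()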

Sorry I can't be of more help; this is as far as my knowledge reaches on this topic. Hope it provides a clearer picture. Let me know if this helps or if you solved it - I'd love to know what the solution ended up being.


Thank you! A combination of this and the documentation has got it working!

