Handling retries in transform filesystem calls

Hey all,

I have a fairly uninspired Python transform that runs as a lightweight transform. All it does is extract a tgz file that is ~8GB and write the files one-by-one to a Foundry dataset via output.filesystem().open(). The total uncompressed size is ~180GB, and based on the logging and upload rate it looked like it would take about 6 hours to complete, but on both attempts I got the following error roughly 3 hours in:

```
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='localhost', port=8188): Max retries exceeded with url: /foundry-data-sidecar/api/datasets/output_dataset/files/<redacted>.csv/content (Caused by ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))
transforms.api._lightweight - ERROR - Your @lightweight transform (extract_0) has failed with return code 1.
```
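
For reference, the transform is essentially just the following (a simplified sketch: the dataset paths and archive name are placeholders, and I'm assuming here that the archive is read off an input dataset through the same filesystem API):

```python
import shutil
import tarfile

from transforms.api import Input, Output, lightweight, transform


@lightweight()
@transform(
    output=Output("/Project/output_dataset"),  # placeholder paths
    source=Input("/Project/raw_archive"),
)
def extract(output, source):
    # Stream the ~8GB tgz out of the input dataset and copy each member
    # into the output dataset one file at a time.
    with source.filesystem().open("archive.tgz", "rb") as archive, \
            tarfile.open(fileobj=archive, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                with output.filesystem().open(member.name, "wb") as out:
                    shutil.copyfileobj(tar.extractfile(member), out)
```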

Does anyone have recommendations on how to guard against this, or a way to increase the retry count? I'm not sure what the retry strategy is behind the scenes, or even whether this kind of failure is recoverable.
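
For concreteness, the sort of guard I had in mind is a manual retry around each upload, something like the sketch below (untested; upload_with_retry, the attempt count, and the backoff are all made up, and it assumes each member is first extracted to a local temp file so the bytes can be re-read, and that re-opening the same path overwrites any partial write):

```python
import shutil
import time

import requests


def upload_with_retry(fs, dataset_path, local_path, attempts=5, backoff=5.0):
    """Copy a local file into the output dataset (fs = output.filesystem()),
    retrying when the connection to the sidecar drops mid-upload."""
    for attempt in range(1, attempts + 1):
        try:
            with open(local_path, "rb") as src, fs.open(dataset_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            return
        except requests.exceptions.ConnectionError:
            if attempt == attempts:
                raise  # out of attempts; let the build fail
            time.sleep(backoff * attempt)  # crude linear backoff
```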

I ended up working around the problem by spinning up an EC2 instance to un-tar the data and toss it into S3 in ~20 minutes, so I'm fully aware there are better tools for this job; I was just hoping to quickly set and forget the transform.

Thanks!
Pat

I would use a Spark / classic transform, which has no sidecar container, so your writes to the filesystem hit the underlying Foundry filesystem directly.
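
Structurally that's just dropping the @lightweight() decorator, e.g. (same placeholder paths as the sketch above):

```python
from transforms.api import Input, Output, transform


# Same untar-and-copy body as the lightweight sketch above; without
# @lightweight() this runs as a classic transform, so there is no
# localhost:8188 sidecar hop for the file writes to die on.
@transform(
    output=Output("/Project/output_dataset"),  # placeholder paths
    source=Input("/Project/raw_archive"),
)
def extract(output, source):
    ...  # untar-and-copy loop as in the earlier sketch
```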

