I have a dataset that is a response from an API call, with all the data dropped into one cell in a Foundry dataset. I see the frontend will truncate the data for display, but it’s still there. I’m wondering: is data ever just dropped when the scale gets too big? And if so, at what limit will it start dropping data in a cell?
I’m not sure there is a limit. However, if you’re going to process the results with Spark, you might see poor performance or OOM an executor.
I doubt you’ll be joining the raw data to anything, but if you did, shuffling such large rows across executors would take longer than expected. If the rows are large enough, even reading the dataset as a Spark dataframe could cause your build to fail with an OOM’d executor.
A better way to store the raw JSON responses might be to write them out as unstructured files, instead of creating and saving a dataframe, which would generate tabular Parquet files.
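For reference, here’s a minimal sketch of what that could look like with the Foundry transforms Python API, assuming a hypothetical output dataset path and that you already have the raw response text in hand (how you fetch it is out of scope here):

```python
from transforms.api import transform, Output


@transform(
    raw_out=Output("/Project/raw/api_responses"),  # hypothetical output dataset path
)
def store_raw_response(raw_out):
    # However the payload is obtained (requests call, upstream file, etc.),
    # keep it as a plain string rather than packing it into a dataframe cell.
    response_text = '{"example": "very large JSON payload"}'  # placeholder

    # Write it as an unstructured file in the output dataset, so Spark never
    # has to materialize the whole blob as a single row.
    with raw_out.filesystem().open("response.json", "w") as f:
        f.write(response_text)
```

Since the output then contains raw files rather than Parquet, downstream transforms would read it through the dataset’s filesystem instead of as a dataframe.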
Thanks. So in theory, would the limit be some Spark limit then? Like a limit on how much memory an executor can have?
Effectively, yes. But that limit would be in the same range as the limit on the total size of the dataset anyway, and in that case you could get around it by adding executor memory.
But in most situations, if your response is very large, it’s going to be easier to save it as a regular file instead of going through a Spark dataframe.
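If you did want to push that ceiling up, the knob is executor memory. In plain Spark that’s session/job config like the sketch below (in Foundry you’d normally pick a larger Spark profile for the transform instead of building a session by hand); the 8g/2g values are just placeholders:

```python
from pyspark.sql import SparkSession

# Executor memory is a launch-time setting, so it has to be set when the
# job is configured (or via your platform's Spark profile mechanism),
# not changed mid-job.
spark = (
    SparkSession.builder
    .appName("large-row-read")
    .config("spark.executor.memory", "8g")          # placeholder value
    .config("spark.executor.memoryOverhead", "2g")  # headroom for off-heap/shuffle buffers
    .getOrCreate()
)
```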
Got it. Thanks. How would we know if/when we’re reaching the limit? Would our executors start to fail? Like what would be the first sign we need to change things (i.e. flip to saving things as a regular file)?
You should see your builds start to fail with “executor not reachable” errors due to OOMs once you hit that limit. Before that, you’ll likely see performance degradation, especially in joins, since network shuffles of those rows will be expensive.
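One way to keep an eye on it before builds start failing is to track how big the cell payloads are getting. A rough sketch, assuming the dataset is readable as a dataframe `df` with the raw JSON in a string column named `response` (hypothetical name):

```python
import pyspark.sql.functions as F

# df = the dataset read as a Spark dataframe (assumed to exist already)
# Character length per row is a cheap proxy for how large each cell payload is.
df.select(F.length("response").alias("response_chars")).agg(
    F.max("response_chars").alias("max_chars"),
    F.avg("response_chars").alias("avg_chars"),
).show()
```

If the max keeps climbing toward what a single executor can comfortably hold, that’s the cue to switch to raw files.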
Ok thanks. That all makes sense.