I’m trying to understand the memory model of lightweight transforms.
I’m using Polars as the technology (all lazy).
I did some tests with varying input data sizes (I use the dataset size as indicated in the preview or Data Lineage).
To process 1 GB of input data (output = 358 MB) I need an engine with 32 GB: @lightweight(cpu_cores=4, memory_gb=32)
To process 7 GB of input (output = 358 MB) I need an engine with 64 GB: @lightweight(cpu_cores=4, memory_gb=64)
To process 7 GB of input (output = 3.8 GB) I need an engine with 96 GB: @lightweight(cpu_cores=8, memory_gb=96)
Requesting less memory produces an out-of-memory error.
How does this work? And how can I estimate whether I’m at the limit of the requested memory or still in a safe zone?
The ratio of input size to required memory is so strange that it introduces a lot of uncertainty…
It does seem odd to need this sort of resources, especially with lazy Polars. I would be curious to know how it compares if you run such workflows outside of Foundry; lightweight transforms should not add much overhead on top of regular Polars.
Either way, unfortunately we don’t have good visibility into resource utilization for lightweight transforms right now, so your best bet is finding the sweet spot with some tests. This is something we will be working on this year, to provide users with more observability and tools similar to what we have for Spark.
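While that tooling doesn’t exist yet, here is a minimal sketch for instrumenting such test runs, assuming the container runs Linux and that print output from the transform ends up in the build logs (the resource module is part of the Python standard library):

import resource

def log_peak_memory(label: str) -> None:
    # Peak resident set size of this process so far; on Linux ru_maxrss is reported in kilobytes.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{label}] peak RSS so far: {peak_kb / 1024 / 1024:.2f} GB")

# Hypothetical usage inside the transform body:
# log_peak_memory("after read")
# ... joins / aggregations ...
# log_peak_memory("before write")

Comparing the logged peak against the memory_gb you requested gives a rough idea of how much headroom a given run actually had.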
I suspect that part of what makes the estimation hard is that the datasets are stored as compressed Parquet files (https://parquet.apache.org/docs/file-format/data-pages/compression/), compressed with Snappy (https://github.com/google/snappy/blob/main/format_description.txt).
So depending on a dataset’s content, it will expand by a significantly different amount when loaded into memory. @lmartini do you know if there’s a way of seeing a dataset’s uncompressed size somehow, please? Either some dataset property somewhere, or via some transform function? That could help in determining more easily which transforms are suitable for lightweight usage.
Thanks.
P.S. Apparently I can’t include links in my post, so if you want to follow them you’ll have to copy and paste them.
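In case it’s useful: if you can get at the underlying Parquet files themselves (the local paths below are hypothetical; I’m not aware of a dataset property that exposes this directly), the Parquet footer metadata already records compressed and uncompressed sizes per column chunk, so you can estimate the expansion factor without loading any data:

import glob
import pyarrow.parquet as pq

compressed = 0
uncompressed = 0

# Hypothetical local copies of the dataset's underlying Parquet files
for path in glob.glob("downloaded_dataset/*.parquet"):
    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size

print(f"compressed:   {compressed / 1e9:.2f} GB")
print(f"uncompressed: {uncompressed / 1e9:.2f} GB (~{uncompressed / compressed:.1f}x)")

Note that the uncompressed Parquet size is still not the in-memory size (encodings such as dictionary and RLE are also undone when the data is loaded), but it should be a better proxy than the compressed size shown in the dataset preview.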
from foundry.transforms import Dataset
import polars as pl

# Read the dataset as an in-memory Polars DataFrame
the_df = Dataset.get("the_datasets_of_4dot5GB_parquet").read_table(format="polars")
And luckily for us, Polars has the function polars.DataFrame.estimated_size():
https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.estimated_size.html
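Continuing from the snippet above:

# Rough in-memory footprint of the loaded DataFrame, in gigabytes
print(the_df.estimated_size("gb"))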
I get roughly 14 GB: so my 4.5 GB of Parquet inflates to 14 GB in Polars. I will now check my inputs to validate the sizes.
I suspect that, because Polars is an external library (written in Rust), each join has by-value semantics that blows up the memory.
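A small standalone sketch (synthetic data, plain Polars outside Foundry) of how estimated_size() can make that visible: an eager join materializes a new result frame, so at peak you hold the inputs plus the output at the same time.

import polars as pl

n = 2_000_000
left = pl.DataFrame({"key": range(n), "a": range(n)})
right = pl.DataFrame({"key": range(n), "b": range(n)})

# The joined frame is a fresh materialized result next to both inputs,
# so peak memory is roughly inputs + output.
joined = left.join(right, on="key")

for name, df in [("left", left), ("right", right), ("joined", joined)]:
    print(f"{name:>6}: {df.estimated_size('mb'):.1f} MB")

If the lazy plan cannot keep things streaming, several such intermediates can be alive at once, which would fit the input-to-memory ratios above. Depending on the Polars version, collecting a LazyFrame with the streaming engine (e.g. .collect(streaming=True)) may lower the peak, but I haven’t verified how that behaves inside lightweight transforms.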