I have 6 relatively large datasets (5 smaller ones at 100-400GB each, and 1 bigger one at 2.6TB), ~50B rows in total. I do several steps of cleaning and filtering, after which the dataset is down to ~20B rows, before I do aggregations. I'd like to dig deeper into my aggregations and use preview. However, even when I sample the datasets, my preview either breaks or takes unreasonably long. I tried both capping to 500 rows per dataset and sampling 0.02% of each of them; in both cases I hit the same issue.
Is there an alternative to just creating a separate pipeline that samples my datasets, testing the pipeline against that, and swapping the inputs back when I want to run the full build?
Hey @nsns, when you say you tried capping to 500 rows, do you actually filter the datasets down to 500 rows (e.g. as a first transform) and work from there, or are you just using the preview sampling strategies?
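Something like this rough sketch is what I mean by the first option, assuming you're working in Python transforms (the paths, dataset names, and the sample fraction are just placeholders, not your actual pipeline):

```python
from transforms.api import transform_df, Input, Output

# Sketch only: hard-cap one input as the very first transform so every
# downstream step only ever sees ~500 rows from this dataset.
@transform_df(
    Output("/project/scratch/events_capped"),  # placeholder output path
    events=Input("/project/raw/events"),       # placeholder input path
)
def compute(events):
    # Either a hard row cap or a tiny fraction sample would work here:
    return events.limit(500)
    # return events.sample(fraction=0.0002)
```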
Rearranging some filters and adding another null-check filter did help, and the preview now computes in 30-60 seconds. Still, if the input really were only 500 rows from each dataset, those extra filters shouldn't make much of a difference, given how small the data at the input should be.
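For illustration, the kind of reordering I mean, with the cheap null-check pulled ahead of the heavier filters (PySpark sketch only, column names are placeholders):

```python
from pyspark.sql import functions as F

def clean(df):
    # Sketch only: cheap null-checks first, heavier business-rule
    # filters afterwards, so the later steps operate on fewer rows.
    return (
        df
        .filter(F.col("user_id").isNotNull())   # the added null-check
        .filter(F.col("event_ts").isNotNull())
        .filter(F.col("amount") > 0)             # heavier filters moved after
    )
```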