I have 6 relatively large datasets (5 smaller ones at 100-400GB each, and 1 bigger one at 2.6TB), ~50B rows in total. I do several steps of cleaning and filtering, after which the dataset is down to ~20B rows, before I do aggregations. I'd like to dig deeper into my aggregations and use preview. However, even when I sample the datasets, my preview either breaks or takes unreasonably long. I tried both capping to 500 rows per dataset and sampling 0.02% of each of them; in both cases I hit the same issue.
Is there an alternative to just creating a separate pipeline that samples my datasets, testing the pipeline against that, and swapping the inputs back when I want to run the full build?
Hey @nsns, when you say you tried capping to 500 rows, do you actually filter the datasets down to 500 rows (e.g. as a first transform) and work from there, or are you just using the preview sampling strategies?
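Something like this rough sketch is what I mean by the first option, assuming you're working in Python transforms (the paths, dataset names, and the sample fraction are just placeholders, not your actual pipeline):

```python
from transforms.api import transform_df, Input, Output

# Sketch only: hard-cap one input as the very first transform so every
# downstream step only ever sees ~500 rows from this dataset.
@transform_df(
    Output("/project/scratch/events_capped"),  # placeholder output path
    events=Input("/project/raw/events"),       # placeholder input path
)
def compute(events):
    # Either a hard row cap or a tiny fraction sample would work here:
    return events.limit(500)
    # return events.sample(fraction=0.0002)
```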
Rearranging some filters and adding another null-check filter did help, and the preview now computes in 30-60 seconds. Still, if the input really were only 500 rows from each dataset, those extra filters shouldn't make much of a difference, given how small the data at the input should be.
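For illustration, the kind of reordering I mean, with the cheap null-check pulled ahead of the heavier filters (PySpark sketch only, column names are placeholders):

```python
from pyspark.sql import functions as F

def clean(df):
    # Sketch only: cheap null-checks first, heavier business-rule
    # filters afterwards, so the later steps operate on fewer rows.
    return (
        df
        .filter(F.col("user_id").isNotNull())   # the added null-check
        .filter(F.col("event_ts").isNotNull())
        .filter(F.col("amount") > 0)             # heavier filters moved after
    )
```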