How to use Virtual Tables for a Very Large Dataset

I was wondering whether you can load a filtered view of a virtual table, rather than pulling the full table into transformations? We have a very large table that breaks preview when it tries to load, so I haven't been able to confirm whether the connection is set up correctly either. I'm trying to use the Split feature in Pipeline Builder, but that also relies on preview working, so it isn't loading either.

In general, it would be great if we could load only a filtered view (similar to incremental in Data Connection) while still using virtual tables, because the data changes often, including historic data, and we only need the last 3 months.

Hey Zoe!

This largely depends on which upstream source system the virtual table points at.
If the connection is JDBC-based (such as Snowflake), much of the query can be pushed down to the upstream source system, so most of the data is filtered out at the source before it's ever loaded (assuming your query contains a filter).
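As a rough illustration in plain PySpark (the virtual-table machinery differs, but pushdown behaves similarly), a DataFrame filter on a JDBC read is pushed into the generated SQL, so only the matching rows leave the source. The connection options and table name below are placeholders, and whether a given expression pushes down depends on the source and Spark version:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection details -- replace with your source's options.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:snowflake://<account>.snowflakecomputing.com")
    .option("dbtable", "ANALYTICS.EVENTS")  # placeholder table
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Spark constant-folds the date expression and pushes the resulting
# predicate into the SQL sent to Snowflake, so only ~3 months of rows
# come back instead of the full table.
recent = df.filter(F.col("event_date") >= F.add_months(F.current_date(), -3))
```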
For raw Parquet files, however, filter pushdown may or may not be possible depending on how the files are partitioned, which is why preview can fail on large datasets. A sketch of the difference is below.
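Here's what that looks like in plain PySpark, with a made-up path, layout, and column names: filtering on a partition column lets Spark skip whole directories, while filtering on anything else still requires touching every file:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assume the files are laid out as .../events/event_month=2024-10/..., etc.
df = spark.read.parquet("s3://bucket/events")  # placeholder path

# Partition-column filter: Spark prunes every directory except the
# matching months, so only ~3 months of data are actually read.
pruned = df.filter(F.col("event_month") >= "2024-10")

# Non-partition-column filter: every file still has to be opened;
# Parquet row-group statistics help somewhat, but far less than pruning.
scanned = df.filter(F.col("user_id") == "abc123")
```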

You could also get a better idea of what's going on by looking at the Spark query plan from an actual build.
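For instance, calling `explain()` on the filtered DataFrame (e.g. `recent` from the JDBC sketch above) prints the physical plan, where you can check whether the filter actually reached the source:

```python
# Look for "PushedFilters" (JDBC) or "PartitionFilters" (Parquet) in the
# scan node to confirm the predicate was pushed down rather than applied
# after a full read.
recent.explain(mode="formatted")
```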

If you could share the upstream type, I could give more specifics about how we handle that case.