I have two datasets. One is a small set of timeseries ids (1k-15k rows) and the other is an 8.3-trillion-row timeseries dataset of 50+ TB. The timeseries dataset has a soho projection associated with it since it is indexed as a timeseries. My goal is to perform an inner join of the two to extract just the timeseries whose ids are stored in the smaller dataset.
When performing an inner join between these datasets, Spark seems to batch-scan the entire 50 TB dataset, which takes an extremely long time and is very costly.
What's the most efficient way to perform this operation?
Have you tried adding the 'broadcast' hint to the join query for the smaller dataset? Hopefully Spark is already doing that automatically (it does when the small side is below spark.sql.autoBroadcastJoinThreshold), but it doesn't always, and it helps when the smaller dataset is small enough to be sent to every node.
More complex options exist as well: https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints-for-sql-queries
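For reference, a minimal PySpark sketch of the hint. The DataFrames and the `series_id` column are toy placeholders standing in for your real inputs, not names from your pipeline:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins: `ids_df` plays the small id dataset, `ts_df` the 50+ TB
# timeseries dataset. The `series_id` column name is an assumption.
ids_df = spark.createDataFrame([("a",), ("b",)], ["series_id"])
ts_df = spark.createDataFrame([("a", 1.0), ("b", 2.0), ("c", 3.0)],
                              ["series_id", "value"])

# The broadcast hint forces a broadcast hash join: the small side is shipped
# to every executor, avoiding a shuffle of the large side.
result = ts_df.join(F.broadcast(ids_df), on="series_id", how="inner")
result.show()
```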
Yes, it looks like Spark is already doing that automatically, then.
I don't know how much control you have here, but is it possible to re-partition the larger dataset in such a way that Spark doesn't have to load all of it? Alternatively, you could split the work into separate steps, perform the inner join in each, and then union the results together. That is more annoying, though, since you end up with multiple transforms, even if you can template them and have a pipeline generate them automatically. A rough sketch of the pruning idea is below.
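For what it's worth, here is one way the pruning idea can look in PySpark, again with placeholder names (`ids_df`, `ts_df`, `series_id`). Whether the filter actually prunes anything depends on the large dataset being partitioned or indexed on that column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same toy stand-ins as above; `series_id` remains an assumed column name.
ids_df = spark.createDataFrame([("a",), ("b",)], ["series_id"])
ts_df = spark.createDataFrame([("a", 1.0), ("b", 2.0), ("c", 3.0)],
                              ["series_id", "value"])

# The id list is tiny (1k-15k values), so collecting it to the driver is cheap.
wanted = [row["series_id"] for row in
          ids_df.select("series_id").distinct().collect()]

# An IN-list predicate on the partition/index column gives the scan a chance
# to skip files that contain none of the requested series, instead of reading
# all 50 TB and joining afterwards.
filtered = ts_df.filter(F.col("series_id").isin(wanted))
filtered.show()
```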