I have two datasets. One is a small set of timeseries ids (1k-15k rows) and the other is an 8.3-trillion-row timeseries dataset of 50+ TB. The timeseries dataset has a soho projection associated with it since it is indexed as a timeseries. My goal is to perform an inner join of the two to extract just the timeseries whose ids are stored in the smaller dataset.
When performing an inner join between these datasets, Spark seems to batch-scan the entire 50 TB dataset, which takes an extremely long time and is very costly.
What's the most efficient way to perform this operation?
Have you tried adding the 'broadcast' hint to the join query for the smaller dataset? Hopefully Spark is already doing that automatically (it does when the small side is below spark.sql.autoBroadcastJoinThreshold), but it doesn't always, and it helps when the smaller dataset is small enough to be sent to every node.
More complex options exist as well: https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints-for-sql-queries
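For reference, a minimal PySpark sketch of the hint. The DataFrames and the `series_id` column are toy placeholders standing in for your real inputs, not names from your pipeline:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins: `ids_df` plays the small id dataset, `ts_df` the 50+ TB
# timeseries dataset. The `series_id` column name is an assumption.
ids_df = spark.createDataFrame([("a",), ("b",)], ["series_id"])
ts_df = spark.createDataFrame([("a", 1.0), ("b", 2.0), ("c", 3.0)],
                              ["series_id", "value"])

# The broadcast hint forces a broadcast hash join: the small side is shipped
# to every executor, avoiding a shuffle of the large side.
result = ts_df.join(F.broadcast(ids_df), on="series_id", how="inner")
result.show()
```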
Yes, it looks like Spark is already doing that automatically, then.
I don't know how much control you have here, but is it possible to re-partition the larger dataset in such a way that Spark doesn't have to load all of it? Alternatively, you could split the work into separate steps, perform the inner join in each, and then union the results together. That is more annoying, though, since you end up with multiple transforms, even if you can template them and have a pipeline generate them automatically. A rough sketch of the pruning idea is below.
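For what it's worth, here is one way the pruning idea can look in PySpark, again with placeholder names (`ids_df`, `ts_df`, `series_id`). Whether the filter actually prunes anything depends on the large dataset being partitioned or indexed on that column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same toy stand-ins as above; `series_id` remains an assumed column name.
ids_df = spark.createDataFrame([("a",), ("b",)], ["series_id"])
ts_df = spark.createDataFrame([("a", 1.0), ("b", 2.0), ("c", 3.0)],
                              ["series_id", "value"])

# The id list is tiny (1k-15k values), so collecting it to the driver is cheap.
wanted = [row["series_id"] for row in
          ids_df.select("series_id").distinct().collect()]

# An IN-list predicate on the partition/index column gives the scan a chance
# to skip files that contain none of the requested series, instead of reading
# all 50 TB and joining afterwards.
filtered = ts_df.filter(F.col("series_id").isin(wanted))
filtered.show()
```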