- What are the tradeoffs between performance, storage space, compute cost, etc. in having one object with 250M rows and 250 columns that multiple use cases pull from, versus tens of objects with 5M rows and tens of columns each, consisting of some subset of those hundreds of columns plus additional columns?
- If performing double aggregates on columns (e.g. summing col X grouped by Y, then taking the max of the summed X), what are the top design options? Storing the summed X in a column so Workshop/the frontend only computes the last aggregation step, using repo functions for both aggregations, or computing it fully in Pipeline Builder/backend transforms into an object? My gut instinct is that if the frontend interactions don't change the computation, it should be computed in the backend and stored on an object (see the sketch after these questions).
- How do the answers to these questions change based on how often the backing dataset of the object is updated?
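For concreteness, here is a minimal PySpark sketch of the "compute fully in the backend" option I have in mind, assuming the measure column is `x` and the grouping key is `y` (both placeholders; the surrounding transform boilerplate would depend on the setup). Materialising `summed` as the object's backing dataset and letting Workshop take the max would be the intermediate option.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy input: x is the measure, y is the grouping key (placeholder names).
df = spark.createDataFrame(
    [("a", 10), ("a", 5), ("b", 20), ("b", 1)],
    ["y", "x"],
)

# Step 1: sum x per group y.
summed = df.groupBy("y").agg(F.sum("x").alias("sum_x"))

# Step 2: take the max of the per-group sums.
result = summed.agg(F.max("sum_x").alias("max_sum_x"))

result.show()
# +---------+
# |max_sum_x|
# +---------+
# |       21|
# +---------+
```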
Just to be clear: would you be sharding the object type, or would these additional types be the result of normalisation?
My 2¢
- What is the requirement for timeliness? Large, single entities can take a while to update.
- Does the information across the (potential) multiple object types update asynchronously? If so, normalisation can make sense to split the load.
- How much of the data is consumed in the use case? If it's only a historical aggregate + the last 30 days, you can consider an incremental approach with a metadata object. Two columns with the aggregate + the number of rows aggregated over are much quicker to bring up to date than re-rolling the entire thing every time (see the sketch after this list).
- Profile the workload – if no use case is using a large part of these columns, a split makes sense.
- Performance: Some caching, compression and dictionary reuse is lost when working with extremely large objects. That said, for simple aggregations, Workshop’s front end transforms can be surprisingly quick. TS aggregations over large datasets often run into memory issues.
- Pricing: I believe that in sparse tables, nulls are still stored in source and metadata, which could end up costing more. You usually also pay for deserialisation of the columns you’re not using. Compute would likely be 2-3x higher, as each write touches many more columns.
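To make the incremental metadata idea concrete, here is a rough PySpark sketch (all dataset and column names are illustrative, not any specific Foundry API): fold each new batch into a running sum + row count per group instead of re-aggregating the full history.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Existing metadata object: one row per group with the running sum and the
# number of rows already folded into it (column names are illustrative).
meta = spark.createDataFrame(
    [("a", 1500.0, 300), ("b", 900.0, 120)],
    ["y", "agg_sum_x", "rows_aggregated"],
)

# New batch (e.g. the last day's rows) - the only data that needs scanning.
batch = spark.createDataFrame(
    [("a", 10.0), ("a", 5.0), ("c", 7.0)],
    ["y", "x"],
)

batch_agg = batch.groupBy("y").agg(
    F.sum("x").alias("batch_sum"),
    F.count("*").alias("batch_rows"),
)

# Fold the batch into the running totals; groups new in the batch are created.
updated = (
    meta.join(batch_agg, on="y", how="full_outer")
    .select(
        "y",
        (F.coalesce("agg_sum_x", F.lit(0.0))
         + F.coalesce("batch_sum", F.lit(0.0))).alias("agg_sum_x"),
        (F.coalesce("rows_aggregated", F.lit(0))
         + F.coalesce("batch_rows", F.lit(0))).alias("rows_aggregated"),
    )
)

updated.show()
```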
It’s very hard to provide general rules about this, as normalising too much has its own problems (e.g. duplication issues), and handling 10 objects is easier than handling 100.
If you’re looking for something more specific, some details on your use case would be helpful.