Currently I have two big datasets of around 10 TB each:
- One of them is historical data that we don't process anymore, though we keep all of its downstream datasets.
- The second runs incrementally: every day we append a new day of data.
I would like to do the following:
- The first, historical dataset I would like to archive as cheaply as possible. This data will likely never be accessed again; worst case, we delete it outright. Given its size, though, I'd like to know whether just pressing "delete" works or whether there is some other concern (a rough sketch of the cold-storage archival I have in mind is just after this list).
- The second dataset I would like to partially archive, or reduce its size somehow. In practice we never access the old data: we only read the current day's increment to aggregate downstream, and we rarely (if ever) reprocess history, so making it lighter would help (see the second sketch at the end of the post).
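
For concreteness, here is a minimal sketch of what I mean by archiving the first dataset, assuming it lives under a single S3 prefix (bucket and prefix names are hypothetical; other object stores have equivalent tiering features). A lifecycle rule pushes everything to the cheapest storage tier, and the commented-out expiration would also cover the "just press delete" case, since S3 then removes the objects for us rather than us listing and deleting millions of keys by hand:

```python
import boto3

# Hypothetical names; assumes the historical dataset sits under one S3 prefix.
BUCKET = "my-data-lake"
PREFIX = "historical/"

s3 = boto3.client("s3")

# Transition everything under PREFIX to Glacier Deep Archive (the cheapest
# S3 tier, with roughly 12-hour retrieval) as soon as the rule takes effect.
# Note: this call replaces any existing lifecycle configuration on the bucket.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-historical",
                "Filter": {"Prefix": PREFIX},
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "DEEP_ARCHIVE"}],
                # "Expiration": {"Days": 365},  # uncomment if we later decide to delete
            }
        ]
    },
)
```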
Are there any ideas on how we can tackle these problems?
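
For reference, this is the rough shape of the partition-expiry idea for the second dataset, assuming Hive-style dt=YYYY-MM-DD partitions in S3 (again, all names are hypothetical). Since we only ever append, an age-based lifecycle rule like the one above would achieve the same thing without a script; this version just makes the retention window explicit:

```python
import re
from datetime import date, timedelta

import boto3

BUCKET = "my-data-lake"        # hypothetical
PREFIX = "incremental/"        # hypothetical, containing dt=YYYY-MM-DD/ partitions
RETENTION_DAYS = 90            # keep only the recent window "hot"

cutoff = date.today() - timedelta(days=RETENTION_DAYS)
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Walk the date partitions and drop anything older than the cutoff.
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
    for cp in page.get("CommonPrefixes", []):
        partition = cp["Prefix"]  # e.g. "incremental/dt=2023-01-31/"
        m = re.search(r"dt=(\d{4}-\d{2}-\d{2})/", partition)
        if m and date.fromisoformat(m.group(1)) < cutoff:
            # Delete every object in the expired partition, one page
            # (max 1000 keys, which matches the delete_objects limit) at a time.
            for obj_page in paginator.paginate(Bucket=BUCKET, Prefix=partition):
                keys = [{"Key": o["Key"]} for o in obj_page.get("Contents", [])]
                if keys:
                    s3.delete_objects(Bucket=BUCKET, Delete={"Objects": keys})
```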