I’ve recently noticed that Iceberg tables are available on the platform (currently in Beta) and have a few questions, as the documentation isn’t entirely clear.
Cost considerations: How does the cost of using Iceberg tables compare to regular datasets, especially for large datasets (~30 TB)?
Storage and setup: The setup options are a bit confusing (regular vs. virtual tables, using our own S3, and the upcoming native Foundry storage support). Could you clarify the available configurations and any recommended best practices?
Incremental transforms: From what I understand, it’s possible to run incremental transforms in code repositories that read PySpark datasets and write to Iceberg tables. Is that correct? Are there any limitations or recommendations for handling datasets of this size?
Re cost considerations
Storage - Iceberg stores data in Parquet files, just like traditional Foundry catalog datasets, so overall data size is very similar. Iceberg adds a small amount of storage overhead via its table metadata files, but these are typically tiny relative to the underlying data, especially for larger tables.
Compute - queries on Iceberg tables are often faster and more scalable, especially as data volume and complexity grow. This can translate into lower compute costs.
Iceberg provides built-in maintenance procedures which can add optimizations for both performance and storage. For example:
Compaction: Rewrites many small Parquet files into fewer, larger ones, improving query performance (especially for pipelines that produce many small files).
Orphan file cleanup: Removes unused files that are no longer referenced by the table, reducing storage usage.
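To make the two maintenance procedures above concrete, here is a minimal sketch of how they are invoked as Spark SQL CALL statements against an Iceberg catalog. The catalog name (`foundry`) and table name (`analytics.events`) are placeholder examples, not identifiers from this thread; the procedure names themselves are standard Iceberg procedures.

```python
# Sketch: Iceberg built-in maintenance procedures, expressed as Spark SQL
# CALL statements. Catalog/table names below are hypothetical placeholders.

def compaction_sql(table: str) -> str:
    # rewrite_data_files is Iceberg's compaction procedure: it rewrites
    # many small Parquet data files into fewer, larger ones.
    return f"CALL foundry.system.rewrite_data_files(table => '{table}')"

def orphan_cleanup_sql(table: str) -> str:
    # remove_orphan_files deletes files no longer referenced by any
    # table snapshot, reclaiming storage.
    return f"CALL foundry.system.remove_orphan_files(table => '{table}')"

# In a Spark session with an Iceberg catalog named "foundry", you would run:
#   spark.sql(compaction_sql("analytics.events"))
#   spark.sql(orphan_cleanup_sql("analytics.events"))
```

Both procedures can be scheduled as routine table maintenance; compaction mainly helps query performance, while orphan cleanup mainly reduces storage.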
Re storage and setup
You can use Iceberg with virtual tables (where you have external storage & an external Iceberg catalog) or managed tables (where you use Foundry’s Iceberg catalog, and either external or Foundry-managed storage).
If you are looking to replace existing Foundry datasets with Iceberg tables, "managed Iceberg" (i.e. using Foundry's Iceberg catalog) is most likely what you're after. For the storage behind managed tables, we currently support in Beta:
Bring-your-own-bucket, where you configure your own storage bucket in S3 or ABFS. This is the better option if you want to own and have full control over the storage bucket, for example for maximum interoperability with other tools.
Foundry-native managed storage, which is newly available in early access. This is the better option if you don't want to create and manage your own storage bucket separate from Foundry.
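On the interoperability point: external engines typically connect to an Iceberg table through its catalog. As a sketch (not Foundry-specific documentation), these are the standard Spark properties for attaching an Iceberg REST catalog; the catalog name (`foundry`) and the endpoint URI are placeholder assumptions you would replace with your real values.

```python
# Sketch: Spark session properties for attaching an Iceberg REST catalog
# from an external Spark cluster. The property keys are standard
# Iceberg/Spark settings; the name and URI values are placeholders.

def iceberg_rest_catalog_conf(name: str, uri: str) -> dict:
    return {
        f"spark.sql.catalog.{name}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{name}.type": "rest",  # REST catalog implementation
        f"spark.sql.catalog.{name}.uri": uri,      # catalog endpoint (placeholder)
    }

conf = iceberg_rest_catalog_conf("foundry", "https://example.com/iceberg")
# Applied when building the SparkSession, e.g.:
#   builder = SparkSession.builder
#   for k, v in conf.items():
#       builder = builder.config(k, v)
```

Authentication settings vary by catalog and are omitted here.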
Re Incremental transforms
You can set up an incremental identity transform to read from a traditional Foundry dataset and write to an Iceberg table. Note that to preserve the full history in Iceberg, you would need to run a first full snapshot, which is a more resource-intensive operation; from there, running incrementally should be no issue.
Please do also feel free to reach out via your Palantir team for a more detailed conversation.