Iceberg virtual tables limitation

We would like to use Apache Iceberg inside AWS with the Glue Data Catalog, and we have a question. Can you confirm whether the following is true: if virtual tables are out because they only support append-only transactions (for now), and BYOB means AWS storage plus a Foundry-managed catalog, will the Iceberg data really be usable by other AWS services like Redshift while Foundry manages the catalog? The decision then just becomes one of storage cost efficiency: BYOB vs. Foundry. I don’t have a definitive answer for this, but that’s my understanding of the Foundry docs.

I don’t fully understand your concerns, but I’ll try to answer.

There are two layers: storage and catalog.

If you use Foundry-managed storage and the Foundry Iceberg REST catalog (IRC), you won’t be able (today) to access your Iceberg tables from outside (e.g., AWS).

Since data access from AWS is a requirement for you, you will need to go for BYOB and also deactivate double encryption (as only Foundry supports this piece of the Iceberg spec today).

Whether you use Foundry IRC or Glue as a catalog depends on your setup and requirements. You could also use a mix of both.

Foundry IRC-managed Iceberg tables are fully integrated into Foundry: when your users click New → Table, they directly get an Iceberg table, without needing access to the underlying Source.

With externally managed Iceberg tables, users need to go to the Source and have access to it in order to create tables.

Thank you @nicornk for your answer.

Our goal is to use the Glue Data Catalog as our Iceberg catalog within the AWS world, so we can share the underlying tables with other teams or services such as Redshift.

But while reading the Foundry docs, we saw that the only option for us is virtual tables; unfortunately, they only support incremental (append-only) transactions and no other transaction type.

Do you know of an example implementation of the BYOB option?

How would you use a mix of both, Foundry IRC and Glue, as you mentioned in your reply?

Thanks

Fundamentally, there is no difference between using a “virtual” Iceberg table and a Foundry Iceberg table in transforms.

In both cases you can use the “changelog” mode in incremental transforms, which will expose the underlying Iceberg view to you, or you can call the Iceberg procedures yourself.

https://palantir.com/docs/foundry/iceberg/changelog-code-examples/
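For illustration, here is a minimal sketch of calling the changelog procedure yourself from a Spark transform. It assumes a Spark transform context ctx; the catalog name, table name, and snapshot IDs are placeholders rather than the exact Foundry identifiers, so check the linked docs for the supported API.

# Create a changelog view exposing INSERT / UPDATE / DELETE rows between two
# snapshots, then read it back. All identifiers below are placeholders.
ctx.spark_session.sql("""
    CALL my_catalog.system.create_changelog_view(
        table => 'my_namespace.my_table',
        options => map('start-snapshot-id', '<start-id>', 'end-snapshot-id', '<end-id>'),
        changelog_view => 'my_table_changes'
    )
""")
changes_df = ctx.spark_session.sql("SELECT * FROM my_table_changes")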

Be careful: efficient CDC without full table scans is only supported since Iceberg v3.

Thank you @nicornk, this is very helpful — especially the clarification that virtual and managed Iceberg tables behave the same way in transforms with changelog mode.

A few follow-up questions based on our discussions with the Foundry platform team:

On the hybrid approach (Foundry IRC + Glue): You mentioned we could use a mix of both catalogs. In our case, we need some tables to be accessible from AWS Redshift while also benefiting from Foundry’s branching. Has anyone implemented this hybrid pattern in practice? Specifically, is there a recommended way to handle a table that needs both Foundry branching for development workflows and external access from Redshift?

One idea we’re exploring is using Foundry IRC as the primary catalog with BYOB, and then leveraging AWS Glue catalog federation to federate Foundry’s REST catalog into Glue - so Redshift queries go through Glue but the source of truth remains Foundry IRC. Has anyone tried this path?

On double encryption: You mentioned we need to deactivate double encryption for external services to read the data. Is this a per-table setting or an enrollment-wide configuration? And are there any other Iceberg spec features that Foundry uses which might cause compatibility issues with Redshift or other engines reading the files directly from S3?

On Iceberg version: You mentioned efficient CDC without full table scans requires Iceberg v3. Which Iceberg spec version does Foundry currently write for managed and virtual tables? And is there anything we need to configure to ensure v3 is used?

On BYOB setup: The platform team mentioned that BYOB setup requires working with Palantir support. Do you know of any existing documentation or reference architecture for the BYOB + external catalog access pattern? Any gotchas we should be aware of during setup?

Thanks again for your time - this community input has been incredibly useful for our migration planning.

One idea we’re exploring is using Foundry IRC as the primary catalog with BYOB, and then leveraging AWS Glue catalog federation to federate Foundry’s REST catalog into Glue - so Redshift queries go through Glue but the source of truth remains Foundry IRC. Has anyone tried this path?

I haven’t tried this, as our Foundry IRC endpoint is not open to the internet and thus not reachable from Glue. One thing to watch out for is that Glue federation seems to require IAM access to the Iceberg table bucket (“Create an IAM role that Lake Formation can use to vend credentials and attach permission on S3 bucket prefixes where the Iceberg tables are stored.” [1]). I would have liked it if Glue could pass through the vended credentials instead of vending its own. If you can live with these limitations, it should work.

On double encryption: You mentioned we need to deactivate double encryption for external services to read the data. Is this a per-table setting or an enrollment-wide configuration?

This can be configured in Control Panel at the namespace or project level.

On Iceberg version: You mentioned efficient CDC without full table scans requires Iceberg v3. Which Iceberg spec version does Foundry currently write for managed and virtual tables? And is there anything we need to configure to ensure v3 is used?

To my understanding, Foundry supports v3, but I have not tested it and am not sure whether special settings are required. In general, Foundry stays close to the Spark-Iceberg standards. Maybe someone from the Palantir side can chime in.

Do you know of any existing documentation or reference architecture for the BYOB + external catalog access pattern? Any gotchas we should be aware of during setup?

As long as you can configure an S3 Source to your bucket with the required permissions, this should work. You then register this S3 Source for a specific namespace or project in Control Panel.

[1] https://aws.amazon.com/blogs/big-data/introducing-catalog-federation-for-apache-iceberg-tables-in-the-aws-glue-data-catalog/

One potential limitation: it appears that syncing Iceberg virtual tables from S3 only supports APPEND and not changelog (insert, update, and delete); see https://www.palantir.com/docs/foundry/available-connectors/amazon-s3/#virtual-tables

Or perhaps the docs are slightly outdated?

Iceberg versions: Foundry supports both v2 and v3. v2 is currently the default when writing tables without client-side encryption enabled, and v3 is the default when writing tables with client-side encryption enabled. If you have client-side encryption disabled (e.g. for better interoperability with third-party tools), you can optionally write tables with v3 instead of v2 by specifying the table version in your create table request from external tools. We don’t currently allow you to choose the version when writing inside Foundry using Transforms or Pipeline Builder.
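For example, a rough sketch of specifying the table version in a create table request from an external tool, here using PyIceberg (the endpoint, token, and table names are placeholders, not Foundry-documented values):

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Placeholder IRC endpoint and credentials; substitute your enrollment's values.
catalog = load_catalog(
    "foundry",
    uri="https://<enrollment>/iceberg",
    token="<api-token>",
)

# Request spec v3 explicitly at create time via the table property.
catalog.create_table(
    "my_namespace.my_table",
    schema=pa.schema([("id", pa.int64()), ("phrase", pa.string())]),
    properties={"format-version": "3"},
)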

Foundry IRC BYOB + Glue catalog federation (Redshift queries through Glue): From our perspective this should work. You can register Foundry’s Iceberg catalog with Glue. I’m not sure if Glue supports Redshift queries with catalog federation (question for AWS).

Full CDC vs append-only incremental support: You can create CDC pipelines in Foundry using Python Transforms on both Foundry-catalog managed Iceberg tables and virtual Iceberg tables. For ingesting data via a Data Connection sync into Foundry, we don’t yet support syncs into Iceberg (this is under development), so those would typically be append-only into datasets. As an alternative, you could write an external transform to write directly from the source into an Iceberg table in Foundry in a CDC manner.
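As a rough illustration of that last alternative, a CDC-style write from an external process via PyIceberg could look like the sketch below (endpoint and table names are placeholders, and Table.upsert needs a recent PyIceberg release, roughly 0.9 or newer):

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Placeholder endpoint and credentials for the Foundry Iceberg REST catalog.
catalog = load_catalog("foundry", uri="https://<enrollment>/iceberg", token="<api-token>")
table = catalog.load_table("my_namespace.my_table")

# Changed rows pulled from the source system (illustrative data).
changes = pa.table({"id": [1, 2], "status": ["updated", "new"]})

# Upsert keyed on "id": matching rows are updated, new rows are inserted.
table.upsert(changes, join_cols=["id"])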

Enabling Data Connection syncs to use full CDC into Iceberg could be a huge value unlock, especially if combined with automatic table registration. Thank you for weighing in!

Hi all,

Following up on our earlier discussion about Iceberg migration with BYOB, we’ve hit a concrete limitation while building our pipelines.

The issue:

When using @lightweight with Polars and writing to an Iceberg table output in incremental mode, we get:

NotImplementedError: Table outputs are not yet supported in lightweight incremental builds

The full traceback points to _table_output.py line 421 in get_incremental, which explicitly raises this as not yet implemented.

Our context:

We’re migrating from Foundry datasets to Iceberg tables (BYOB + Foundry IRC setup, as discussed in this thread). Many of our pipelines use @lightweight with Polars for performance, and they run in incremental mode to process only new rows on each build. Moving to Iceberg as the output format breaks these pipelines because of this limitation.

What we’ve considered so far:

  1. Switch to Spark-backed transforms for any pipeline writing to Iceberg tables incrementally. This works but we lose the Polars performance advantage and the lower compute cost of lightweight.

  2. Keep lightweight but write to a dataset output first, then sync to Iceberg in a separate step. This adds pipeline complexity and an extra hop.

  3. Keep lightweight but drop incremental (run full snapshots). Not viable for our larger tables.

Questions for the community:

  1. Has anyone found a workaround for incremental Iceberg table writes in lightweight? For example, is there a way to manually handle the incremental logic (reading only new partitions, appending to an Iceberg table) within a non-incremental lightweight transform?

  2. @sgershkon — is there a timeline or roadmap for supporting Iceberg table outputs in lightweight incremental builds? Given the push toward Iceberg as the default table format, this feels like a gap that will affect many teams migrating from datasets.

  3. For those of you running BYOB + Iceberg in production: what does your pipeline architecture look like? Are you using Spark for all Iceberg writes, or have you found ways to mix lightweight and Spark transforms effectively?

  4. One more thought — would it be possible to use PyIceberg directly within a lightweight transform to manually append Iceberg data files, bypassing the transforms framework’s table output? Something like loading the table via the Foundry Iceberg REST catalog, writing new Parquet files, and committing them as a new snapshot (see the sketch after this list). Has anyone tried this approach?
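A rough sketch of the idea, with placeholder endpoint, token, and table names (untested against the Foundry IRC):

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Placeholder endpoint and credentials; untested against Foundry's IRC.
catalog = load_catalog("foundry", uri="https://<enrollment>/iceberg", token="<api-token>")
table = catalog.load_table("my_namespace.my_table")

# New rows computed by the lightweight transform (illustrative data).
new_rows = pa.table({"id": [101, 102], "phrase": ["hello", "world"]})

# append() writes new Parquet data files and commits them as a new snapshot.
table.append(new_rows)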

Thanks again for the help — this community thread has been incredibly valuable for our migration planning.

Hey, definitely hear you on needing single-node incremental support for Iceberg.

Our team is working on adding (append-only) incremental support for both single-node Python transforms and Pipeline Builder. I expect this is probably a couple of months out from becoming available.

I wonder if something like this could work if you wanted to manage incremental logic manually in the meantime?

from transforms.api import LightweightContext, transform
from transforms.tables import TableLightweightOutput, TableOutput, TableInput, TableLightweightInput


@transform.using(
    source_table=TableInput("/path/input"),
    output_table=TableOutput("/path/output")
)
def compute(ctx: LightweightContext, source_table: TableLightweightInput, output_table: TableLightweightOutput):
    conn = ctx.duckdb().conn

    # Register source
    source_reader = source_table.iceberg().table().scan().to_arrow_batch_reader()
    conn.register("source_tbl", source_reader)

    # Get current max id from output (handle empty/first-run case)
    try:
        output_reader = output_table.iceberg().table().scan().to_arrow_batch_reader()
        conn.register("output_tbl", output_reader)
        max_id = conn.sql("SELECT COALESCE(MAX(id), -1) AS m FROM output_tbl").fetchone()[0]
    except Exception:
        # Table doesn't exist yet or is empty
        max_id = -1

    # Select only new rows
    query_arrow = conn.sql(
        f"SELECT * FROM source_tbl WHERE id > {max_id}"
    ).to_arrow_table()

    output_table.iceberg().write(query_arrow, mode="append")

By the way, I’m realizing you can actually set the table version inside transforms as well if you write directly to the Iceberg table (not using our transforms write API).

Something like this should work:

from transforms.api import TransformContext, transform
from transforms.tables import TableOutput, TableTransformOutput


@transform.spark.using(
    output=TableOutput("/path/table_name"),
)
def compute(ctx: TransformContext, output: TableTransformOutput) -> None:
    df_custom = ctx.spark_session.createDataFrame([["Hello"], ["World"]], schema=["phrase"])

    # Write directly to the Iceberg table, requesting spec v3 at create time.
    df_custom.writeTo(output.identifier) \
        .using("iceberg") \
        .tableProperty("format-version", "3") \
        .createOrReplace()

We’ll add some documentation about this.
