I’m looking for guidance on replatforming a data pipeline from Palantir Foundry to Databricks, with the additional requirement of keeping both systems in sync during the transition.
Currently, the pipeline consists of multiple source tables that are cleaned and then unioned with other tables to produce a set of downstream datasets. These final datasets are consumed by a Workshop application.
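For context, the Foundry side has roughly this shape (a simplified sketch, not the real code — dataset paths, column names, and the cleaning logic are all placeholders):

```python
from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F

@transform_df(
    Output("/Project/datasets/downstream_dataset"),  # placeholder paths
    source_a=Input("/Project/raw/source_a"),
    source_b=Input("/Project/raw/source_b"),
)
def build_downstream(source_a, source_b):
    # Placeholder cleaning: trim a text column, drop rows without a key
    clean_a = source_a.withColumn("name", F.trim(F.col("name"))).dropna(subset=["id"])
    clean_b = source_b.withColumn("name", F.trim(F.col("name"))).dropna(subset=["id"])
    # Union the cleaned sources into one downstream dataset used by Workshop
    return clean_a.unionByName(clean_b)
```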
A key requirement is to keep the pipelines in sync with Databricks: whenever new data arrives in a Palantir dataset, it should also be propagated to Databricks (ideally in near real-time, or at least via incremental updates).
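The direction I’ve been leaning for the sync is exporting incremental files from Foundry to cloud storage and ingesting them with Databricks Auto Loader. A minimal sketch of the Databricks side, assuming Parquet files land under an S3 prefix (the bucket, paths, and table name are my placeholders; `spark` is the ambient SparkSession in a Databricks notebook):

```python
# Ingest files exported from Foundry into a bronze Delta table.
# Auto Loader tracks already-processed files via the checkpoint location.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .load("s3://my-bucket/foundry-exports/source_a/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/source_a")
    .trigger(availableNow=True)  # incremental batch; drop for continuous streaming
    .toTable("bronze.source_a"))
```

I don’t know whether that’s the right pattern versus a JDBC pull or a proper CDC feed, which is part of what I’m asking below.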
I’m trying to understand the best approach to both syncing and migrating this pipeline efficiently and reliably. Some specific questions I have:
- What’s the recommended strategy for syncing new/updated data from Palantir into Databricks?
- Should this be handled via batch jobs, streaming, or CDC (change data capture)?
- What’s the best way to translate transformations into Databricks (e.g., Spark/Delta)? (I’ve sketched my working assumption right after this list.)
- Are there best practices for handling table dependencies and unions during migration?
- How should I validate that the migrated pipeline produces consistent results? (A rough validation sketch also follows the list.)
- Any tooling or frameworks that can help automate parts of this process?
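On the translation question, my working assumption is that each Foundry transform becomes a Databricks job that writes a Delta table, carrying over the same clean-and-union logic. A rough sketch under that assumption (table names are placeholders):

```python
from pyspark.sql import functions as F

def build_downstream():
    # Same clean-and-union logic as the Foundry transform above,
    # rewritten as a Databricks job writing a Delta table.
    clean_a = (spark.table("bronze.source_a")
               .withColumn("name", F.trim(F.col("name")))
               .dropna(subset=["id"]))
    clean_b = (spark.table("bronze.source_b")
               .withColumn("name", F.trim(F.col("name")))
               .dropna(subset=["id"]))
    (clean_a.unionByName(clean_b)
        .write.format("delta")
        .mode("overwrite")
        .saveAsTable("silver.downstream_dataset"))
```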
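For validation, the plan I have so far is to sync the Foundry outputs into Databricks and compare them against the migrated outputs: row counts first, then a symmetric difference. A rough sketch, assuming both sides are queryable as tables (names are placeholders; this is an exact-match comparison, so floating-point columns would need tolerance handling):

```python
def compare_outputs(expected_table: str, actual_table: str) -> None:
    """Compare the synced Foundry output against the migrated output."""
    expected = spark.table(expected_table)
    actual = spark.table(actual_table)

    # Cheap first check: row counts must match.
    n_exp, n_act = expected.count(), actual.count()
    assert n_exp == n_act, f"row counts differ: {n_exp} vs {n_act}"

    # Stronger check: the symmetric difference should be empty.
    missing = expected.exceptAll(actual).count()
    extra = actual.exceptAll(expected).count()
    assert missing == 0 and extra == 0, (
        f"{missing} rows missing from migrated output, {extra} unexpected rows")
```

Is that a reasonable baseline, or do people reach for a dedicated data-diff tool here?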
Would appreciate any advice, patterns, or lessons learned from similar migrations.