Allow Usage of JDBC Sources in external transforms

nicornk · August 26, 2024, 2:38pm

Hi

Context:

Regular Syncs created in the UI offer only a limited set of features, which is especially restrictive when dynamic logic decisions need to be made while ingesting data. Only a limited incremental concept is supported, based on filtering on incremental columns.

Table-based Exports support only a subset of what legacy Export Tasks offer (no pre- or post-SQL features).

Pro-code users need more flexibility while working with JDBC or other Tabular Sources. Writing code while ingesting from a dataset allows for more flexible decisions about queries and result sets.

The ask is to allow usage of JDBC/Tabular Sources (e.g. Postgres) in external transforms in Code Repositories.

The Python API could offer functionality to send arbitrary SQL queries and return an Arrow RecordBatch, which can be persisted to the Foundry filesystem or read into Spark for further modification.
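To illustrate the shape of API I have in mind, here is a rough sketch. Everything in it is an assumption for illustration: `query_batches` is a hypothetical helper, not an existing Foundry API, an in-memory `sqlite3` database stands in for the JDBC Source, and a plain dict of column name to list of values stands in for an Arrow RecordBatch.

```python
import sqlite3
from typing import Any, Dict, Iterator, List, Sequence


def query_batches(
    conn: sqlite3.Connection,
    sql: str,
    params: Sequence[Any] = (),
    batch_size: int = 1000,
) -> Iterator[Dict[str, List[Any]]]:
    """Hypothetical helper: run an arbitrary SQL query and yield columnar batches.

    Each batch maps column name -> list of values, a stand-in for an
    Arrow RecordBatch. This is not an existing Foundry API.
    """
    cursor = conn.execute(sql, params)
    columns = [description[0] for description in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Pivot the row tuples into one list per column.
        yield {name: [row[i] for row in rows] for i, name in enumerate(columns)}


# Stand-in for the JDBC Source: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(1, 10.0), (2, 20.5), (3, 7.25)],
)

# The query text and parameters are chosen in code, not in a UI form.
batches = list(
    query_batches(
        conn,
        "SELECT id, amount FROM orders WHERE amount > ? ORDER BY id",
        (8.0,),
        batch_size=1,
    )
)
```

In a real implementation each batch could be converted with something like `pyarrow.RecordBatch.from_pydict` and then written to the output dataset or handed to Spark. How the connection would be obtained from the Source inside an external transform is exactly the part that does not exist today.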

Agent Proxy infrastructure could be leveraged to call into Agent-bound Sources.

Why we cannot do it today:

The feature exists only for REST Sources.

Workarounds:

Perform ingestions outside of Foundry.
Manage the ingestion without a Source and add the JDBC driver as a Maven dependency. (Does not work for on-premise Sources.)
For Exports, use legacy Export Tasks.

Benefits:

Pro-code ingestions could be scripted far more flexibly, for example to implement custom incremental ingestion logic.
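As an example of the kind of incremental logic I mean, here is a minimal sketch. Assumptions, all hypothetical: an in-memory `sqlite3` database stands in for the JDBC Source, the `events` table and its columns are made up, `ingest_increment` is my own helper name, and the watermark is simply passed in and returned rather than stored in incremental transform state.

```python
import sqlite3
from typing import List, Optional, Tuple


def ingest_increment(
    conn: sqlite3.Connection, last_seen_id: Optional[int]
) -> Tuple[List[tuple], Optional[int]]:
    """Hypothetical helper: fetch new rows and return them with the new watermark."""
    if last_seen_id is None:
        # No watermark yet: decide in code to take a full snapshot.
        sql, params = "SELECT id, status FROM events ORDER BY id", ()
    else:
        # Watermark known: only fetch rows added since the last run.
        sql, params = (
            "SELECT id, status FROM events WHERE id > ? ORDER BY id",
            (last_seen_id,),
        )
    rows = conn.execute(sql, params).fetchall()
    # Keep the old watermark when nothing new arrived.
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark


# Stand-in for the JDBC Source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO events (id, status) VALUES (?, ?)",
    [(1, "open"), (2, "open")],
)

first_rows, watermark = ingest_increment(conn, None)        # full snapshot
conn.execute("INSERT INTO events (id, status) VALUES (3, 'closed')")
second_rows, watermark = ingest_increment(conn, watermark)  # only the new row
third_rows, watermark = ingest_increment(conn, watermark)   # nothing new
```

The point is that the branch between snapshot and increment, and the query text itself, live in code. The same place could hold any other decision, such as re-reading a window of recent rows to catch late updates.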