Hello! I have a question about architecting pipelines in Foundry that I need advice on.
For our product, we integrate data from customer ERPs into a unified ontology and build a product on top of that ontology. There are ~10 possible ERP options, and each customer has 1 ERP integration.
Each integration with a particular customer’s system will require some customization/configuration. However, we think that ~90% of the transformations will be shared between customers on the same source system.
For example, if two customers are using NetSuite, one customer may use a custom field for a property, while another uses a stock NetSuite field.
My question is: how can I sustainably architect a system like this in Foundry (where 90% of the code is shared, but 100% of integrations require some customization on a per-customer basis)? When I make updates to the shared code, I want those updates deployed to all customers, but I also want to be able to override or change them per customer.
I would love not to have to write giant if statements… Any ideas?
I suggest designing your pipeline to promote code reusability. Here are two suggestions:
Have shared repositories that house same-domain logic along with shared utilities. For example, you can have a cleaning layer that runs after all raw data lands in Foundry. All the cleaning transformations would reside in this repository. You can build utility libraries within it and import them into each transformation. If you need to override one of them, you go to that individual transformation and make the necessary adjustments.
Maintain a library of shared code in its own repository and import it into each relevant repository for use within a transformation (see the documentation on shared Python libraries).
If you design the shared code in a highly modular fashion, you can easily override or rewrite significant portions of a transformation.
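To make the override idea concrete, here is a minimal sketch of a cleaning transform that imports a shared helper and applies a per-customer tweak in the individual transform. The module name `shared_cleaning`, the helper `standardize_invoice_columns`, the field name, and the dataset paths are all hypothetical; the Foundry transforms decorator and PySpark calls are standard.

```python
from pyspark.sql import DataFrame, functions as F
from transforms.api import transform_df, Input, Output

# Hypothetical shared utility library published from the shared repository.
from shared_cleaning import standardize_invoice_columns


@transform_df(
    Output("/Customers/acme/clean/invoices"),
    raw=Input("/Customers/acme/raw/invoices"),
)
def clean_invoices(raw: DataFrame) -> DataFrame:
    # ~90% of the logic is shared and lives in the library.
    df = standardize_invoice_columns(raw)
    # The per-customer override lives only in this transform: this customer
    # stores the invoice amount in a custom field instead of the stock one.
    return df.withColumn("amount", F.col("custbody_invoice_amount"))
```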
Can we discuss these approaches in more detail? It is easy to say “write shared code”, but doing it in practice is much more challenging in Foundry.
For example, are you suggesting that we:
Have N pipelines for N customers, where each pipeline uses the shared code as a framework, but the per-customer overrides happen in each pipeline's repo?
Have 10 pipelines, one for each ERP we support integrating with? Then, for every new customer, we union their data connections together (not sure how to do that programmatically) and apply the custom overrides (which live in code, I guess?)
Create Marketplace products of the syncs for each ERP system (you’ll always need to manually make a new source for each customer, at least with today’s capabilities, because of credentials at the very least). Make sure this product installs in “bootstrap” mode so the syncs can be customized as needed for each customer. Alternatively, if the tables are closer to 99% the same across customers, then you could consider installing these syncs in “production” mode so they will pick up updates you make centrally; you would then need to add an additional sync or two per customer for custom tables.
For the pipeline (I’m assuming you’re using Python), conceptually I think you either want to (A) map everything to a common schema and then apply all transforms, or (B) programmatically modify your processing logic to accommodate schema deviations across customers.
For (A), you could use transform generators to create each transform for each table you care about that needs its schema mapped to the centralized schema. For tables that need custom handling, you could simply omit that table from the iterable the transform generator uses. You would handle that table as a special case. You could centralize the logic that the transform generator uses in a Python library, which would allow for reuse in the same way libraries do outside Foundry.
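Here is a minimal sketch of that transform-generator pattern, assuming a hypothetical shared-library function `map_to_common_schema` and hypothetical dataset paths. The generated transforms still need to be registered in the repository's pipeline definition (typically by adding them to the Pipeline object in pipeline.py).

```python
from transforms.api import transform_df, Input, Output

# Hypothetical shared library holding the centralized mapping logic.
from erp_common.netsuite import map_to_common_schema


def generate_transform(table_name):
    @transform_df(
        Output(f"/Customers/acme/standardized/{table_name}"),
        raw=Input(f"/Customers/acme/raw/{table_name}"),
    )
    def compute(raw):
        return map_to_common_schema(raw, table_name)

    return compute


# Tables that need custom handling are simply omitted from this iterable
# and written as standalone transforms instead.
STANDARD_TABLES = ["invoices", "customers", "items"]
TRANSFORMS = [generate_transform(t) for t in STANDARD_TABLES]
```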
For (B), you could have a library of transforms that are managed centrally, and which you modify via decorators for each customer (e.g. the decorator might do the schema mapping and accept per-customer logic for that). You could leverage a “bootstrap” Marketplace product for installing the initial Python repo. Any updates to the processing logic could be propagated by updating the libraries in each repo. You could improve that workflow with a script that pulls all the repos into GitHub or some other system and updates the dependencies before pushing the changes back to all the Foundry repos. I’m assuming one repo per customer, since I assume CI/CD checks would take a very long time with a sufficiently large repo.
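A minimal sketch of the decorator idea in (B): a per-customer field map is applied by a small wrapper before the centrally managed logic runs. The library module `erp_common.netsuite`, the function `process_invoices`, the field map, and the paths are hypothetical stand-ins.

```python
from functools import wraps

from pyspark.sql import DataFrame
from transforms.api import transform_df, Input, Output

# Hypothetical central library; `process_invoices` only ever sees the
# common schema.
from erp_common.netsuite import process_invoices

# Hypothetical per-customer field mapping, kept in this customer's repo.
ACME_FIELD_MAP = {"custbody_invoice_amount": "amount"}


def with_customer_schema(field_map):
    """Rename customer-specific columns to the common schema before the
    centrally managed logic runs."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(raw: DataFrame) -> DataFrame:
            for src, dst in field_map.items():
                raw = raw.withColumnRenamed(src, dst)
            return fn(raw)
        return wrapper
    return decorator


@transform_df(
    Output("/Customers/acme/processed/invoices"),
    raw=Input("/Customers/acme/clean/invoices"),
)
@with_customer_schema(ACME_FIELD_MAP)
def acme_invoices(raw: DataFrame) -> DataFrame:
    return process_invoices(raw)
```

When central logic changes, you bump the library version in each customer repo; the per-customer decorator arguments stay untouched.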
To sum it up, I would prefer to coerce all the schemas to a common schema first, and then apply a single transforms pipeline to them. The tools I would rely on are:
Python libraries
Potentially custom decorators
Transform generators
Marketplace products installed in “bootstrap” or “production” mode depending on the situation
Yes, in this context, I’m referring to a pipeline as a collection of transforms, where each transform is a unit of work that takes one or more input datasets and outputs its own dataset. All of these transforms can reside in one repository. For example, all cleaning transforms live in a cleaning repository; they house similar logic that differs slightly per dataset. Within that repository you can also store shared code.
Within each transform, you can modify dataset-specific logic and adjust the compute requirements per transformation. This approach not only allows for easy customization but also lets you optimize resource usage.
Then you can schedule all of those transformations to run together if desired.
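A minimal sketch of two cleaning transforms living in the same repository, each with its own compute profile (the dataset paths are hypothetical, and which Spark profile names are available depends on what your enrollment has enabled):

```python
from transforms.api import configure, transform_df, Input, Output


@configure(profile=["EXECUTOR_MEMORY_MEDIUM"])  # heavier dataset gets more memory
@transform_df(
    Output("/Customers/acme/clean/invoices"),
    raw=Input("/Customers/acme/raw/invoices"),
)
def clean_invoices(raw):
    return raw.dropDuplicates(["invoice_id"])


@transform_df(
    Output("/Customers/acme/clean/currencies"),
    raw=Input("/Customers/acme/raw/currencies"),
)
def clean_currencies(raw):
    # Small lookup table: the default profile is enough.
    return raw.dropDuplicates(["currency_code"])
```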
Exactly the type of answer I am looking for, thank you.
The syncs would definitely be more like 99% the same (I hope).
The pipelines are where there will definitely be per-customer differences.
Do you think there is a way to accomplish what we are describing via Marketplace with Pipeline Builder? It would be incredible to be able to enable my non-technical but data-literate co-founder to help with the maintenance of these pipelines. Most of the work I need to do in these pipelines is:
Get the data in for each customer and coerce it to a common schema. This is where you’ll accumulate some tech debt no matter what approach you choose, due to the need for custom logic to handle each customer’s data.
Create a pipeline (or series of pipelines) that can be deployed via a “production” Marketplace product. Your colleague can write these and update the product when changes need to go out. Usual caveats apply about breaking changes, etc.
I’m not sure what the end goal is – one unioned dataset of everything? If yes, then although you can do that in Pipeline Builder, I would instead write a small Python transform that does it, because for me that’s more ergonomic (e.g. you could show your colleague how to add an input to the transform and it would just work).
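A minimal sketch of that small Python transform, assuming each customer’s data has already been coerced to the common schema upstream (the paths and customer names are hypothetical):

```python
from functools import reduce

from pyspark.sql import DataFrame
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Product/unified/invoices"),
    acme=Input("/Customers/acme/standardized/invoices"),
    globex=Input("/Customers/globex/standardized/invoices"),
)
def unioned_invoices(acme: DataFrame, globex: DataFrame) -> DataFrame:
    # Onboarding a new customer means adding one Input above and one name
    # here; unionByName keeps the union robust to column ordering.
    return reduce(lambda left, right: left.unionByName(right), [acme, globex])
```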