Which is the better configuration for running Docker containers in parallel with a transform?

ChengYangUmich · June 21, 2024, 7:41am

Problem Statement:
A project needs to optimize certain inputs of a model so that its outputs reach the desired targets. However, the model is packaged in a Docker container, so it cannot be called directly from within the transform. The project therefore needs a sidecar.

Option A: Run in parallel with multiple container instances

Document Source: https://www.palantir.com/docs/foundry/transforms-python/transforms-sidecar/#example-2-parallel-execution

As in the example provided, define all combinations of inputs as a data frame.

| policyId | param 1 | param 2 | ... |
| -------- | ------- | ------- | --- |
| ...      | 0.1     | 0.2     | ... |
| ...      | 0.2     | 0.4     | ... |
| ...      | 0.3     | 0.5     | ... |
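
For illustration, here is a minimal sketch of how such a table of combinations could be generated, assuming a full grid over each parameter is wanted. The candidate values, the `param_1`/`param_2` names, and the integer `policyId` scheme are all hypothetical placeholders, not taken from the actual project.

```python
from itertools import product

# Hypothetical candidate values per parameter; replace with the real search grid.
param_grid = {
    "param_1": [0.1, 0.2, 0.3],
    "param_2": [0.2, 0.4, 0.5],
}


def build_rows(grid):
    """Return one dict per parameter combination, each with a unique policyId."""
    names = list(grid)
    rows = []
    # product() iterates over every combination; the last parameter varies fastest.
    for policy_id, values in enumerate(product(*(grid[n] for n in names))):
        row = {"policyId": policy_id}
        row.update(zip(names, values))
        rows.append(row)
    return rows


rows = build_rows(param_grid)
```

Inside a transform, a list like `rows` could then be turned into a Spark data frame (for example with `ctx.spark_session.createDataFrame(rows)`), so that each row becomes one unit of work.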

Within the transform, define a main function that formats each row into input files and triggers a container run, so that multiple container instances work in parallel, as shown below:

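    @transform(...)  # Input/Output declarations omitted
    def compute(output, output_rows, source, ctx):

        def main(a_row):
            # Each call drives one container run through the shared volume.
            format_inputs_into_shared_volume()
            copy_start_flag()
            wait_for_done_flag()
            copy_output_files()
            post_processing_results()

        # `policy` is assumed to be the input with one row per parameter combination.
        # Four partitions allow up to four rows to be processed concurrently.
        results = policy.dataframe().repartition(4).rdd.map(main)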
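
To make the steps inside `main` more concrete, here is a rough sketch of the flag-file handshake over a shared directory. It is only an assumption about how the container might be driven: the file names (`input.json`, `start_flag`, `done_flag`, `output.json`), the JSON format, and the `run_one` helper are hypothetical, and the real entrypoint of the container may expect something different.

```python
import json
import time
from pathlib import Path

# Hypothetical file names; use whatever the container's entrypoint actually expects.
INPUT_FILE = "input.json"
START_FLAG = "start_flag"
DONE_FLAG = "done_flag"
OUTPUT_FILE = "output.json"


def run_one(row, shared_dir, timeout_s=600.0, poll_s=0.5):
    """Write inputs, signal the container, wait for it to finish, return its output."""
    shared = Path(shared_dir)
    # Remove leftovers from a previous row so a stale done flag is not misread.
    for name in (START_FLAG, DONE_FLAG, OUTPUT_FILE):
        (shared / name).unlink(missing_ok=True)

    (shared / INPUT_FILE).write_text(json.dumps(row))
    (shared / START_FLAG).touch()  # signals the container to start

    # Poll for the done flag, giving up after timeout_s seconds.
    deadline = time.monotonic() + timeout_s
    while not (shared / DONE_FLAG).exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"no done flag for policyId={row.get('policyId')}")
        time.sleep(poll_s)

    result = json.loads((shared / OUTPUT_FILE).read_text())
    return {**row, **result}  # inputs and outputs merged into one record
```

In the transform, this function would play the role of `main`, for example `rdd.map(lambda r: run_one(r.asDict(), shared_dir))`, where `shared_dir` is the mount path of the shared volume. The timeout is there so that a crashed container fails the task instead of hanging it forever.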

Option B: Run in parallel within a single container instance

Use a parallel job library such as Ray [https://docs.ray.io/en/latest/ray-overview/index.html], wrap everything inside the container, and have the transform trigger the instance only once.
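
As a rough sketch of what Option B could look like inside the container: the code below fans out over all parameter combinations from a single process. Ray is not assumed to be installed here, so the standard library's `ThreadPoolExecutor` is used as a stand-in; with Ray, the pool would be replaced by remote tasks. `run_model` and its scoring formula are hypothetical placeholders for the real model call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_model(params):
    """Hypothetical stand-in for one model evaluation inside the container."""
    return {
        "policyId": params["policyId"],
        "score": params["param_1"] * params["param_2"],
    }


def run_all(rows, max_workers=4):
    """Evaluate every parameter combination with a fixed-size worker pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with rows.
        return list(pool.map(run_model, rows))


rows = [
    {"policyId": 0, "param_1": 2, "param_2": 3},
    {"policyId": 1, "param_1": 4, "param_2": 5},
]
results = run_all(rows)
```

Threads only help here if the model call releases the GIL or runs as an external process; for CPU-bound Python code, a process pool or Ray would be the more likely choice. Either way, the transform would write all rows to the shared volume once, trigger the container once, and read back all results at the end.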