Hello Community team,
We have a use case for optimizing our resource costs: we want to dynamically choose the profiles used when building our transform, based on the number of partitions in a dataset.
To do this, as a test, we have defined 2 simple functions before the transform:
getDatasetFilesStats(), which counts the partitions in the dataset via an API call (see code below), and
choose_profile(), which picks the profile to use according to that number of partitions, and whose result we pass as the configure parameter (see code below).
Everything works fine in preview mode. However, when we launch the build, we get a host connection error: "Failed to create connection to host, This could be from the device being busy or an error in name resolution. Please retry"
Has anyone come across a similar use case or issue? My only guess is that it happens because the API call is made before the transform runs (moving the same API call inside the transform makes the build work) and that something related to the host is missing at that point, but we don’t know what.
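To illustrate the timing I suspect, here is a minimal sketch in plain Python. `fake_configure` and `fake_choose_profile` are hypothetical stand-ins that I made up for the example; they are not the Foundry API. The sketch only shows that the argument of a decorator is evaluated when the module is loaded, not when the decorated function is called.

```python
calls = []  # records the order in which things run


def fake_choose_profile():
    # Hypothetical stand-in for choose_profile(); it only records that it ran.
    calls.append("choose_profile")
    return ["DRIVER_MEMORY_SMALL"]


def fake_configure(profile):
    # Hypothetical stand-in for a decorator factory such as @configure.
    # This is NOT the Foundry implementation.
    def decorator(func):
        func.profile = profile
        return func
    return decorator


# The call fake_choose_profile() is evaluated here, while the module is loading.
@fake_configure(fake_choose_profile())
def compute():
    calls.append("compute")


# compute() has not been called yet, but the profile function has already run.
print(calls)
compute()
print(calls)
```

If the same holds for our transform, the API call would run wherever the module is first loaded during the build, which may not be the same environment as the one in which compute() runs.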
Any help solving this issue, or other ideas for achieving this use case, would be greatly appreciated.
Best,
Wilfried
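import json

import requests
from transforms.api import configure, transform, Input, Output


def getDatasetFilesStats():
    # token_value and datasetRid are placeholders for the real values
    token = token_value
    # List the files of the dataset on the master branch
    response = requests.get(
        url="host/api/v1/datasets/{}/files".format(datasetRid),
        headers={
            'content-type': 'application/json',
            'Authorization': 'Bearer ' + token
        },
        data=json.dumps({
            "datasetRid": datasetRid,
            "branch": "master"
        })
    )
    # One entry per file, which we use as the number of partitions
    return len(response.json()["data"])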


def choose_profile():
    # Runs the API call, then picks the profile from the partition count
    number_partition = getDatasetFilesStats()
    print("number_partition", number_partition)
    if number_partition > 50:
        profile = ["DRIVER_MEMORY_SMALL"]
    else:
        profile = ["DRIVER_MEMORY_MEDIUM", "NUM_EXECUTORS_4"]
    print("profile chosen", profile)
    return profile


@configure(choose_profile())
@transform(
    output=Output("output_rid"),
    input_df=Input("input_rid"),
)
def compute(ctx, input_df, output):
    output.write_dataframe(input_df.dataframe())
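
The selection rule in choose_profile() can also be checked separately from the API call. Below is a minimal sketch with the partition count passed in as an argument, so it runs without any network access. `profile_for` and `PARTITION_THRESHOLD` are names introduced only for this example; the threshold and the profile names are the ones from the code above.

```python
PARTITION_THRESHOLD = 50  # same cut-off as in choose_profile()


def profile_for(number_partition):
    """Return the profile list for a given partition count (no API call)."""
    if number_partition > PARTITION_THRESHOLD:
        return ["DRIVER_MEMORY_SMALL"]
    return ["DRIVER_MEMORY_MEDIUM", "NUM_EXECUTORS_4"]


# Values on either side of the threshold
print(profile_for(10))
print(profile_for(50))
print(profile_for(51))
```

Called this way, the rule behaves the same in preview and in a build, which suggests the difference we see comes from the API call rather than from the rule itself.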