API call to get the number of rows of a dataset

Hello Community team,

I’m currently looking for an API call to retrieve the number of rows in a dataset. I need this to avoid using count() in the code, which takes a lot of time for large datasets and very often leads to OOMs (see code below).

I found the Foundry Stack Overflow post here (Utilizing Foundry APIs, how do you get the number of rows and columns for a dataset? - Stack Overflow), where the second, simpler solution doesn’t seem to work anymore.

import requests
import json

def getComputedDatasetStats(token, dataset_rid, api_base='https://.....'):
    """Fetch the precomputed dataset statistics (row count, size, column stats)."""
    response = requests.post(
        url=f'{api_base}/foundry-stats/api/computed-stats-v2/get',
        headers={
            'content-type': 'application/json',
            'Authorization': 'Bearer ' + token
        },
        data=json.dumps({
            "datasetRid": dataset_rid,
            "branch": "master"
        })
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body as JSON
    return response.json()

token = 'eyJwb.....'
dataset_rid = 'ri.foundry.main.dataset.1d9ef04e-7ec6-456e-8326-1c64b1105431'

result = getComputedDatasetStats(token, dataset_rid)

# full resulting json:
# print(json.dumps(result, indent=4))

# required statistics:
print('size:', result['computedDatasetStats']['sizeInBytes'])
print('rows:', result['computedDatasetStats']['rowCount'])
print('cols:', len(result['computedDatasetStats']['columnStats']))

When I try the call, the answer for computedDatasetStats is empty (see response below).

{'datasetRid': 'ri.foundry.main.dataset.eba120ad-a65d-469c-89eb-bfdce138a7be', 'branch': 'master', 'endTransactionRid': 'ri.foundry.main.transaction.00000047-xxxxxxxxxxx', 'schemaId': '0000000-xxxxxxxxxxxx', 'computedDatasetStats': None}
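
For reference, the result handling can at least be guarded against this case; a small sketch based only on the response shape above:

stats = result.get('computedDatasetStats')
if stats is None:
    # Stats have apparently not been (pre)computed for this branch/transaction.
    print('no precomputed stats for', result.get('datasetRid'))
else:
    print('size:', stats['sizeInBytes'])
    print('rows:', stats['rowCount'])
    print('cols:', len(stats['columnStats']))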

Has anyone ever been able to get this endpoint to work, or does anyone know another simple and functional API call to achieve this?

Best,

Does the UI show the row count on the dataset? (I think it is below the path on the left side; if not, there is a "Calculate stats" link.)

Some kind of compute will have to go through all of your dataset files and sum up the row counts. Either it's your count() in the Spark code, or the stats API will submit a similar Spark job, which might time out.

One recommendation would be to have your dataset backed by Parquet files, which will speed up the stats calculation (Spark only has to read the metadata section of each Parquet file). To illustrate, see the sketch below.
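
A minimal, Foundry-agnostic sketch of that idea, assuming pyarrow is available (the path is a placeholder): the row count is stored in each Parquet file's footer, so it can be summed without touching the data pages.

import glob
import pyarrow.parquet as pq

# Sum the row counts recorded in the Parquet footers; only metadata is read,
# never the data pages themselves.
total_rows = sum(
    pq.ParquetFile(path).metadata.num_rows
    for path in glob.glob('/data/my_dataset/*.parquet')  # placeholder path
)
print('rows:', total_rows)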

Thanks for your answer nicornk,

Indeed, the UI shows the total number of rows, and when it does not, we have the option to click to calculate the row count (and the execution is rather fast, unlike the count() function in the code). It is exactly the API call behind this click that we would like, because we have a use case where we need the size of the input dataset in the code.

Our datasets are already backed by Parquet files, but since we have very large datasets (TB scale), calling count() takes forever to run and ends with an OOM.
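
As a possible workaround until the stats endpoint behaves, the same footer trick could be applied inside a transform instead of calling count(). This is an untested sketch: it assumes the transforms.api interface, Parquet-backed input files, pyarrow available in the environment, and seekable file handles; both paths are placeholders.

from transforms.api import transform, Input, Output
import pyarrow.parquet as pq

@transform(
    out=Output('/Project/folder/row_count'),        # placeholder paths
    source=Input('/Project/folder/huge_dataset'),
)
def compute_row_count(ctx, out, source):
    fs = source.filesystem()
    total_rows = 0
    # Read only the Parquet footers, never the data pages, so there is no
    # full scan and no count()-style OOM.
    for f in fs.ls(glob='**/*.parquet'):
        with fs.open(f.path, 'rb') as fh:
            total_rows += pq.ParquetFile(fh).metadata.num_rows
    out.write_dataframe(
        ctx.spark_session.createDataFrame([(total_rows,)], ['row_count'])
    )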

I don’t know if this is helpful or not, but you could use deployed pipelines to trigger a Spark job to compute the stats. Deployed pipelines allow you to submit parameters to your transforms in Code Repositories from Workshop. They will even create a feature branch and send the output to your Ontology. You can view the job progress in Workshop and see the updates when they are done. We use this for many features that require processing huge amounts of data, such as A/B tests. The documentation is non-existent, though. I have a PDF I received from Palantir that walks you through the process and can post it if you are interested.