API call to get the number of rows of a dataset

Hello Community team,

I’m currently looking for an API call to retrieve the number of rows in a dataset. I need this to avoid using count() in the code, which takes a lot of time for large datasets and very often leads to OOMs (see code below).

I found the Foundry Stack Overflow post here (Utilizing Foundry APIs, how do you get the number of rows and columns for a dataset? - Stack Overflow), where the second, simpler solution doesn’t seem to work anymore.

import requests
import json

def getComputedDatasetStats(token, dataset_rid, api_base='https://.....'):
    """Fetch the precomputed dataset statistics (row count, size, column stats)."""
    response = requests.post(
        url=f'{api_base}/foundry-stats/api/computed-stats-v2/get',
        headers={
            'content-type': 'application/json',
            'Authorization': 'Bearer ' + token
        },
        data=json.dumps({
            "datasetRid": dataset_rid,
            "branch": "master"
        })
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body as JSON
    return response.json()

token = 'eyJwb.....'
dataset_rid = 'ri.foundry.main.dataset.1d9ef04e-7ec6-456e-8326-1c64b1105431'

result = getComputedDatasetStats(token, dataset_rid)

# full resulting json:
# print(json.dumps(result, indent=4))

# required statistics:
print('size:', result['computedDatasetStats']['sizeInBytes'])
print('rows:', result['computedDatasetStats']['rowCount'])
print('cols:', len(result['computedDatasetStats']['columnStats']))

When I try the call, the answer for computedDatasetStats is empty (see response below).

{'datasetRid': 'ri.foundry.main.dataset.eba120ad-a65d-469c-89eb-bfdce138a7be', 'branch': 'master', 'endTransactionRid': 'ri.foundry.main.transaction.00000047-xxxxxxxxxxx', 'schemaId': '0000000-xxxxxxxxxxxx', 'computedDatasetStats': None}
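
For reference, the result handling can at least be guarded against this case; a small sketch based only on the response shape above:

stats = result.get('computedDatasetStats')
if stats is None:
    # Stats have apparently not been (pre)computed for this branch/transaction.
    print('no precomputed stats for', result.get('datasetRid'))
else:
    print('size:', stats['sizeInBytes'])
    print('rows:', stats['rowCount'])
    print('cols:', len(stats['columnStats']))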

Has anyone ever been able to get this endpoint to work, or does anyone know another simple and functional API call to achieve this?

Best,

Does the UI show the row count on the dataset? (I think it is below the path on the left side; if not, there is a "Calculate stats" link.)

Some kind of compute will have to go through all of your dataset files and sum up the row counts. Either it's your count() in the Spark code, or the stats API will submit a similar Spark job, which might time out.

One recommendation would be to have your dataset backed by Parquet files, which will speed up the stats calculation (Spark only has to read the metadata section of each Parquet file). To illustrate, see the sketch below.
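
A minimal, Foundry-agnostic sketch of that idea, assuming pyarrow is available (the path is a placeholder): the row count is stored in each Parquet file's footer, so it can be summed without touching the data pages.

import glob
import pyarrow.parquet as pq

# Sum the row counts recorded in the Parquet footers; only metadata is read,
# never the data pages themselves.
total_rows = sum(
    pq.ParquetFile(path).metadata.num_rows
    for path in glob.glob('/data/my_dataset/*.parquet')  # placeholder path
)
print('rows:', total_rows)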

Thanks for your answer nicornk,

Indeed, the UI shows the total number of rows, and when it does not, we have the option to click to calculate the row count (and the execution is rather fast, unlike the count() function in the code). It is exactly the API call behind this click that we would like, because we have a use case where we need the size of the input dataset in the code.

Our datasets are already backed by Parquet files, but since we have very large datasets (TB scale), calling count() takes forever to run and ends with an OOM.
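
As a possible workaround until the stats endpoint behaves, the same footer trick could be applied inside a transform instead of calling count(). This is an untested sketch: it assumes the transforms.api interface, Parquet-backed input files, pyarrow available in the environment, and seekable file handles; both paths are placeholders.

from transforms.api import transform, Input, Output
import pyarrow.parquet as pq

@transform(
    out=Output('/Project/folder/row_count'),        # placeholder paths
    source=Input('/Project/folder/huge_dataset'),
)
def compute_row_count(ctx, out, source):
    fs = source.filesystem()
    total_rows = 0
    # Read only the Parquet footers, never the data pages, so there is no
    # full scan and no count()-style OOM.
    for f in fs.ls(glob='**/*.parquet'):
        with fs.open(f.path, 'rb') as fh:
            total_rows += pq.ParquetFile(fh).metadata.num_rows
    out.write_dataframe(
        ctx.spark_session.createDataFrame([(total_rows,)], ['row_count'])
    )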

I don’t know if this is helpful or not, but you could use deployed pipelines to trigger a Spark job to compute the stats. Deployed pipelines allow you to submit parameters to your transforms in Code Repositories from Workshop. They will even create a feature branch and send the output to your Ontology. You can view the job progress in Workshop and see the updates when they are done. We use this for many features that require processing huge amounts of data, such as A/B tests. The documentation is non-existent, though. I have a PDF I received from Palantir that walks you through the process and can post it if you are interested.