Generating notional data

michelle · July 8, 2024, 8:27pm

I need to generate notional tabular data (100k+ rows). Pipeline Builder seems to cap generated notional data at 1k rows (https://www.palantir.com/docs/foundry/pipeline-builder/datasets-generated/). Does anyone have suggestions for generating more notional data programmatically?

bkaplan · July 8, 2024, 9:00pm

In Code Repositories, you can write a transform that uses a fake-data library to create this data programmatically.

michelle · July 8, 2024, 9:30pm

Thanks! Is this the library you had in mind? https://www.palantir.com/docs/foundry/code-examples/notional-data-generation-transforms/

Its dependencies seem dated and lag behind the versions of other libraries my repository requires.

bkaplan · July 8, 2024, 10:03pm

No specific library in mind. Here is an example using Faker:

from faker import Faker
from transforms.api import transform, Output

fake = Faker()

def generate_row(fake):
    return {
        "name": fake.name(),
        "address": fake.address(),
        "email": fake.email(),
        "job": fake.job(),
        "birthdate": fake.date_of_birth(minimum_age=22, maximum_age=90)
    }

@transform(
    output=Output("<Output_Location>"),  # replace with the path to your output dataset
)
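def my_transform(ctx, output):
    # Rows are built on the driver; raise the count (e.g. to 100_000) as needed.
    data = [generate_row(fake) for _ in range(100)]
    # Spark infers the schema from the dict keys and value types.
    df = ctx.spark_session.createDataFrame(data)
    output.write_dataframe(df)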
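
If adding Faker conflicts with the dependency versions already pinned in the repository, the same idea can be sketched with only the Python standard library. This is a rough illustration, not a replacement for a real fake-data library: the value pools, the column names, the `generate_rows` helper, and the birthdate range are all assumptions made up for this example. It also seeds the generator so that repeated builds produce the same 100k rows.

```python
import random
from datetime import date, timedelta

# Hypothetical value pools; replace with values that suit your schema.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan", "Casey"]
LAST_NAMES = ["Smith", "Garcia", "Chen", "Patel", "Nguyen", "Okafor"]
JOBS = ["Analyst", "Engineer", "Nurse", "Teacher", "Electrician"]


def generate_rows(num_rows, seed=0):
    """Yield notional rows one at a time; the same seed gives the same rows."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    earliest = date(1935, 1, 1)  # assumed lower bound for birthdates
    span_days = (date(2002, 1, 1) - earliest).days
    for i in range(num_rows):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        yield {
            "id": i,
            "name": f"{first} {last}",
            # The row index keeps every email unique.
            "email": f"{first}.{last}{i}@example.com".lower(),
            "job": rng.choice(JOBS),
            "birthdate": earliest + timedelta(days=rng.randrange(span_days)),
        }


rows = list(generate_rows(100_000, seed=42))
```

Inside a transform, `rows` could then be passed to `ctx.spark_session.createDataFrame(...)` in the same way as `data` in the Faker example. The list is still built on the driver, so for much larger volumes it may be worth generating the rows in a distributed way instead.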