I need to generate notional tabular data (100k+ rows). Pipeline Builder seems to have a cap of 1k rows when generating notional data (https://www.palantir.com/docs/foundry/pipeline-builder/datasets-generated/). Does anyone have suggestions on how to generate more notional data programmatically?
In Code Repositories, you can write a transform that uses a fake-data library to create this data programmatically.
Thanks! Is this the library you had in mind? https://www.palantir.com/docs/foundry/code-examples/notional-data-generation-transforms/
Its dependencies seem dated and lag behind the versions required by other libraries in my repository.
No specific library in mind. Here is an example using Faker:
from faker import Faker
from transforms.api import transform, Output

fake = Faker()


def generate_row(fake):
    # Build one notional record as a plain dict
    return {
        "name": fake.name(),
        "address": fake.address(),
        "email": fake.email(),
        "job": fake.job(),
        "birthdate": fake.date_of_birth(minimum_age=22, maximum_age=90),
    }


@transform(
    output=Output("<Output_Location>"),  # replace with the path to your output dataset
)
def my_transform(ctx, output):
    # Rows are built in a plain Python list; increase the count for a larger dataset
    data = [generate_row(fake) for _ in range(100)]
    df = ctx.spark_session.createDataFrame(data)
    output.write_dataframe(df)
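If dependency conflicts make adding a fake-data library awkward, one possible alternative is to build the rows with only the Python standard library. The sketch below is an assumption-laden illustration, not a Foundry or Faker API: the name lists, the `generate_rows` helper, the `id` column, and the fixed `today` date are all hypothetical choices made for the example. It uses a seeded `random.Random` so the same seed reproduces the same 100k rows.

```python
import random
from datetime import date, timedelta

# Hypothetical value pools; swap in whatever suits your notional dataset.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Linus", "Barbara"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing", "Torvalds", "Liskov"]
JOBS = ["Engineer", "Analyst", "Designer", "Technician"]


def generate_rows(n_rows, seed=0, today=date(2024, 1, 1)):
    """Yield n_rows notional records; the same seed yields the same rows."""
    rng = random.Random(seed)  # private generator, independent of global state
    for i in range(n_rows):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        # Age in days, roughly 22 to 90 years (leap days ignored)
        age_days = rng.randint(22 * 365, 90 * 365)
        yield {
            "id": i,
            "name": f"{first} {last}",
            "email": f"{first}.{last}{i}@example.com".lower(),
            "job": rng.choice(JOBS),
            "birthdate": today - timedelta(days=age_days),
        }


rows = list(generate_rows(100_000))
print(len(rows))
```

Inside a transform, the resulting list could be passed to ctx.spark_session.createDataFrame(rows) in the same way as the Faker example. The values are far less varied than what Faker produces, so this is only a fit when realism matters less than row count.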