I work with Jupyter Code Workspaces. When reading a dataset that consists of CSV files, I get many string columns. In my case, this happens because only the bare CSV files are read, even though the dataset has several different data types saved in its schema. How can I read the Foundry data types while reading the CSV dataset?
Hey @ZygD,
Just to make sure I understand the situation correctly.
You created a dataset in Foundry and uploaded a raw CSV file. You get something like the following screenshot (with columns of a certain type, here integer).
Then you created a Jupyter Code Workspace and tried to read this dataset, but you only get string columns?
Which dataset format are you using to read the dataset in the Code Workspace?
Best!
Thank you for the reply. Sorry, I did not provide enough details at the beginning.
I am using Spark, because I’m just testing a small part of a bigger Spark pipeline.
from foundry.transforms import Dataset
from pyspark.sql import SparkSession

# Reuse the active Spark session
spark = SparkSession.getActiveSession()

# Download the raw CSV files backing the dataset
ds_foundry = Dataset.get('my_inp')
files = ds_foundry.files().download()

# Read the downloaded files with Spark's CSV reader
options = {'header': 'true', 'quote': '"', 'escape': '"'}
df = spark.read.options(**options).csv(*list(files.values()))
df.select('R_100').dtypes
[('R_100', 'string')]
The type in the Foundry dataset is set to integer, just like in your example.
Hey @ZygD,
Thanks for the details!
I think the problem comes from the fact that you are downloading the raw CSV files in your code and then reading them back via the Spark API. Downloading the raw files loses the dataset's metadata, including its schema.
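To illustrate why every column comes back as a string: a CSV file stores every value as plain text, and the type information lives only in the Foundry schema, which is not part of the raw file. A quick stdlib sketch (the column name R_100 is borrowed from your example; the values are made up):

```python
import csv
import io

# A CSV file carries no type information: every field is text.
# The integer type you see in Foundry lives in the dataset schema,
# not in the raw file itself.
raw = "R_100\n42\n7\n"
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['R_100'], ['42'], ['7']] -- the numbers come back as strings
```

So without the schema, Spark's CSV reader has nothing to go on and defaults every column to string.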
I would suggest reading the Foundry dataset as a pandas DataFrame and then creating the Spark DataFrame from it.
Something like:
from foundry.transforms import Dataset
from pyspark.sql import SparkSession

# read_table applies the Foundry schema, so the pandas columns are already typed
csv_files_with_schema = Dataset.get("csv_files_with_schema").read_table(format="pandas")

spark = SparkSession.builder.getOrCreate()
# createDataFrame infers the Spark schema from the pandas dtypes
spark_df = spark.createDataFrame(csv_files_with_schema)
I tested it on my side and it preserved the types:
spark_df.select('A').dtypes -> [('A', 'bigint')]
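As a side note on why the integer survives the round trip: createDataFrame derives the Spark types from the pandas dtypes, and pandas int64 maps to Spark bigint. A minimal pandas-only sketch (the column name A and the values are illustrative):

```python
import pandas as pd

# read_table(format="pandas") returns typed columns; pandas represents an
# integer column as int64, which Spark's createDataFrame maps to bigint.
pdf = pd.DataFrame({"A": [1, 2, 3]})
print(pdf.dtypes["A"])  # int64
```

This is also why the bigint in the output above is expected even though the Foundry schema says integer: the pandas round trip widens it to a 64-bit integer.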
Best!
