Has anyone faced this before? How do I set JAVA_HOME, and what should I point it to? And how do I convert a pandas dataframe to Spark SQL for use in a Jupyter notebook?
Is there a reason you need to do this in a Jupyter notebook? A Code Workbook would be a better option: Code Workbooks are interactive like Jupyter but are backed by Spark for code execution, and they have good Spark SQL support.
You can follow these instructions to install the necessary dependencies for Spark in your Jupyter notebook in Foundry.
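As for JAVA_HOME: Spark needs it to point at the root of a JDK installation (the directory containing bin/java), not at the java binary itself. A minimal sketch of setting it from inside the notebook, assuming a JDK is already installed in your container (the path below is hypothetical; adjust it to wherever Java lives in your environment):

import os

# JAVA_HOME must be the JDK root, not bin/java. This path is an
# example only; check your container for the actual location.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]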
You can then set up a local Spark environment using the code below. (Note that this is not a distributed Spark environment, so your uncompressed data will have to fit into the memory of your container.)
from foundry.transforms import Dataset
from pyspark.sql import SparkSession
# Create a local Spark session.
spark = (
    SparkSession.builder
    .appName("ParquetDemo")
    .master("local[*]")
    .config("spark.driver.memory", "16g")    # Driver memory; in local mode all work runs in the driver JVM.
    .config("spark.executor.memory", "16g")  # Executor memory (has no effect in local mode, but harmless to set).
    .getOrCreate()
)
sc = spark.sparkContext  # Only needed if you use the RDD API.
# Download a Foundry dataset of parquet files into the notebook.
parquet_foundry = Dataset.get("parquet_foundry_dataset")
files = parquet_foundry.files().download()
# Read the files into a Spark dataframe. download() returns a mapping
# of dataset file paths to local paths, so pass the local paths to the reader.
spark_df = spark.read.parquet(*files.values())
# Apply your PySpark transformations here, ending with a dataframe
# named df_processed.
...
# Write your output back to a Foundry dataset.
spark_output = Dataset.get("spark_output")
# toArrow() requires PySpark 4.0+; on older versions, converting with
# df_processed.toPandas() should work as well.
spark_output.write_table(df_processed.toArrow())
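To answer the pandas part of the original question: once the local Spark session above exists, you can convert a pandas dataframe into a Spark dataframe and query it with Spark SQL. A minimal sketch (pandas_df, the people view, and the query are made-up names for illustration):

import pandas as pd

# Any pandas dataframe works here; this one is just an example.
pandas_df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 29]})

# Convert it to a Spark dataframe and register it as a temporary SQL view.
people_df = spark.createDataFrame(pandas_df)
people_df.createOrReplaceTempView("people")

# Now you can run plain Spark SQL against it.
adults = spark.sql("SELECT name FROM people WHERE age >= 30")
adults.show()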