Spark SQL in Jupyter notebook

When I try to use Spark SQL in a Jupyter notebook for debugging purposes, I am not able to create a Spark session because JAVA_HOME is not set:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PandasToSpark").getOrCreate()

Has anyone faced this before? How do I set JAVA_HOME, and what should I point it to? And how do I convert a pandas DataFrame to a Spark DataFrame so I can use Spark SQL on it in a Jupyter notebook?

Is there a reason you need to do this in a Jupyter notebook? A Code Workbook would be a better option: Code Workbooks are interactive like Jupyter but are backed by Spark for code execution, and they have good Spark SQL support.

Hi @abudaniel,

You can follow these instructions to install the necessary dependencies for Spark in your Jupyter notebook in Foundry.
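If JAVA_HOME is still not picked up after installing the dependencies, you can point it at the installed JDK from within the notebook before creating the session. A minimal sketch; the path below is a placeholder, and the actual value depends on where the JDK is installed in your environment:

import os

# Point JAVA_HOME at the installed JDK (placeholder path; check your environment).
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk"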

You can then set up a local Spark environment using the code below. (Note that this is not a distributed Spark environment, so your uncompressed data will have to fit into the memory of your container.)

from foundry.transforms import Dataset
from pyspark.sql import SparkSession

# Create a local Spark session.
spark = (SparkSession.builder
    .appName("ParquetDemo")
    .master("local[*]")
    .config("spark.driver.memory", "16g")  # Define the driver memory.
    .config("spark.executor.memory", "16g")  # Optionally define the executor memory.
    .getOrCreate())
sc = spark.sparkContext  # Grab the underlying Spark context if you need it.

# Download a Foundry dataset of parquet files into the notebook.
parquet_foundry = Dataset.get("parquet_foundry_dataset")
files = parquet_foundry.files().download()

# Read the files into a Spark dataframe.
spark_df = spark.read.parquet(*list(files.values()))

# Apply PySpark transformations to produce the output dataframe.
df_processed = spark_df  # Replace with your own transformation logic.

# Write your output dataset.
spark_output = Dataset.get("spark_output")
spark_output.write_table(df_processed.toArrow())
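
To answer the pandas part of the question: once the session exists, you can convert a pandas DataFrame with spark.createDataFrame and register it as a temp view to query with Spark SQL. A minimal sketch (pdf and my_table are hypothetical names for illustration):

import pandas as pd

# Hypothetical pandas dataframe for illustration.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# Convert to a Spark dataframe and register it for Spark SQL.
sdf = spark.createDataFrame(pdf)
sdf.createOrReplaceTempView("my_table")

# Query it with Spark SQL.
spark.sql("SELECT id, value FROM my_table WHERE value > 15").show()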