PySpark Review
These are some snippets from the Data Analysis with PySpark Book! It’s been a couple of months since I reviewed it, so I wanted to keep notes to help me remember.
Basic Concepts:
- Fundamentally, the idea behind PySpark is to scale out across more, smaller machines instead of scaling up a single machine, since this is often more cost-efficient.

- When an application is launched by a driver program, the cluster manager is responsible for allocating resources across the cluster to support it.
- The available capacity and resource information is stored in the SparkContext, which acts as the connection to the Spark cluster.
- In the illustrated “data factory” metaphor, the worker nodes (or simply, workers) are the work desks/stations that provide compute and memory resources.
- The executors are like the workers at each station — they carry out the actual tasks assigned to them.
- The master node is represented by the figure wearing the hat — think of it as the factory manager or owner. It allocates tasks to each workstation and oversees the workflow across the factory.
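- As a concrete (and minimal) sketch of the pieces above, you can peek at the information the SparkContext holds once a session exists; the app name here is just illustrative, and creating the SparkSession itself is covered in the Initialization section below.
# A minimal sketch, assuming a local SparkSession (see Initialization below)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("inspect-context").getOrCreate()
sc = spark.sparkContext
print(sc.master)              # where the cluster manager lives, e.g. "local[*]"
print(sc.appName)             # the name of this application
print(sc.defaultParallelism)  # default number of parallel tasks
print(sc.uiWebUrl)            # URL of the Spark UI for watching executors at work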
Lazy Execution:
- The way I understand lazy execution in Spark is that there are two types of operations: transformations and actions.
- Transformations include operations like adding columns, aggregations, or generating summary statistics — these define what to do, but don’t execute immediately.
- Actions are operations that trigger execution, such as writing to disk, collecting results, or showing data (e.g., .show(), .collect()).
- The “lazy” part of Spark means that it doesn’t actually execute any transformations until an action is called.
- For example, you could add a thousand columns, but until you perform an action, Spark simply records those operations as a plan — nothing is computed yet.
- Why is this good? Because storing a plan of instructions requires much less memory than executing each step immediately.
- It also allows Spark to optimize the execution plan (e.g., reordering operations for efficiency) before running anything.
- Lastly, this laziness makes it possible to reuse transformation plans, improving consistency and potentially reducing redundant work.
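- As a rough sketch of this behavior (assuming a local SparkSession and a toy DataFrame with made-up column names), the calls below only build a plan until an action such as .show() runs:
# A minimal, hypothetical sketch of lazy execution
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
df = spark.range(5)  # small DataFrame with a single "id" column: 0..4
# Transformations: Spark only records these as a plan, nothing runs yet
plan = df.withColumn("doubled", col("id") * 2).filter(col("doubled") > 4)
plan.explain()  # inspect the (optimized) plan without executing anything
plan.show()     # an action: only now are the transformations actually computed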
PySpark Code:
Initialization
- The entry point to Spark is the SparkSession, which lets us interact with the Spark cluster.
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = (
    SparkSession.builder  # <-- Builder Pattern Abstraction!
    .appName("Some Name!")
    .getOrCreate()  # works in both batch & interactive mode
)
# Set logging level to 'FATAL' to avoid unnecessary log output
spark.sparkContext.setLogLevel("FATAL")
EDA of data
- For text files, reading is straightforward: spark.read.text() takes the file path.
- The data is read with each line of the file becoming one row of the resulting DataFrame.
- We can confirm this from the DataFrame's schema, which shows a single string-valued column.
- .show() displays the actual data; its truncate parameter controls how many characters of each value are displayed.
data = spark.read.text("../../data.txt")
# Show schema
data.printSchema()
#root
# |-- value: string (nullable = true)
# Display the first few rows
data.show(5, truncate=40)
#+----------------------------------------+
#| value|
#+----------------------------------------+
#|The Project Gutenberg EBook of Pride ...|
#| |
#|This eBook is for the use of anyone a...|
#|almost no restrictions whatsoever. Y...|
#+----------------------------------------+
Analyze text data
- To analyze text data, we want the words themselves, so we split each line on spaces.
- Very similar to SQL, we use select() to execute operations on the DataFrame.
- Note that .alias() renames the column produced by the split, which we can confirm by viewing the new DataFrame.
from pyspark.sql.functions import split
# Split the text into words
lines = data.select(split(data.value, " ").alias("line"))
# Display the first few rows of the split data
lines.show(5)
#+--------------------+
#| line|
#+--------------------+
#|[The, Project, Gu...|
#| []|
#|[This, eBook, is,...|
#|[almost, no, rest...|
#|[re-use, it, unde...|
#+--------------------+
- We can also see that the schema changed from a single string column to a column called line containing an array of strings.
lines.printSchema()
#root
# |-- line: array (nullable = true)
# | |-- element: string (containsNull = false)
- To get individual words out of the arrays of strings, we use the explode() function, which turns every element of each array into its own row in the DataFrame.
- Notice how the schema now shows just a single column of words.
from pyspark.sql.functions import explode, col
words = lines.select(explode(col("line")).alias("word"))
words.show(10)
#+----------+
#| word|
#+----------+
#| The|
#| Project|
#| Gutenberg|
#| EBook|
#| of|
#| Pride|
#| and|
#|Prejudice,|
#| by|
#| Jane|
#+----------+
#only showing top 10 rows
words.printSchema()
#root
# |-- word: string (nullable = false)
- After splitting and exploding the words, we often need to clean the data by normalizing the text, such as converting all words to lowercase and removing unwanted characters.
- To convert to lowercase, we use lower(). To remove non-alphabetic characters, we use regexp_extract(), which extracts the characters matching a regular expression.
from pyspark.sql.functions import lower, regexp_extract
# Convert words to lowercase
words_lower = words.select(lower(col("word")).alias("word_lower"))
# Remove non-alphabetic characters
words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)
# Show the cleaned words
words_clean.show(5)
#+---------+
#| word|
#+---------+
#| the|
#| project|
#|gutenberg|
#| ebook|
#| of|
#+---------+
#only showing top 5 rows
- After cleaning, we may have some empty strings or special characters that we don’t want to analyze.
- We can filter these out by applying a condition to remove empty strings.
# Filter out empty words
words_nonull = words_clean.filter(col("word") != "")
- Now that we have clean words, we can group them and count their occurrences using groupBy() and count().
- We can also sort the results by frequency to find the most common words in the text.
# Group words and count their occurrences
word_counts = words_nonull.groupby(col("word")).count()
# Show the most frequent words
word_counts.orderBy(col("count").desc()).show(10)
#+----+-----+
#|word|count|
#+----+-----+
#| the| 4205|
#| to| 4121|
#| of| 3662|
#| and| 3309|
#| | 2776|
#| a| 1944|
#| her| 1858|
#| in| 1813|
#| was| 1795|
#| I| 1740|
#+----+-----+
- Once we have analyzed the word counts, we may want to save the results for further analysis or for use in other systems.
# Save the word counts to a CSV file
word_counts.write.csv("word_counts.csv")
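- One thing to keep in mind: write.csv() produces a directory of part files rather than a single CSV. A minimal sketch of a more controlled write (the output path and options here are just illustrative) could look like this:
# Sketch: coalesce to one partition so the output directory holds a single part file,
# overwrite any previous run, and include a header row
(
    word_counts
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", True)
    .csv("word_counts_csv")
)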
Refactoring what we have so far
- So, we took text data and transformed it into word counts. However, the code spans many separate statements and gets a bit hard to follow. A better approach is to chain the commands for readability.
- Below is essentially the same pipeline we have gone through, chained into a single results DataFrame; the one difference is that this version counts the first letter of each word (via substring()) rather than the whole word.
from pyspark.sql import functions as F

results = (
    spark.read.text("../../data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("lower_word"))
    .select(F.regexp_extract(F.col("lower_word"), "[a-z']*", 0).alias("clean_word"))
    .filter(F.col("clean_word") != "")
    .select(F.substring(F.col("clean_word"), 0, 1).alias("char"))
    .groupby(F.col("char"))
    .count()
    .orderBy(F.col("count").desc())
)
- This was a very brief introduction to PySpark, and I think it’s quite straightforward for anyone new to the tool. Next time I write about PySpark, I plan to focus more on working with it in the context of an actual project. Thanks for reading!