Spark Dependencies: Narrow and Wide Transformations #
In Spark, transformations fall into two types based on how partitions of the child RDD depend on partitions of the parent RDD: narrow dependencies and wide dependencies. Understanding these dependencies is crucial for optimizing Spark applications and ensuring efficient data processing.
Narrow Dependency Transformations #
In a narrow dependency, each parent RDD partition is used by at most one partition of the child RDD. This means that the data flow is more straightforward and does not require shuffling. Examples include `map`, `filter`, and `flatMap`.
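To make this concrete, here is a minimal stand-alone sketch of narrow transformations on an RDD (the app name, values, and partition count are made up for illustration):

from pyspark.sql import SparkSession
# Stand-alone setup for the sketch
spark = SparkSession.builder.appName("NarrowDemo").getOrCreate()
sc = spark.sparkContext
nums = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=2)
# map: each element is transformed in place, one output partition per input partition
doubled = nums.map(lambda x: x * 2)
# filter: rows are dropped within each partition
evens = doubled.filter(lambda x: x % 4 == 0)
# flatMap: each element expands to zero or more elements within its partition
expanded = nums.flatMap(lambda x: [x, x * 10])
# All three are narrow: no data moves between partitions, and no job runs until an action is called.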
Wide Dependency Transformations #
In a wide dependency, multiple child RDD partitions may depend on the data from multiple parent RDD partitions. This often necessitates shuffling data across the network, which is more expensive in terms of time and resources. Examples include `groupByKey`, `reduceByKey`, and `join`.
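For a concrete sketch of wide transformations, consider the following pair RDDs (again, the app name and values are made up for illustration):

from pyspark.sql import SparkSession
# Stand-alone setup for the sketch
spark = SparkSession.builder.appName("WideDemo").getOrCreate()
sc = spark.sparkContext
sales = sc.parallelize([("ProductA", 100), ("ProductB", 200), ("ProductA", 150)])
prices = sc.parallelize([("ProductA", 9.99), ("ProductB", 19.99)])
# groupByKey: all values for a key must end up in the same partition
grouped = sales.groupByKey()
# reduceByKey: partial sums within each partition, then a shuffle by key
totals = sales.reduceByKey(lambda x, y: x + y)
# join: records from both RDDs are co-located by key
joined = totals.join(prices)
# Each of these repartitions data by key, so records may move across the network.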
Actions #
Actions trigger the execution of the transformations. They return a value to the driver program or write data to an external storage system. Examples of actions include `collect()`, `count()`, `saveAsTextFile()`, and `show()`.
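The following sketch shows each of these actions on a small DataFrame (the values and the output path are hypothetical):

from pyspark.sql import SparkSession
# Stand-alone setup for the sketch
spark = SparkSession.builder.appName("ActionsDemo").getOrCreate()
df = spark.createDataFrame([("ProductA", 100), ("ProductB", 200)], ["Product", "Sales"])
print(df.count())    # count(): returns the number of rows to the driver
print(df.collect())  # collect(): returns every row to the driver as a list of Row objects
df.show()            # show(): prints a small tabular preview on the driver
# saveAsTextFile(): writes each partition to storage (fails if the path already exists)
df.rdd.saveAsTextFile("/tmp/sales_demo_output")

Each action triggers its own Spark job.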
Example with Detailed Explanation #
Let’s illustrate these concepts with an example involving a sales dataset.
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()
# Sample sales data
sales_data = [
    ("2023-01-01", "ProductA", 100),
    ("2023-01-01", "ProductB", 200),
    ("2023-01-02", "ProductA", 150),
    ("2023-01-02", "ProductB", 250),
    ("2023-01-03", "ProductA", 200),
    ("2023-01-03", "ProductB", 300)
]
# Create DataFrame
columns = ["Date", "Product", "Sales"]
sales_df = spark.createDataFrame(sales_data, schema=columns)
# Perform some transformations and actions
# Narrow Dependency Transformation: filter
filtered_df = sales_df.filter(sales_df.Sales > 150)
# Narrow Dependency Transformation: map (applied to the DataFrame's underlying RDD)
mapped_rdd = filtered_df.rdd.map(lambda row: (row.Product, row.Sales))
# Wide Dependency Transformation: reduceByKey
aggregated_rdd = mapped_rdd.reduceByKey(lambda x, y: x + y)
# Action: collect
result = aggregated_rdd.collect()
# Print the result
print(result)
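# Expected output (tuple order may vary across runs): [('ProductA', 200), ('ProductB', 750)]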
# Stop the Spark session
spark.stop()
Detailed Breakdown #
Narrow Dependency Transformations #
- Filter Transformation: `filtered_df = sales_df.filter(sales_df.Sales > 150)`
  - Dependency: Narrow
  - Explanation: Each partition of `sales_df` is processed independently. Rows with `Sales` greater than 150 are kept. This operation does not require shuffling data between partitions.
- Map Transformation: `mapped_rdd = filtered_df.rdd.map(lambda row: (row.Product, row.Sales))`
  - Dependency: Narrow
  - Explanation: Each partition of `filtered_df` is processed independently. Each row is mapped to a tuple `(Product, Sales)`. Again, no shuffling is needed; the partition-count check after this list shows one way to confirm that.
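One way to confirm that these steps are narrow is to compare partition counts before and after; a minimal check, continuing the example above (the actual counts depend on your cluster defaults):

# Narrow transformations preserve the parent's partitioning, so the counts match.
print(sales_df.rdd.getNumPartitions())
print(filtered_df.rdd.getNumPartitions())
print(mapped_rdd.getNumPartitions())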
Wide Dependency Transformations #
- ReduceByKey Transformation: `aggregated_rdd = mapped_rdd.reduceByKey(lambda x, y: x + y)`
  - Dependency: Wide
  - Explanation: This transformation requires shuffling data so that all records with the same `Product` key are sent to the same partition. Each partition of the parent RDD can contribute to multiple partitions of the child RDD; the DataFrame sketch after this list shows the same shuffle in a physical plan.
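For comparison, the same aggregation written with the DataFrame API also requires a shuffle; a hedged sketch, continuing the example above (the exact plan text varies by Spark version):

from pyspark.sql import functions as F
# DataFrame equivalent of the reduceByKey step; it also shuffles rows by Product.
totals_df = filtered_df.groupBy("Product").agg(F.sum("Sales").alias("TotalSales"))
# The physical plan typically contains an Exchange (shuffle) node for the grouping.
totals_df.explain()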
Actions #
- Collect Action: `result = aggregated_rdd.collect()`
  - Action: Collects the results from `aggregated_rdd` and brings them to the driver program. This triggers the execution of the previously defined transformations.
Execution Plan #
When `collect()` is called, Spark creates an execution plan with the following stages:
- Stage 1:
  - Transformations: filter and map
  - Tasks: Operate on partitions of the original DataFrame to filter and map rows.
- Stage 2:
  - Transformation: reduceByKey
  - Tasks: Shuffle data to group by `Product` and then reduce to calculate total sales (this shuffle is the stage boundary visible in the lineage check below).
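To see this stage boundary, you can print the RDD lineage; a minimal check, continuing the example above (the exact output varies by Spark version):

# toDebugString() returns bytes in PySpark, hence the decode.
# The ShuffledRDD in the output marks the boundary introduced by reduceByKey.
print(aggregated_rdd.toDebugString().decode("utf-8"))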
Summary #
- Narrow Dependency Transformations (`filter`, `map`): Operate on partitions independently without shuffling data.
- Wide Dependency Transformations (`reduceByKey`): Require shuffling data across the network.
- Actions (`collect`): Trigger the execution of transformations and return results to the driver program.
Understanding these concepts helps optimize Spark jobs by minimizing expensive shuffling operations and effectively managing the execution plan.