Here’s a categorized list of common narrow-dependency transformations, wide-dependency transformations, and actions available in Apache Spark:
Narrow Dependency Transformations #
Narrow dependency transformations are those where each partition of the parent RDD is used by at most one partition of the child RDD. These transformations do not require data shuffling.
- map : rdd.map(lambda x: x * 2)
- filter : rdd.filter(lambda x: x % 2 == 0)
- flatMap : rdd.flatMap(lambda x: (x, x * 2))
- mapPartitions : rdd.mapPartitions(lambda iter: (x * 2 for x in iter))
- mapPartitionsWithIndex : rdd.mapPartitionsWithIndex(lambda index, iter: ((index, x) for x in iter))
- union : rdd1.union(rdd2)
- sample : rdd.sample(False, 0.1)
- pipe : rdd.pipe('cmd')
- coalesce : rdd.coalesce(1)
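To make the "no shuffle" point concrete, here is a minimal, runnable sketch chaining several of the narrow transformations above; the local master URL, partition count, and sample data are illustrative assumptions:

```python
# Minimal sketch: chained narrow transformations run in a single stage.
# "local[*]", the partition count, and the data are illustrative.
from pyspark import SparkContext

sc = SparkContext("local[*]", "narrow-demo")

rdd = sc.parallelize(range(10), 4)           # 4 input partitions

# map and filter are narrow: each output partition depends on exactly
# one input partition, so Spark pipelines them without moving data.
result = (rdd.map(lambda x: x * 2)
             .filter(lambda x: x % 4 == 0)
             .coalesce(2))                   # merges partitions, still no shuffle

print(result.glom().collect())               # per-partition view of the output
sc.stop()
```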
Wide Dependency Transformations #
Wide dependency transformations are those where a single partition of the child RDD may depend on multiple partitions of the parent RDD. These transformations typically require shuffling data across the network.
- groupByKey : rdd.groupByKey()
- reduceByKey : rdd.reduceByKey(lambda a, b: a + b)
- sortByKey : rdd.sortByKey()
- join : rdd1.join(rdd2)
- cogroup : rdd1.cogroup(rdd2)
- groupWith : rdd1.groupWith(rdd2)
- cartesian : rdd1.cartesian(rdd2)
- distinct : rdd.distinct()
- repartition : rdd.repartition(10)
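For contrast, here is a small sketch showing the shuffle that a wide transformation introduces; again, the master URL and sample pairs are just illustrative:

```python
# Sketch: reduceByKey is a wide transformation, so it adds a shuffle
# (a stage boundary in the DAG). The data and master URL are illustrative.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wide-demo")

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("c", 1)], 4)

# All records with the same key must land in the same partition,
# which forces data movement across partitions.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())                 # e.g. [('a', 2), ('b', 1), ('c', 1)]
print(counts.toDebugString().decode())  # lineage shows a ShuffledRDD
sc.stop()
```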
Actions #
Actions trigger the execution of transformations and return results to the driver program or write data to external storage.
- collect : rdd.collect()
- count : rdd.count()
- take : rdd.take(10)
- top : rdd.top(10)
- takeOrdered : rdd.takeOrdered(10)
- saveAsTextFile : rdd.saveAsTextFile("path/to/file")
- saveAsSequenceFile : rdd.saveAsSequenceFile("path/to/file")
- saveAsObjectFile : rdd.saveAsObjectFile("path/to/file")
- countByKey : rdd.countByKey()
- foreach : rdd.foreach(lambda x: print(x))
- reduce : rdd.reduce(lambda a, b: a + b)
- aggregate : rdd.aggregate((0, 0), (lambda acc, value: (acc[0] + value, acc[1] + 1)), (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
- fold : rdd.fold(0, lambda x, y: x + y)
- saveAsHadoopFile : rdd.saveAsHadoopFile("path/to/file", "org.apache.hadoop.mapred.TextOutputFormat") (in PySpark an output format class is required)
- saveAsNewAPIHadoopFile : rdd.saveAsNewAPIHadoopFile("path/to/file", "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat")
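Because transformations are lazy, nothing runs until one of these actions is invoked. A short sketch of that behavior, with the master URL and data assumed for illustration:

```python
# Sketch: transformations only build the DAG; each action triggers a job.
# Master URL and data are illustrative.
from pyspark import SparkContext

sc = SparkContext("local[*]", "actions-demo")

doubled = sc.parallelize(range(1, 101)).map(lambda x: x * 2)  # lazy: no job yet

print(doubled.count())                     # action 1 -> 100
print(doubled.take(3))                     # action 2 -> [2, 4, 6]
print(doubled.reduce(lambda a, b: a + b))  # action 3 -> 10100
sc.stop()
```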
Spark DataFrame Transformations and Actions #
Spark DataFrames offer a higher-level API compared to RDDs. Here’s a categorized list of common transformations and actions for DataFrames:
Transformations #
- Narrow Dependency:
  - select : df.select("column1", "column2")
  - filter : df.filter(df["column"] > 100)
  - withColumn : df.withColumn("new_col", df["existing_col"] * 2)
  - drop : df.drop("column")
  - distinct : df.distinct()
  - sample : df.sample(False, 0.1)
- Wide Dependency:
  - groupBy : df.groupBy("column").count()
  - join : df1.join(df2, "column")
  - union : df1.union(df2)
  - repartition : df.repartition(10)
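The same narrow/wide distinction applies to DataFrames. A hedged sketch combining both kinds of transformations; the SparkSession setup, column names, and rows are illustrative assumptions:

```python
# Sketch: narrow DataFrame transformations pipeline in one stage;
# groupBy introduces an Exchange (shuffle) in the physical plan.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 120), ("bob", 80), ("alice", 200)],
    ["name", "amount"],
)

result = (df.filter(df["amount"] > 100)               # narrow
            .withColumn("doubled", df["amount"] * 2)  # narrow
            .groupBy("name").count())                 # wide

result.explain()   # look for "Exchange" in the plan: that is the shuffle
spark.stop()
```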
Actions #
- show : df.show()
- collect : df.collect()
- count : df.count()
- take : df.take(10)
- head : df.head(10)
- write : df.write.csv("path/to/file")
- foreach : df.foreach(lambda row: print(row))
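And a short sketch of DataFrame actions in use; the session setup and output path are illustrative assumptions:

```python
# Sketch: DataFrame actions bring results to the driver or write them out.
# The output path is illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("df-actions").getOrCreate()

df = spark.range(1, 6).withColumn("squared", F.col("id") * F.col("id"))

df.show()                 # renders the rows as a table on the driver
print(df.count())         # 5
print(df.collect()[0])    # Row(id=1, squared=1)

# write only executes once a format method like csv() is called
df.write.mode("overwrite").csv("/tmp/df-actions-demo")
spark.stop()
```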
Summary #
Understanding the differences between narrow and wide dependencies, as well as the available actions, helps in designing efficient Spark applications. Narrow dependencies avoid data shuffling entirely, whereas wide dependencies require an expensive shuffle across the network and should be minimized where possible. Because transformations are lazy, an action is always needed to trigger execution and materialize results.