Here are ten commonly asked interview questions on Spark DataFrame operations, each with a worked solution in PySpark:
Question 1: How do you create a DataFrame in Spark from a collection of data?
Solution:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()
# Sample data
data = [("John", 25), ("Doe", 30), ("Jane", 28)]
# Create DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Show DataFrame
df.show()
# Note: the remaining examples reuse this session and DataFrame,
# so call spark.stop() only after you are done with all of them
Question 2: How do you select specific columns from a DataFrame?
Solution:
# Select specific columns
selected_df = df.select("name", "age")
# Show DataFrame
selected_df.show()
Question 3: How do you filter rows in a DataFrame based on a condition?
Solution:
# Filter rows where age is greater than 25
filtered_df = df.filter(df["age"] > 25)
# Show DataFrame
filtered_df.show()
Question 4: How do you group by a column and perform an aggregation in Spark DataFrame?
Solution:
# Sample data
data = [("John", "HR", 3000), ("Doe", "HR", 4000),
("Jane", "IT", 5000), ("Mary", "IT", 6000)]
# Create DataFrame
columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)
# Group by department and calculate average salary
avg_salary_df = df.groupBy("department").avg("salary")
# Show the result
avg_salary_df.show()
Question 5: How do you join two DataFrames in Spark?
Solution:
# Sample data
data1 = [("John", 1), ("Doe", 2), ("Jane", 3)]
data2 = [(1, "HR"), (2, "IT"), (3, "Finance")]
# Create DataFrames
columns1 = ["name", "dept_id"]
columns2 = ["dept_id", "department"]
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)
# Join DataFrames on dept_id
joined_df = df1.join(df2, "dept_id")
# Show the result
joined_df.show()
Question 6: How do you handle missing data in Spark DataFrame?
Solution:
# Sample data
data = [("John", None), ("Doe", 25), ("Jane", None), ("Mary", 30)]
# Create DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Fill missing values with a default value
df_filled = df.fillna({'age': 0})
# Show the result
df_filled.show()
Question 7: How do you apply a custom function to a DataFrame column using UDF?
Solution:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Define a UDF that converts a department name to uppercase
def convert_uppercase(department):
    return department.upper() if department is not None else None
# Register UDF
convert_uppercase_udf = udf(convert_uppercase, StringType())
# Apply the UDF to the department DataFrame from Question 4
df_transformed = df.withColumn("department_upper", convert_uppercase_udf(df["department"]))
# Show the result
df_transformed.show()
Question 8: How do you sort a DataFrame by a specific column?
Solution:
# Sort DataFrame by age
sorted_df = df.orderBy("age")
# Show the result
sorted_df.show()
Question 9: How do you add a new column to a DataFrame?
Solution:
# Add a new column computed from an existing column
df_with_new_column = df.withColumn("new_column", df["age"] * 2)
# Show the result
df_with_new_column.show()
Question 10: How do you remove duplicate rows from a DataFrame?
Solution:
# Sample data with duplicates
data = [("John", 25), ("Doe", 30), ("Jane", 28), ("John", 25)]
# Create DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Remove duplicate rows
df_deduplicated = df.dropDuplicates()
# Show the result
df_deduplicated.show()
These questions and solutions cover the core DataFrame operations — creation, selection, filtering, aggregation, joins, null handling, and deduplication — that underpin most data processing and analysis work with Spark.