PySpark Logging #
Introduction #
Logging is an essential aspect of any software application, including big data processing frameworks like PySpark. It helps developers understand, debug, and monitor applications by recording events, errors, and other important information. PySpark offers flexible logging capabilities through Python’s built-in logging module and Spark’s internal logging system.
In this document, we will discuss how to configure and use logging in PySpark for both local and cluster environments.
1. Why Use Logging in PySpark? #
- Debugging: Logs can help identify the root cause of errors in distributed environments.
- Monitoring: Logs provide a history of the actions and status of your application.
- Performance Tuning: Logs help in tracking resource usage and application bottlenecks.
- Alerting: Logs can be configured to send alerts if specific events or errors occur.
2. Logging with Python’s logging Module in PySpark #
The Python logging module provides a flexible framework for logging messages. These messages can be directed to different output destinations, such as the console or files, and the level of logging can be controlled.
Example 1: Basic Python Logging in PySpark #
import logging
from pyspark.sql import SparkSession
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark Logging Example").getOrCreate()
# Example log messages
logger.info("Spark session has been created.")
# Create a DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["Name", "ID"])
# Log DataFrame creation
logger.info("DataFrame has been created with %d records.", df.count())
# Perform an operation and log the result
df_filtered = df.filter(df.ID > 1)
logger.info("Filtered DataFrame with %d records.", df_filtered.count())
# Show the filtered DataFrame
df_filtered.show()
# Stop Spark session
spark.stop()
logger.info("Spark session stopped.")
Explanation: #
- The logging module is configured using logging.basicConfig().
- Logs are generated using the logger object, which logs messages at different levels (info, warning, error, etc.).
- In this example, logs are printed to the console.
3. Logging Levels in Python’s Logging Module #
Python’s logging module supports different logging levels, each representing the severity of the log message.
- DEBUG: Detailed information, typically of interest only when diagnosing problems.
- INFO: Confirmation that things are working as expected.
- WARNING: An indication that something unexpected happened or indicative of some problem.
- ERROR: A more serious problem that affects program execution.
- CRITICAL: A very serious error that may cause the program to stop.
Setting Logging Level: #
You can set the logging level by modifying the logging.basicConfig(level=logging.LEVEL) configuration, where LEVEL is one of the levels above.
For example, to log only errors and more critical messages:
logging.basicConfig(level=logging.ERROR)
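For example, with the threshold set to WARNING, DEBUG and INFO messages are filtered out while WARNING, ERROR, and CRITICAL messages still appear. A minimal sketch using only the standard library:
import logging
# Only messages at WARNING or above pass the threshold
logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)
logger.debug("Not shown: below the WARNING threshold.")
logger.info("Not shown: below the WARNING threshold.")
logger.warning("Shown: something unexpected happened.")
logger.error("Shown: a more serious problem occurred.")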
4. Advanced Logging Configuration #
Example 2: Logging to a File #
You can configure the logger to send logs to a file by specifying a filename in basicConfig().
import logging
from pyspark.sql import SparkSession
# Configure logging to a file
logging.basicConfig(filename='pyspark_logs.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark Logging to File").getOrCreate()
logger.info("Spark session has been created.")
# Create a DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["Name", "ID"])
logger.info("DataFrame has been created with %d records.", df.count())
# Stop Spark session
spark.stop()
logger.info("Spark session stopped.")
In this example, the log messages are saved to the pyspark_logs.log file in the current working directory, and each log entry includes a timestamp, the severity level, and the log message.
Log Format: #
You can customize the log format using the format argument in basicConfig. For example, %(asctime)s adds a timestamp, %(levelname)s adds the logging level, and %(message)s includes the actual log message.
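As a small sketch, the format string can also include the logger name via %(name)s (the logger name pyspark_app below is arbitrary):
import logging
# Include the logger name alongside the timestamp, level, and message
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger("pyspark_app")
logger.info("Custom log format in action.")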
5. Integrating with PySpark’s Internal Logging #
PySpark itself relies on Log4j, Spark’s internal logging framework, which can be configured alongside the Python logging module so that messages from Spark’s internals are handled consistently.
Example 3: Configuring PySpark’s Log4j Properties #
You can configure PySpark’s internal logging through the Log4j properties file (log4j.properties). This file controls the verbosity of Spark’s logs in both local and cluster environments.
- For local mode, this file can be found in the $SPARK_HOME/conf directory.
- For cluster mode (YARN, Mesos, etc.), the file can be placed on HDFS or another distributed file system and referenced in your spark-submit command.
The default Log4j configuration is provided in the $SPARK_HOME/conf/log4j.properties.template file. You can rename it to log4j.properties and edit it to modify Spark’s log level.
# Set everything to be logged at the INFO level
log4j.rootCategory=INFO, console
# Set the default log level for Spark components
log4j.logger.org.apache.spark=INFO
log4j.logger.org.apache.hadoop=ERROR
You can apply this configuration to your PySpark application by pointing to the log4j properties file through a spark-submit option:
spark-submit --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties" your_script.py
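The option above configures only the driver JVM. For Spark versions that read log4j.properties, one common pattern for applying the same configuration to the executors (a sketch; all paths are placeholders) is to ship the file with --files and reference it in the executor options as well:
spark-submit \
  --master yarn \
  --files /path/to/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  your_script.py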
Example 4: Controlling Spark’s Log Level from Python #
Alternatively, you can control the logging level programmatically in Python without editing log4j.properties.
spark.sparkContext.setLogLevel("WARN")
This sets the logging level of PySpark’s internal logs to WARN.
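A minimal sketch of where this call typically goes, immediately after creating the session and before running any jobs:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Quiet Spark Logs").getOrCreate()
# Suppress Spark's INFO chatter; only WARN and above will be printed
spark.sparkContext.setLogLevel("WARN")
df = spark.range(5)
df.show()
spark.stop()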
6. Logging in PySpark Cluster Mode (YARN) #
When running PySpark applications on a YARN cluster, logs from both the driver and the executors are collected by YARN and can be accessed through the ResourceManager web UI.
Accessing YARN Logs #
- Submit your PySpark job:
spark-submit --master yarn your_script.py
- Access the ResourceManager web UI (usually on port 8088).
- Find your application in the list and click on the link to view its logs.
These logs contain messages from both PySpark and the Python logging module.
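You can also fetch the aggregated logs from the command line once the application finishes, assuming YARN log aggregation is enabled (the application ID below is a placeholder; the real one appears in the ResourceManager UI):
yarn logs -applicationId application_1234567890123_0001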
Example 5: Logging Configuration for Cluster Mode #
import logging
from pyspark.sql import SparkSession
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Initialize SparkSession with cluster mode settings
spark = SparkSession.builder \
    .appName("PySpark Cluster Logging") \
    .config("spark.executor.memory", "1g") \
    .config("spark.executor.instances", "2") \
    .getOrCreate()
logger.info("Spark session created in cluster mode.")
# Create a DataFrame
data = [("Alice", 1), ("Bob", 2)]
df = spark.createDataFrame(data, ["Name", "ID"])
logger.info("DataFrame created with %d records.", df.count())
# Stop Spark session
spark.stop()
logger.info("Spark session stopped.")
7. Summary of Best Practices #
- Use Python’s logging module for logging custom events, and integrate it with PySpark’s logging framework for Spark-specific events.
- Configure log levels properly. Use INFO for general operations and DEBUG when troubleshooting.
- For cluster environments, ensure logs are captured centrally (e.g., via YARN logs) to monitor the entire distributed system.
- Avoid excessive logging in high-volume operations to minimize the performance impact, as illustrated in the sketch below.
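For instance, a log call whose arguments trigger a Spark action (such as df.count()) runs a full job even if the message is later filtered out by the log level. A sketch of guarding such a call:
import logging
from pyspark.sql import SparkSession
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
spark = SparkSession.builder.appName("Cheap Logging").getOrCreate()
df = spark.range(1000000)
# df.count() triggers a full Spark job, so only pay for it when DEBUG is actually enabled
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Row count: %d", df.count())
logger.info("Processing finished.")
spark.stop()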