Building an ETL Pipeline with Apache Spark for CSV to MySQL Data Integration

Harun Mwenda Mbaabu
3 min read · Jul 13, 2023

In the world of data engineering, Extract, Transform, Load (ETL) pipelines play a crucial role in moving and transforming data from various sources into a desired destination. Apache Spark, a powerful distributed computing framework, provides excellent capabilities for processing large datasets and building scalable ETL pipelines. In this article, we will explore how to use Apache Spark to read data from a CSV file, perform transformations, and load it into a MySQL database.

Prerequisites

Before we dive into the ETL pipeline implementation, make sure you have the following components set up:

  1. Apache Spark: Install and configure Apache Spark on your machine or cluster.
  2. MySQL: Set up a MySQL database with appropriate credentials and access rights.
  3. MySQL JDBC driver: Download the MySQL JDBC driver and include it in your Spark environment (one way to do this from within the script is shown under Step 1 below).

Step 1: Setting up the SparkSession

The first step is to create a SparkSession, the entry point for working with Spark APIs. We configure the SparkSession with the desired settings and create an instance to work with.
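
If you prefer not to manage the JDBC driver jar yourself, one option is to let Spark fetch MySQL Connector/J from Maven Central when the session is created (you can equally pass the jar to spark-submit with --jars or --packages). This is only a sketch; the connector version shown is an assumption, so match it to your MySQL server.

from pyspark.sql import SparkSession

# Sketch: pull MySQL Connector/J from Maven Central at session start-up.
# The version here is an assumption; use one compatible with your server.
spark = (
    SparkSession.builder
    .appName("CSV to MySQL ETL Pipeline")
    .config("spark.jars.packages", "com.mysql:mysql-connector-j:8.0.33")
    .getOrCreate()
)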

Step 2: Reading the CSV Data

Using the SparkSession, we can read the CSV data into a DataFrame. Specify the path to the CSV file and, if the file has a header row, set the header parameter to True. With inferSchema set to True, Spark will also infer the column types from the data; otherwise every column is read as a string.
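
Schema inference costs an extra pass over the file and can guess types incorrectly, so if you know the layout up front you can declare it explicitly. The column names and types below are made up for illustration; replace them with your file's actual columns.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Hypothetical schema for illustration only.
schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", DoubleType(), True),
    StructField("created_at", DateType(), True),
])

# Read the CSV with the explicit schema instead of inferring it.
df = spark.read.csv("data/file.csv", header=True, schema=schema)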

Step 3: Data Transformation

Once the CSV data is loaded, we can apply any necessary transformations to prepare it for the MySQL database. Spark provides a rich set of functions to manipulate and transform data. Use the DataFrame’s select method to select the desired columns and apply any required transformations, such as renaming columns or changing data types.
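
For example, renaming a column, casting another to a numeric type, and trimming whitespace might look like the sketch below; the column names are placeholders, not columns from any real dataset.

from pyspark.sql.functions import col, trim

# Hypothetical transformations: rename, cast to double, and strip whitespace.
transformed_df = (
    df.withColumnRenamed("column1", "new_column1")
      .withColumn("column2", col("column2").cast("double"))
      .withColumn("new_column1", trim(col("new_column1")))
)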

Step 4: Configuring MySQL Connection

To establish a connection to the MySQL database, we need to provide the connection details, including the host, port, database name, username, and password. Make sure to replace the placeholders in the code with your MySQL configuration. Additionally, ensure that the MySQL JDBC driver is available on the classpath.
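
Hard-coding credentials in the script is risky; one option is to read them from environment variables set by your shell or scheduler. This is a sketch, and the variable names below are assumptions.

import os

# Hypothetical environment variable names; set them before running the job.
mysql_host = os.environ.get("MYSQL_HOST", "localhost")
mysql_port = os.environ.get("MYSQL_PORT", "3306")
mysql_database = os.environ["MYSQL_DATABASE"]
mysql_table = os.environ["MYSQL_TABLE"]
mysql_user = os.environ["MYSQL_USER"]
mysql_password = os.environ["MYSQL_PASSWORD"]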

Step 5: Writing Data to MySQL

Using the DataFrame’s write method, we can save the transformed data directly to the MySQL database. Specify the JDBC format, along with the connection details and table name. Additionally, set the mode parameter to determine how to handle existing data in the MySQL table (e.g., overwrite, append, or ignore).
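
As a variation on the overwrite used in the full script below, appending to an existing table and tuning the write might look like this sketch. It reuses the connection variables defined in the script; batchsize and numPartitions are standard Spark JDBC writer options, and the values shown are purely illustrative.

# Append new rows instead of replacing the table; tune batch size and
# parallelism for your data volume (values here are illustrative).
(transformed_df.write
    .format("jdbc")
    .option("url", f"jdbc:mysql://{mysql_host}:{mysql_port}/{mysql_database}")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", mysql_table)
    .option("user", mysql_user)
    .option("password", mysql_password)
    .option("batchsize", 5000)
    .option("numPartitions", 4)
    .mode("append")
    .save())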

Step 6: Stopping the SparkSession

After the data is successfully written to MySQL, it’s essential to stop the SparkSession to release the allocated resources.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from getpass import getpass

# Step 1: Set up the SparkSession
spark = (
    SparkSession.builder
    .appName("CSV to MySQL ETL Pipeline")
    .getOrCreate()
)

# Step 2: Read the CSV file, treating the first row as a header and
# letting Spark infer the column types
csv_path = "data/file.csv"
df = spark.read.csv(csv_path, header=True, inferSchema=True)

# Step 3: Perform transformations (here, selecting and renaming columns)
transformed_df = df.select(
    col("column1").alias("new_column1"),
    col("column2").alias("new_column2"),
    # Add more transformations as needed
)

# Step 4: Configure the MySQL connection
# Replace the placeholders below with your own connection details
mysql_host = "********"
mysql_port = "****"
mysql_database = "***"
mysql_table = "****"
mysql_user = "****"
mysql_password = getpass("Enter MySQL password: ")

# Step 5: Write the transformed data to MySQL
transformed_df.write.format("jdbc").options(
    url=f"jdbc:mysql://{mysql_host}:{mysql_port}/{mysql_database}",
    driver="com.mysql.cj.jdbc.Driver",  # use "com.mysql.jdbc.Driver" for Connector/J 5.x
    dbtable=mysql_table,
    user=mysql_user,
    password=mysql_password,
).mode("overwrite").save()

# Step 6: Stop the SparkSession to release resources
spark.stop()

Conclusion

By leveraging Apache Spark’s capabilities, we can build robust and scalable ETL pipelines to handle large volumes of data efficiently. In this article, we walked through the process of reading data from a CSV file, applying transformations, and loading it into a MySQL database. With Spark’s extensive functionalities and flexibility, you can customize the pipeline to suit your specific data integration requirements. ETL pipelines are vital components in modern data architectures, empowering organizations to extract insights from diverse data sources and drive data-driven decision-making processes.
