Today PySpark is one of the favorite choices among data engineers and data scientists. This tutorial is for someone who has already worked with pandas and is either starting fresh with PySpark or already using it. Personally, I find pandas superior in terms of functions, documentation, and open-source community support, but PySpark does all the magic when you are dealing with real-world data and data ingestion pipelines. This tutorial is not about which tool is best, but about understanding the basic differences in code commands between them.
PySpark, built on top of Spark, offers many features such as Spark SQL, DataFrames, Streaming, and machine learning support. PySpark provides a Python interface to Spark that helps scale computations across distributed environments.
When to use PySpark over Pandas?
Dealing with large datasets in pandas can be painful due to system memory constraints while performing complex operations. PySpark is one of the alternatives when dealing with big data, and pandas users will find most of its commands quite similar. So why is PySpark faster than pandas? The main driver program in a Spark application controls the functions and executes operations in parallel on the cluster. Spark partitions the data across the nodes in the cluster and processes the elements in parallel. This distributed computing is what lets PySpark perform operations faster than pandas.
However, pandas is still very powerful for complex calculations and data transformations. It has rich open-source community support and good documentation, which eases the developer's life, whereas PySpark is still evolving and lacks documentation. To get the best of both worlds, use PySpark and pandas together whenever required; for example, PySpark dataframes can easily be converted to pandas dataframes.
spark_df.toPandas()  # To convert a Spark DataFrame to a pandas DataFrame
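For example, here is a minimal sketch of moving in both directions; the DataFrame contents are only illustrative:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pandas_df = pd.DataFrame({'Name': ['Ajay', 'Arun'], 'Age': [29, 21]})
spark_df = spark.createDataFrame(pandas_df)   # pandas -> Spark
round_trip = spark_df.toPandas()              # Spark -> pandas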
Pandas To PySpark Code Comparison
The comparison below covers creating a dataframe, reading the top and last records, checking datatypes, group-by aggregations, joins, statistical summaries, etc.
Pandas

# Using pandas
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Ajay', 'Arun', 'Deepak'],
                   'Age': [29, 21, 34]})

# Shape of a dataframe
df.shape

# Display top 5 rows
df.head(5)

# Display last 5 rows
df.tail(5)

# Column attributes
df.columns

# Rename columns
df.rename(columns={'old': 'new'})

# Drop columns
df.drop('column', axis=1)

# Filter records
df[df.col > 50]

# Fill null values
df.fillna(0)

# Groupby aggregation
df.groupby(['col1', 'col2']).agg({'col3': 'sum', 'col4': 'mean'})

# Summary statistics
df.describe()

# DataFrame datatypes
df.info()

# Correlation (Pearson is the default method)
df.corr()
df['A'].corr(df['B'])

# Covariance
df.A.cov(df.B)

# Log transformation
df['log_A'] = np.log(df.A)

# Joins
left.merge(right, on='Key')
left.merge(right, left_on='col1', right_on='col2')

PySpark

# Using PySpark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('Ajay', 29), ('Arun', 31), ('Deepak', 34)],
                           ['Name', 'Age'])
df.show()

# Shape of a dataframe
df.count(), len(df.columns)

# Display top 5 rows
df.show(5)
df.head(5)

# Display last 5 rows
df.tail(5)

# Column attributes
df.columns

# Rename columns
df.withColumnRenamed('old', 'new')

# Drop columns
df.drop('column')

# Filter records
df[df.col > 50]

# Fill null values
df.fillna(0)

# Groupby aggregation
df.groupby(['col1', 'col2']).agg({'col3': 'sum', 'col4': 'avg'})

# Summary statistics
df.summary().show()
df.describe().show()

# DataFrame datatypes
df.dtypes

# Correlation (the Pearson coefficient is the only method available at this time)
df.corr('A', 'B')

# Covariance
df.cov('A', 'B')

# Log transformation
df.withColumn('log_A', F.log(df.A))

# Joins
left.join(right, on='Key')
left.join(right, left.col1 == right.col2)
One may still debate pandas vs PySpark, but pandas has lots of features, such as support for time-series analysis, EDA, hypothesis testing, etc., which brings us back to knowing when to use PySpark and when to use pandas. Both support SQL functionality, which means one can easily run SQL queries just as in any relational database.
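As a small illustration on the Spark side (reusing the df and spark session created in the comparison above), a DataFrame can be registered as a temporary view and queried with plain SQL:

# Register the Spark DataFrame as a temporary view
df.createOrReplaceTempView('people')

# Run SQL against it; the result is another Spark DataFrame
spark.sql('SELECT Name, Age FROM people WHERE Age > 25').show()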
RDD in PySpark
Spark is built around a concept called RDD, the Resilient Distributed Dataset, a fault-tolerant collection of elements that can be operated on in parallel. All transformations in PySpark are lazy, which means they are not computed until an action is triggered. PySpark should be a first choice for data engineers as it provides in-memory computation, lazy evaluation, partitioning, persistence, and coarse-grained operations.
There are two ways to create an RDD:
- Parallelized collections: the elements of an existing collection are distributed so they can be operated on in parallel
- External datasets: distributed datasets can be created from HDFS, local files, Amazon S3, Cassandra, etc.
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)          # parallelized collection
distFile = sc.textFile("mydataset.txt")  # external dataset
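Continuing the snippet above, here is a quick sketch of lazy evaluation: transformations such as map and filter only build a plan, and nothing runs until an action like collect() is called.

# Transformations are lazy -- this only records the operation
squared = distData.map(lambda x: x * x)

# An action triggers the actual computation on the cluster
squared.filter(lambda x: x > 4).collect()   # returns [9, 16, 25]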
PySpark Best Practices
- Use pyspark.sql.functions and the other built-in functions available in PySpark for faster computation (they run in the JVM); see the sketch after this list
- Use the same version of Python and packages on the cluster as on the driver
- For data visualization, use platforms with built-in UI support such as Databricks (e.g. Azure Databricks)
- Use Spark MLlib for machine-learning-related work
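As a hedged example of the first point (reusing the df with a Name column from the comparison above), a Python UDF ships rows between the JVM and the Python workers, while the equivalent built-in from pyspark.sql.functions stays inside the JVM:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Slower: a Python UDF moves each row out of the JVM and back
upper_udf = udf(lambda s: s.upper() if s else None, StringType())
df.withColumn('name_upper', upper_udf(df.Name))

# Faster: the built-in equivalent runs entirely inside the JVM
df.withColumn('name_upper', F.upper(df.Name))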
Things not to do:
- Do not iterate through rows; it works fine in pandas but not in PySpark
- Do not call df.toPandas().head(); the driver can run out of memory. Use df.limit(5).toPandas() instead (see the sketch below)
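A short sketch of both points, again reusing the df from the comparison above; the column names are illustrative:

from pyspark.sql import functions as F

# Express row-wise logic as a column expression instead of iterating over rows
df = df.withColumn('age_plus_one', F.col('Age') + 1)

# Bring only a small preview back to the driver before converting to pandas
preview = df.limit(5).toPandas()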
Data Visualization in PySpark
Unlike pandas, which works with many rich visualization libraries in Python such as matplotlib, seaborn, plotly, and more, PySpark does not have any inbuilt support for data visualization. However, data scientists can still make the best use of platforms like Databricks, which have an inbuilt UI for data visualization.
In general, we can take a sample of the data and convert it to pandas, then easily plot it and explore hypotheses about the sample that can be interpreted for the entire population.
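Here is a minimal sketch of that pattern, assuming a Spark DataFrame spark_df with a numeric Age column; the sample fraction is arbitrary:

import matplotlib.pyplot as plt

# Sample a small fraction, convert to pandas, and plot locally
sample_pdf = spark_df.sample(fraction=0.01, seed=42).toPandas()
sample_pdf['Age'].hist(bins=20)
plt.xlabel('Age')
plt.show()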