Brief Guide To Pandas For Time Series Analysys
A series of data is taken in a timely manner. For example total sales at a showroom every day, daily billing data for a service like Apple Music, hourly readings of a telemetry device, and so on.
Pandas is a powerful Python library that provides various functionalities for time series analysis. Here are some key features and techniques in pandas that you can use for time series analysis:
Loading and Manipulating Time Series Data: Pandas provides convenient tools for reading time series data from various formats (CSV, Excel, SQL databases, etc.) and transforming it into a pandas DataFrame with a datetime index. You can use functions like read_csv()
, read_excel()
, and read_sql()
to load the data. Once the data is loaded, you can manipulate it using pandas’ indexing, slicing, filtering, and resampling capabilities.
import pandas as pd # Load time series data data = pd.read_csv('data.csv') # Convert a column to datetime index data['datetime_column'] = pd.to_datetime(data['datetime_column']) data.set_index('datetime_column', inplace=True) # Accessing and slicing time series data subset = data['start_date':'end_date']
Handling Missing Values: Time series data often contains missing values that need to be handled before analysis. Pandas provides functions like dropna()
, fillna()
, and interpolate()
to handle missing data by dropping the rows, filling with specific values, or interpolating based on neighboring values.
# Drop rows with missing values data = data.dropna() # Fill missing values with a specific value data = data.fillna(0) # Interpolate missing values data = data.interpolate()
Resampling and Frequency Conversion: Pandas allows you to resample time series data to different frequencies (e.g., downsampling or upsampling) using functions like resample()
and asfreq()
. This can be useful for aggregating or interpolating data at different time intervals.y
# Resample data to monthly frequency monthly_data = data.resample('M').mean() # Upsample data to a higher frequency using interpolation upsampled_data = data.resample('D').interpolate()
Time Series Forecasting
Forecasting and predicting the sales/values/outcomes based on historical data points.
Helpful Methods in Pandas for Time-Series
shift() – Shift index by desired number of dates/periods. It can work as LAG and LEAD in SQL.
## points shifted by -3 or lagged by 3 df['Lag_by_3'] = df.shift(-3) # points shifted by 3 or lead by 3 df['Lead_by_3'] = df.shift(3)
rolling() – for any calculation related to the rolling window i.e. rolling/moving average, moving standard deviation, etc.
This is used to get the rolling/moving average, mean, median, % changes, and sum on a particular window.
Example: Rolling Sum of Sales with window of 7 days. Here for day d, we will take the sum sales data from d-6 and add it to d.
The rolling sum of 7 days window at day d = Sum(Sales from d-6 to d)
# get rolling sum with window of 7 days (given we've day-wise data)
df.rolling(7).sum()
resample() – for upsampling or downsampling of time-series data.
# get quarterly mean from our time series
quarterly_df = df.resample('Q').mean()
# it will show 4 rows of data for year 2017 if Date column is used as index
quarterly_df['2017']
diff() – get the discrete difference of data points over the given axis (columns, periods, etc.)
# it will calculate difference of data points with period=1
df['Sales'].diff()
You can find examples of these methods in this Kaggle Notebook:
Further readings and applications
If we don’t have more data for our time series, we can use Pandas and related eco-system. Every business has some kind of time-series data and we have just scratched the surface of time-series with simple exploratory data analysis. We have a lot of advanced libraries like Prophet that the industry is using very heavily for time-series analysis and forecasting. Stay tuned for more on that.