Time Series¶
Examples:
- Stock Prices
- Weather
- Heart Monitor Data
Components:
- Trend - identify a trend with visualization and smoothing techniques (e.g., rolling averages)
- Seasonality - repeating, predictable patterns. Identify using decomposition
- Cyclicality - long-term, irregular fluctuations that are not strictly seasonal, e.g., economic cycles. Identify with statistical models
- Noise - random variations in the data that do not follow a pattern, e.g., a water pipe break leading to extreme water usage
- Stationarity - a time series whose statistical properties (mean, variance, autocorrelation) do not change over time. Identify using rolling statistics (see the sketch below)
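A rolling-statistics check (which doubles as a smoothing technique for spotting the trend) can be done directly in pandas. A minimal sketch, assuming the water-usage DataFrame loaded below (data, indexed by Date) and its total_gallons column; the 30-day window is an arbitrary choice:
import matplotlib.pyplot as plt
# 30-day rolling statistics of total water usage
rolling_mean = data["total_gallons"].rolling(window=30).mean()
rolling_std = data["total_gallons"].rolling(window=30).std()
# Roughly flat rolling mean and std over time suggest stationarity;
# a drifting mean points to a trend, a changing std to unstable variance.
data["total_gallons"].plot(label="original", alpha=0.5)
rolling_mean.plot(label="30-day rolling mean")
rolling_std.plot(label="30-day rolling std")
plt.legend()
plt.title("Rolling statistics")
plt.show()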
In [3]:
import pandas as pd
data = pd.read_csv('Water_Usage_Data.csv', parse_dates = ['Date'], index_col='Date')
In [ ]:
print(data.head())
print(data.info())
print(data.describe())
            total_gallons  residential_gallons  multi_family_gallons  \
Date
2019-01-01    17240452.01          6896553.991           4936297.028
2019-01-02    20204457.43          7262534.651           4687956.808
2019-01-03    19367188.31          6214681.059           4649834.954
2019-01-04    19294498.24          6032113.704           4552227.705
2019-01-05    18073429.03          6678241.313           4914119.605

            commercial_gallons  industrial_gallons  public_authority_gallons
Date
2019-01-01         2899003.223         372027.0000               2136570.769
2019-01-02         4954364.353         538747.0000               2760854.620
2019-01-03         5051775.413         602900.0000               2847996.884
2019-01-04         5249283.403         613821.9999               2847051.425
2019-01-05         3844688.871         430945.0000               2205434.239

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 600 entries, 2019-01-01 to 2020-08-23
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   total_gallons             600 non-null    float64
 1   residential_gallons       600 non-null    float64
 2   multi_family_gallons      600 non-null    float64
 3   commercial_gallons        600 non-null    float64
 4   industrial_gallons        600 non-null    float64
 5   public_authority_gallons  600 non-null    float64
dtypes: float64(6)
memory usage: 32.8 KB
None

       total_gallons  residential_gallons  multi_family_gallons  \
count   6.000000e+02         6.000000e+02          6.000000e+02
mean    2.540250e+07         6.709635e+06          6.665081e+06
std     5.380030e+07         1.485217e+06          3.594590e+07
min     4.506101e+05         1.000827e+05         -4.879963e+06
25%     1.970888e+07         6.087324e+06          5.105664e+06
50%     2.097685e+07         6.702186e+06          5.308363e+06
75%     2.296736e+07         7.266935e+06          5.549265e+06
max     9.176957e+08         2.578089e+07          8.850357e+08

       commercial_gallons  industrial_gallons  public_authority_gallons
count        6.000000e+02        6.000000e+02              6.000000e+02
mean         5.956399e+06        8.955720e+05              5.175812e+06
std          1.569350e+07        6.936311e+05              3.679105e+07
min          8.751938e+04        2.208200e+04             -5.895248e+06
25%          3.726428e+06        5.384380e+05              2.866413e+06
50%          5.055808e+06        6.830705e+05              3.555144e+06
75%          5.598446e+06        7.690850e+05              3.939894e+06
max          3.420817e+08        4.000079e+06              8.982956e+08
In [7]:
# Drop rows that contain any missing values
data.dropna(inplace=True)
# Or drop columns that contain missing values
data.dropna(axis=1, inplace=True)
# Or keep only rows with at least 3 non-null values
data.dropna(thresh=3, inplace=True)
Check for Stationarity¶
from statsmodels.tsa.stattools import adfuller
result = adfuller(data["your_column"])
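adfuller returns a tuple whose first two elements are the test statistic and the p-value. A minimal sketch of running and interpreting the test, assuming the total_gallons column from the dataset above:
from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test on total water usage
result = adfuller(data["total_gallons"])
adf_stat, p_value = result[0], result[1]
print(f"ADF statistic: {adf_stat:.3f}")
print(f"p-value: {p_value:.3f}")
# Rule of thumb: p-value < 0.05 rejects the unit-root null hypothesis,
# i.e. the series is likely stationary.
if p_value < 0.05:
    print("Likely stationary")
else:
    print("Likely non-stationary")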
Applying a Log Transform¶
import numpy as np
data["your_column"] = np.log(data["your_column"])
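np.log only accepts strictly positive values, and the transform can be reversed with np.exp. A small sketch, assuming the total_gallons column from the dataset above:
import numpy as np
# Compress large values and stabilize variance
data["log_total"] = np.log(data["total_gallons"])
# Reverse the transform when the original scale is needed again
data["total_recovered"] = np.exp(data["log_total"])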
Resampling - convert the data to a different frequency, e.g., weekly or hourly¶
Downsampling (to a coarser frequency) and Upsampling (to a finer frequency) - see the sketch below
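A minimal sketch of both directions with pandas resample, assuming the daily water-usage DataFrame loaded above:
# Downsample: aggregate daily usage into weekly totals
weekly = data["total_gallons"].resample("W").sum()
# Upsample: expand daily rows to hourly rows and interpolate the gaps
hourly = data["total_gallons"].resample("h").interpolate(method="linear")
print(weekly.head())
print(hourly.head())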
Decomposition¶
In [8]:
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
# Simulated time series: seasonal + trend + noise
np.random.seed(42)
period = 12 # e.g., monthly data with yearly seasonality
time = np.arange(100)
data = 0.1 * time + 2 * np.sin(2 * np.pi * time / period) + np.random.normal(0, 0.5, size=len(time))
# Create a monthly time series ("ME" = month-end; replaces the deprecated "M" alias)
ts = pd.Series(data, index=pd.date_range(start="2020-01-01", periods=100, freq="ME"))
# Decompose
result = seasonal_decompose(ts, model="additive", period=period)
# Plot
result.plot()
plt.suptitle("Seasonal Decomposition", fontsize=14)
plt.tight_layout()
plt.show()
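The fitted components are also available as attributes on the result object, which is useful for working with them directly, e.g. to build a seasonally adjusted series:
# Components returned by seasonal_decompose
trend = result.trend         # moving-average trend estimate (NaN at the edges)
seasonal = result.seasonal   # repeating seasonal pattern
residual = result.resid      # remainder after removing trend and seasonality
# Example: seasonally adjusted series
seasonally_adjusted = ts - seasonal
print(seasonally_adjusted.head())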
In [ ]: