mlehotay

developer blog and assorted data science projects

Electricity Usage in Ontario


import pandas as pd
import numpy as np
import altair as alt
pd.options.display.max_columns = 40

Preliminary Review

I received some excel files from my father-in-law containing information about electricity consumption in Ontario. He looked up the data on the web three times daily and manually entered it into a spreadsheet. There are three files in xlsx format. File 1 is observations from 2018, file 2 is observations from 2019, and file 3 is 1000 observations starting from August 2018 (approximately 13 months).

Loaded files in LibreOffice Calc to preview data.

Most date entries are in dd-mmm format but a few are in dd-mmm-yy.

Column headers span two rows: row 1 is name and row 2 is units. Header in file 2 is an image instead of text.

File 3 contains two sheets with data. Looks like 1st sheet is a copy of the 2018 data and 2nd sheet is combined data from 2018-08-25 through 2019-09-30.

I see at least one data entry error (cell F12 in file 1).

Files 1 and 3 contain summary statisitics: Three rows of stats at the bottom of the sheet: min, max, and avg of each column. The two rightmost columns show average daily totals per month, and name of the month.

Image of spreadsheet

Data Cleaning

Fixed headers to contain both name and units in one row.

Set the format of all the date columns to yyyy-mm-dd.

Some of the numbers in the data contain commas (cell E6 and all of column R). Fixed by setting excel number format to 0.0 for percent columns and 0 for MW columns.

Deleted summary statistics.

Exported all four sheets as CSV.

Diffed the exported csv files. The extra sheet in file 3 is an unchanged copy of the data from file 1. Generated checksums to verify.

!diff data/electricity1.csv data/electricity3a.csv
!cksum data/*.csv
1070850776 26838 data/electricity1.csv
4245893866 90971 data/electricity2.csv
3911391547 96667 data/electricity3.csv
1070850776 26838 data/electricity3a.csv
df1 = pd.read_csv('data/electricity1.csv') # 2018 data
df2 = pd.read_csv('data/electricity2.csv') # 2019 data
df3 = pd.read_csv('data/electricity3.csv') # ~1000 observations starting August 2018
df4 = df1.append(df2)
df4.to_csv('data/electricity4.csv', index=False) # all data in one file
print(df1.shape)
print(df2.shape)
print(df3.shape)
print(df4.shape)
(293, 18)
(1018, 18)
(1075, 18)
(1311, 18)

Feature Selection

So now we've got all 1311 records in one data frame, df4. Let's take a look at these values and see what's up with the percentages. There were no formulas in the excel files so I'm thinking these values may have been calculated by hand, or maybe transcribed from the website. In any case they are probably redundant since we have the actual MW values. Let's check the math to make sure.

Note: bad code here. will fix later.

# something is going on with the dtypes, try forcing them to be numeric
df = pd.read_csv('data/electricity1.csv', parse_dates={'iso8601': [0, 1, 2]},
                  dtype={'Nuclear %': np.float64,
                         'Nuclear MW': np.int64,
                         'Hydro %': np.float64,
                         'Hydro MW': np.int64,
                         'Gas %': np.float64,
                         'Gas MW': np.int64,
                         'Wind %': np.float64,
                         'Wind MW': np.int64,
                         'Solar %': np.float64,
                         'Solar MW': np.float64,
                         'Biofuel %': np.object, # fix me
                         'Biofuel MW': np.float64,
                         'Nuc.+Hyd. %': np.float64,
                         'Nuc.+Hyd. MW': np.int64,
                         'Total MW': np.int64
                        })
df.dropna(inplace = True)
df[((df['Nuclear MW']/df['Total MW']) - (df['Nuclear %']/100) >= 0.01) |
    ((df['Hydro MW'] / df['Total MW']) - (df['Hydro %'] / 100) >= 0.01) |
    ((df['Gas MW'] / df['Total MW']) - (df['Gas %'] / 100) >= 0.01) |
    ((df['Wind MW'] / df['Total MW']) - (df['Wind %'] / 100) >= 0.01) |
    ((df['Solar MW'] / df['Total MW']) - (df['Solar %'] / 100) >= 0.01)
    #((df['Biofuel MW'] / df['Total MW']) - (df['Biofuel %'] / 100) >= 0.01) |
    #((df['Nuc.+Hyd. MW'] / df['Total MW']) - (df['Nuc.+Hyd. %'] / 100) >= 0.01)
]
iso8601 Nuclear % Nuclear MW Hydro % Hydro MW Gas % Gas MW Wind % Wind MW Solar % Solar MW Biofuel % Biofuel MW Nuc.+Hyd. % Nuc.+Hyd. MW Total MW
18 2018-09-01 14:45:00 57.8 9635 19.0 3171 13.0 2173 8.4 1393 1.4 209.0 0.4 69.0 76.8 12806 15554
73 2018-10-04 15:45:00 60.0 10582 21.6 3469 4.5 683 6.1 974 1.9 289.0 0.2 27.0 81.6 14051 15144
99 2018-10-16 12:45:00 63.2 10619 22.3 3751 1.1 172 11.6 1950 1.7 251.0 0.3 52.0 85.5 14370 15038
143 2018-11-08 16:45:00 60.3 10547 26.3 4605 10.5 1835 3.7 457 0.1 25.0 0.2 28.0 86.6 15152 17085
237 2018-12-12 13:45:00 55.5 9973 21.7 3909 11.3 3034 11.0 1983 0.3 57.0 0.1 28.0 77.2 13882 18899

5 rows have miscalculated percentages, but all the other rows are calculated correctly. Probably data entry errors. Now that we've verified the percentage columns are derived from the MW columns, we can safely drop them.

df5 = pd.read_csv('data/electricity1.csv', parse_dates={'iso8601': [0, 1, 2]},
                  dtype={'Nuclear MW': np.int64,
                         'Hydro MW': np.int64,
                         'Gas MW': np.int64,
                         'Wind MW': np.int64,
                         'Solar MW': np.float64,
                         'Biofuel MW': np.float64,
                         'Nuc.+Hyd. MW': np.int64,
                         'Total MW': np.int64
                        })
df = df5.dropna()
df[df['Nuclear MW'] + df['Hydro MW'] - df['Nuc.+Hyd. MW'] >= 1]
iso8601 Nuclear % Nuclear MW Hydro % Hydro MW Gas % Gas MW Wind % Wind MW Solar % Solar MW Biofuel % Biofuel MW Nuc.+Hyd. % Nuc.+Hyd. MW Total MW
220 2018-12-06 10:45:00 56.8 10758 23.7 4492 3.0 575 16.0 3034 0.3 48.0 0.2 36.0 80.5 13250 18943
df = df5[['Nuclear MW', 'Hydro MW', 'Gas MW', 'Wind MW', 'Solar MW', 'Biofuel MW', 'Total MW']]
df[df['Nuclear MW'] + df['Hydro MW'] + df['Gas MW'] +
   df['Wind MW'] + df['Solar MW'] + df['Biofuel MW'] - df['Total MW'] >= 1]
Nuclear MW Hydro MW Gas MW Wind MW Solar MW Biofuel MW Total MW
18 9635 3171 2173 1393 209.0 69.0 15554
73 10582 3469 683 974 289.0 27.0 15144
82 10684 3965 431 1567 193.0 135.0 16795
99 10619 3751 172 1950 251.0 52.0 15038
139 10509 3357 247 3427 69.0 29.0 17391
143 10547 4605 1835 457 25.0 28.0 17085
237 9973 3909 3034 1983 57.0 28.0 18899

The Nuc.+Hyd. MW and Total MW columns are also derived from the other columns, so we don't need them either.

More Data Cleaning

Need to clean up all the code above, but not tonight. Read the CSV files again instead.

df1 = pd.read_csv('data/electricity1.csv', parse_dates={'iso8601': [0, 1, 2]})
df2 = pd.read_csv('data/electricity2.csv', parse_dates={'iso8601': [0, 1, 2]})
df = df1.append(df2)

Rename the columns, select just the ones we want.

df.columns
Index(['iso8601', 'Nuclear %', 'Nuclear MW', 'Hydro %', 'Hydro MW', 'Gas %',
       'Gas MW', 'Wind %', 'Wind MW', 'Solar %', 'Solar MW', 'Biofuel %',
       'Biofuel MW', 'Nuc.+Hyd. %', 'Nuc.+Hyd. MW', 'Total MW'],
      dtype='object')
df.columns = ['Date','', 'Nuclear','', 'Hydro','', 'Gas','', 'Wind','', 'Solar','', 'Biofuel','','','']
df = df[['Date','Nuclear', 'Hydro', 'Gas', 'Wind', 'Solar', 'Biofuel']]

Deal with the missing values. It makes sense to set them all to zero.

df.isnull().sum()
Date         0
Nuclear      0
Hydro        0
Gas          0
Wind         4
Solar      827
Biofuel    596
dtype: int64
df['Wind'] = np.where(df['Wind'].isnull(), 0, df['Wind'])
df['Solar'] = np.where(df['Solar'].isnull(), 0, df['Solar'])
df['Biofuel'] = np.where(df['Biofuel'].isnull(), 0, df['Biofuel'])

Data Visualization

We are finally ready to create some awesome charts! Let's see what Altair can do with this data.

(
    alt.Chart(df)
    .mark_line()
    .encode(x='Date', y='Nuclear')
)

That is pretty nice! Can we plot another series?

(
    alt.Chart(df)
    .mark_line()
    .encode(x='Date', y='Nuclear')
    .encode(x='Date', y='Hydro')
)

Um, no. That is just showing hydro, not nuclear. I think I need to do some reading.

[Time passes...]

It looks like Altair likes long data more than wide data. Let's make it long instead.

df = pd.melt(df, id_vars='Date', var_name='Type', value_name='MW')
df.to_csv('data/electricity5.csv', index=False)
df.head()
Date Type MW
0 2018-08-25 13:45:00 Nuclear 11017
1 2018-08-25 20:45:00 Nuclear 11084
2 2018-08-26 08:45:00 Nuclear 11058
3 2018-08-26 12:45:00 Nuclear 11066
4 2018-08-26 16:45:00 Nuclear 11047

OK! I think we're ready. I've decided I want to make a stacked area chart. Let's give it a shot.

alt.Chart(df.sample(5000)).mark_area().encode(
    x="Date:T",
    y=alt.Y("MW:Q"),
    color="Type:N"
)

Well, that was not what I had in mind. I think there is too much variability from day to day. Maybe we need to fit regressions for each of the power types, or use moving averages. I'll try the moving averages first.

#df['moving'] = df.groupby('Type').transform(lambda x: x.rolling(10, 1).mean())
#df.index = pd.date_range('1/1/2018', periods=1100)
line = alt.Chart(df.sample(5000)).mark_line(
    color='red',
    size=3
).transform_window(
    rolling_mean='mean(MW)',
    frame=[-15, 15]
).encode(
    x='Date:T',
    y='MW:Q'
)

points = alt.Chart(df.sample(5000)).mark_point().encode(
    x='Date:T',
    y=alt.Y('MW:Q',
            axis=alt.Axis(title='MW'))
)

points + line

To be continued...