← back to blog

Energy production and changing weather in Denmark

21 January 2025

One of the main reasons I maintain this blog is because I like delving into methods and topics that I don’t get the chance to during my day job. One of those topics is the energy market. As energy markets move towards more sustainable sources of production, there are some really interesting challenges arising that offer for some fun.

So as always, I decided to find some data and do some data engineering with it. I think I’ll split this article up in two parts, one for the data collection and engineering part and one for the quantitative analysis. I’ve decided to focus on the Danish market today. Both because I look at Norway already quite a bit, and Denmark has a nice amount of data available that are fun to play around with too. I’ve decided to focus on three main areas: energy production, weather and climate, and energy prices. There’s data available too for energy consumption and energy import/export too but it seems out of scope. If I get to it I’ll do some data analysis and create some machine learning models, but those are perhaps not as exciting since they are commonly used in general tutorials. So perhaps I’ll put that in a next blogpost so I can see if I can make something interesting out of it. First we need to do the data collection and build a data pipeline. So let’s collect some data!

Energy production

So we’ll stick to the usual suspects, the requests and json packages are essential to work with APIs. We’ll store the data in a structured tabular format (hence pandas) in a SQLite database (hence sqlite3). Some of the APIs require API keys, which we’ll store in a .env file (hence dotenv).

import os
import requests
import json
import pandas as pd
import sqlite3
from dotenv import load_dotenv

load_dotenv()

Since we’ll use the requests features a few times to extract JSON-formatted data and convert it into tabular data, we’ll define a function that takes an API call through a URL, specify the field we want to parse, and define the field in the JSON that contains the data we’re interested in. Some data providers require that you the user agent of the request. While it doesn’t hurt to include it when it’s not required, I’d like to keep it simple when possible, so when it’s not required we’ll simply use the requests.get() function, otherwise we’ll use the requests.Session() functionality and supply the headers. For this I simply copied the default one from my setup (Mac OS with Safari). For a list of alternative user agents see useragents.io.

Then we can use pandas’ json_normalize() function with the specified field to convert the JSON format to a tabular format compatible with pandas. Then we’ll also clean the column names using a function that kinda floats with me from project to project since it’s widely applicable and useful.

def request_to_pd(url, data_field, specify_headers=False):
    """
    Get the result of a requests call in a data frame
    """
    if specify_headers:
      session = requests.Session()
      session.headers.update(
        {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " + \
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.1.1 Safari/605.1.15"}
      )
      response = session.get(url)
    else:
      response = requests.get(url)
    result = response.json()
    df = pd.json_normalize(result[data_field])
    df = clean_column_names(df)
    return df

def clean_column_names(df):
    """Clean column names"""
    df.columns = (
      df.columns.str.lower()
      .str.replace(" ", "_")
      .str.replace(".", "_")
    )
    return df

Now we can use this function to extract some data about energy production. For this we will use the API from a website called Energi Data Service who maintain a really neat API. By default it only returns a limited number of records, but we can override it by supplying a “limit” parameter to the API call and set it to 5 million. This data has a very high resolution at a data point per hour per municipality in Denmark, so it’ll be a big dataset. Be warned. Before storing the dataset to the SQLite database we first want to parse the date columns so it’ll be easier to work with later, so we’ll format those. And then we’ll store them in the database. Let’s also look at number of missing values in the data frame.

I’ll rerun this analysis regularly while writing, hence the if_exists="replace"
df_production = request_to_pd(
  "https://api.energidataservice.dk/dataset/ProductionMunicipalityHour?limit=5000000",
  "records"
)
df_production["municipalityno"] = df_production["municipalityno"].astype(int)
for dt_col in ["hourutc", "hourdk"]:
    df_production[dt_col] = pd.to_datetime(df_production[dt_col], utc=True)
df_production["measurement_time"] =  (
  pd.to_datetime(df_production["hourutc"])
  .dt.strftime("%Y-%m-%d %H:%M")
)

conn = sqlite3.connect(os.path.join("data", "energy_data.db"))
df_production.to_sql(
  name="production",con=conn,
  if_exists="replace", index=False
)

df_production.isna().sum()
hourutc                          0
hourdk                           0
municipalityno                   0
solarmwh                     33846
offshorewindlt100mw_mwh    2696560
offshorewindge100mw_mwh    2736359
onshorewindmwh              131328
thermalpowermwh             429406
measurement_time                 0
dtype: int64

The dataset above contains some information about the municipality, specifically the municipality number or code. I don’t live in Denmark, so the municipality code itself doesn’t tell me much (I’d be impressed if any Danish person could accurately identify the municipality by their code but I’m sure they are out there). Instead, we can download some relevant information from Danmarks Statistikk (Statistics Denmark), specifically their table on regions, provinces and municipalities. This website doesn’t have an API, but we can download a good-old CSV file instead and read that into the database instead.

df_municipality = pd.read_csv(
  os.path.join("data", "csv_da.csv"),
  delimiter=";"
)
df_municipality = clean_column_names(df_municipality)
df_municipality = df_municipality.loc[
  df_municipality["niveau"] == 3, ["kode", "titel"]
]
df_municipality.rename(
  columns={"kode": "code", "titel": "name"},
  inplace=True
)
df_municipality.to_sql(
  name="municipality", con=conn,
  if_exists="replace", index=False
)

So let’s combine the production data with the municipality names.

df_production_munic = pd.read_sql(
  """
  SELECT * FROM production AS pr
  LEFT JOIN municipality AS mu ON pr.municipalityno = mu.code
  """,
  con=conn
)

Weather and climate data

Cool, we’re not done yet, not by a long shot. I think we should get even more data. Here I’d like to particularly focus on data relevant for the more volatile energy production sources, wind and solar. An obvious dataset to look at for these energy sources is of course weather. The wind turbines don’t turn if there’s no wind, and the solar panels aren’t exposed to sunlight if there’s a thick cloud cover.

See here for an overview of all variables available

Both those features are also available from yet another Danish data registry. This time we’ll use the API from Danmarks Meteorologiske Institut (DMI) which maintains an open database. It contains a variety of categories, like meteorological observations, climate data, radar data, and forecast data. Here we’re particularly interested in historical climate data, so we’ll create a user and get our API key.

Be warned that the data frame is quite big. The raw JSON is about 250MB large for 300 000 records

The API key is essential for the next part. I’ve saved mine as usual in a .env file which I’ve excluded from Git versioning for obvious reasons. To call the API I’ll define yet another function which wraps around the initial request_to_pd() function and deal with some of the other data cleaning that comes with this particular API and the columns. We’ll run this function with the API call twice, once for wind speed (wind_speed) and once for cloud cover (cloud_cover). This will give us an hourly overview of those two features going back 300 000 records.

def get_weather_data(feature):
  url = f"https://dmigw.govcloud.dk/v2/climateData/collections/municipalityValue/items?parameterId=mean_{feature}&limit=300000&api-key={os.getenv('climateAPI')}"
  df_wthr = request_to_pd(url, "features")
  df_wthr.rename(columns={"properties_value": feature}, inplace=True)
  df_wthr.columns = [x.replace("properties_", "") for x in df_wthr.columns]
  df_wthr["lon"] = df_wthr["geometry_coordinates"].apply(lambda x: x[0])
  df_wthr["lat"] = df_wthr["geometry_coordinates"].apply(lambda x: x[1])
  df_wthr.drop(columns=["geometry_coordinates"], inplace=True)
  df_wthr["municipalityid"] = df_wthr["municipalityid"].astype(int)
  df_wthr["calculatedat"] = pd.to_datetime(df_wthr["calculatedat"], format="ISO8601").dt.floor('h')
  df_wthr["measurement_time"] = df_wthr["calculatedat"].dt.strftime("%Y-%m-%d %H:%M")

  df_wthr.to_sql(name=feature, con=conn, if_exists="replace", index=False)

  return df_wthr


df_weather_wind_speed = get_weather_data("wind_speed")
df_weather_cloud_cover = get_weather_data("cloud_cover")

Now we have some nice data to work with. Let’s put it all together. we can do this in Python by first loading all the data, but I think views in SQL are sometimes more efficient since you don’t first need to load the data into memory but instead let the database handle it, which is more efficient than doing the merging in Python or R. In case it already exists, we’ll drop the view and create a new one since SQLite doesn’t have a CREATE OR REPLACE syntax.

cursor = conn.cursor()
cursor.execute("DROP VIEW IF EXISTS production_weather;")
cursor.execute(
  """
  CREATE VIEW production_weather AS 
    SELECT
      pr.*,
      mu.name,
      ws.wind_speed,
      cc.cloud_cover
    FROM production AS pr
    LEFT JOIN municipality AS mu ON
      pr.municipalityno = mu.code
    LEFT JOIN wind_speed AS ws ON
      pr.measurement_time = ws.measurement_time
      AND pr.municipalityno = ws.municipalityid
    LEFT JOIN cloud_cover AS cc ON
      pr.measurement_time = cc.measurement_time
      AND pr.municipalityno = cc.municipalityid
  """
)
plotnine just doesn’t have all the functionality {ggplot} does

Okay, now let’s move to R for the next bit. I work mostly in Python and SQL, but when it comes to data visualization I still believe {ggplot} is the best tool for the job. We’ll use {ggtext} as usual for some of the manipulations of text.

library(tidyverse)
library(ggtext)

For the interactions with the database we’ll use the default {DBI} package, and to access the SQLite driver we’ll use the {RSQLite} package. Then we can for example have a look at what the production_weather table looks like.

conn <- DBI::dbConnect(
  RSQLite::SQLite(),
  dbname = "./data/energy_data.db"
)
data <- DBI::dbReadTable(conn, "production_weather")
data |>
  glimpse()
Rows: 3,543,694
Columns: 12
$ hourutc                 <chr> "2025-01-12 04:00:00+00:00", "2025-01-12 04:00…
$ hourdk                  <chr> "2025-01-12 05:00:00+00:00", "2025-01-12 05:00…
$ municipalityno          <int> 101, 147, 151, 153, 155, 157, 159, 161, 163, 1…
$ solarmwh                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ offshorewindlt100mw_mwh <dbl> 10.6365, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3…
$ offshorewindge100mw_mwh <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ onshorewindmwh          <dbl> 3.705670, 0.000000, 0.000000, 0.000000, 0.0000…
$ thermalpowermwh         <dbl> 176.37371, NA, 0.00000, 0.00002, NA, NA, 0.000…
$ measurement_time        <chr> "2025-01-12 04:00", "2025-01-12 04:00", "2025-…
$ name                    <chr> "København", "Frederiksberg", "Ballerup", "Brø…
$ wind_speed              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cloud_cover             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

And now we can make a simple plot. This is by no means particularly interesting, but I just wanted to show the whole process from data extraction to analysis. We’ll do a few more of these later on. In this case I’m particularly interested in the average wind speed across the entire country (per day in m/s) and the average energy production for that day nationally (in MWh).

data |>
  mutate(
    measurement_day = floor_date(as.Date(hourutc), unit = "days"),
  ) |>
  group_by(measurement_day) |>
  summarise(
    mean_wind_speed = mean(wind_speed, na.rm = TRUE),
    mean_wind_production = mean(onshorewindmwh, na.rm = TRUE)
  ) |>
  ggplot(aes(x = mean_wind_speed, y = mean_wind_production)) +
  geom_point(
    size = 4, alpha = 0.5,
    color = "#0F7269", stroke = 0
  ) +
  geom_smooth(
    color = "grey20", fill = "#0F7269",
    alpha = 0.25, method = "gam"
  ) +
  labs(
    title = "Correlation between wind speed<br>
    and wind power production",
    x = "Mean wind speed (_m/s_)",
    y = "Mean on-shore wind production (_MWh_)"
  ) +
  scale_x_continuous(breaks = seq(0, 12, by = 2)) +
  theme_minimal() +
  theme(
    plot.title.position = "plot",
    plot.title = element_markdown(face = "bold", size = 18),
    axis.title = element_markdown(face = "bold")
  )
The statistics educators can burn me on the stake for this one

It makes sense that there’s a correlation here obviously, since the wind makes the turbines turn. Correlation doesn’t imply causation, unless it does. For anyone who’s ever been to Denmark the wind speeds look a bit low. 10 m/s is about a 5 on the Beaufort scale, classified as a “fresh breeze”. Especially at the coast (where most of the wind turbines are) there is often a bit more than a “fresh breeze”, but remember that these numbers are averaged across the country.

You can check which layers are available using sf::st_layers()

Let’s make some maps, those are always fun. I downloaded a map of Denmark in a GeoPackage format from gadm.org and load it into R using the st_read() from the {sf} package. Since our data is already on a municipality level, we’ll load the relevant layer from the file.

denmark_sf <- sf::st_read(
  "data/gadm41_DNK.gpkg",
  layer = "ADM_ADM_2"
) |>
  janitor::clean_names()

We didn’t just get the geospatial data because it’s fun, we want to do something with it. In this case, make a plot. One could also get some features to use in modeling, like proximity to the sea from these kinds of datasets, which is fun to calculate, but that is outside the scope of this post. But here comes the infamous saying: maybe in a future post. Instead, we’ll go straight to the plotting of the map. We can calculate the average wind speed and wind turbine production per day. Since continuous variables don’t work very well on a single map (we could create a GIF but it’s not a great alternative here either IMO). So let’s commit another cardinal sin and categorize the continuous variable of wind speed. That way we can plot the bins in separate maps and show the average wind energy production across those bins.

data_agg_wind <- data |>
  drop_na(wind_speed) |>
  mutate(
    measurement_day = floor_date(as.Date(hourutc), unit = "days"),
  ) |>
  group_by(measurement_day, name) |>
  summarise(
    mean_wind_speed = mean(wind_speed, na.rm = TRUE),
    mean_wind_production = mean(onshorewindmwh, na.rm = TRUE)
  ) |>
  mutate(
    wind_speed_bins = cut(
      mean_wind_speed,
      breaks = c(0, 3, 6, 9, Inf)
    )
  ) |>
  group_by(wind_speed_bins, name) |>
  summarise(
    mean_onshorewindmwh = mean(mean_wind_production)
  ) |>
  inner_join(denmark_sf, by = c("name" = "name_2"))

denmark_sf |>
  ggplot(aes(geometry = geom)) +
  geom_sf() +
  geom_sf(
    data = data_agg_wind |> drop_na(wind_speed_bins),
    aes(fill = mean_onshorewindmwh)
  ) +
  labs(
    title = "Wind energy production (_MWh_) by average wind speed (_m/s_)",
    fill = "Wind energy production (_MWh_)"
  ) +
  scale_fill_continuous(
    na.value = "transparent",
    guide = guide_colorbar(
      barwidth = 15, barheight = 0.75
    )
  ) +
  facet_wrap(~wind_speed_bins, nrow = 1) +
  coord_sf(xlim = c(NA, 13)) +
  theme_void() +
  theme(
    plot.title = element_markdown(
      face = "bold",
      margin = margin(b = 10)
    ),
    strip.text = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_markdown(),
    legend.title.position = "top"
  )

Looks like there is at least a vague trend here. That with increasing wind speed comes increased power production from wind turbines seems not so surprising, but now we can also show that this is mostly true for the municipalities in the north and east of Denmark close to the North Sea.

We can do the same thing for solar power and cloud cover that we extracted earlier. The data visualization issue here is that we have two variables that are inversely correlated, with increased cloud cover the production from solar cells goes down. That’s why I calculate the bins like above for the same reason, and then I calculate the relative change in solar power generation compared to the previous bin. That also means that we need to take a reference point, otherwise the first bin has nothing to compare to. So we’ll plot the reduction in production relative to the previous bin. This means that a uniformly colored plot means a steady reduction of e.g. 5MWh across the subplots means that at 80% to 100% of cloud cover means 4 * 5MWh = 20MWh of reduction compared to the situations with 0% to 20% cloud cover.

Show code
data_agg_solar <- data |>
  drop_na(cloud_cover) |>
  mutate(
    cloud_cover_bins = cut_interval(cloud_cover, length = 20)
  ) |>
  group_by(cloud_cover_bins, name) |>
  summarise(
    mean_solarmwh = mean(solarmwh)
  ) |>
  group_by(name) |>
  arrange(cloud_cover_bins) |>
  mutate(
    solar_diff = mean_solarmwh - lag(mean_solarmwh)
  ) |>
  inner_join(denmark_sf, by = c("name" = "name_2"))

denmark_sf |>
  ggplot(aes(geometry = geom)) +
  geom_sf() +
  geom_sf(
    data = data_agg_solar |> drop_na(solar_diff),
    aes(fill = solar_diff)
  ) +
  labs(
    title = "Relative reduction in solar power<br>
    generation with increasing cloud cover",
    subtitle = "Change in production relative to previous bin",
    fill = "Relative change in output"
  ) +
  scico::scale_fill_scico(
    palette = "lajolla",
    na.value = "transparent",
    labels = scales::label_number(suffix = " MWh"),
    guide = guide_colorbar(
      barwidth = 15, barheight = 0.75,
      title.position = "top", reverse = TRUE
    )
  ) +
  facet_wrap(~cloud_cover_bins, nrow = 1) +
  coord_sf(xlim = c(NA, 13)) +
  theme_void() +
  theme(
    plot.title = element_markdown(face = "bold"),
    plot.subtitle = element_markdown(margin = margin(b = 10)),
    strip.text = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_markdown(),
    legend.title.position = "top"
  )

Here we can see that not all municipalities are colored pale yellow, but instead have a more intense darker yellow/light orange color, meaning that the solar power production is quite consistently reduced with increasing cloud cover, as we expected.

Length of day

There’s actually several, but major is a relative term

There is one more major factor playing a role in solar production, which is the length of the day. Denmark is a Nordic country, which are located higher up in the northern hemisphere (duh). This means that the difference between the length of day in the summer and winter vary a lot more widely than closer to the equator (see my blogpost about that topic). We can look at the association between the length of day and solar power production next. Let’s first download some data. The Norwegian meteorological institute (MET) has a very nice API where we can get a number of variables related to sunrise and sunset. See the example below of what this API can provide.

df_met = request_to_pd(f"https://api.met.no/weatherapi/sunrise/3.0/sun?lat=56.2639&lon=9.5018&date={pd.to_datetime('now').strftime('%Y-%m-%d')}&offset=+01:00", "properties", specify_headers=True)
print(df_met.iloc[0])
body                                                      Sun
sunrise_time                           2025-01-21T08:36+01:00
sunrise_azimuth                                        126.02
sunset_time                            2025-01-21T16:31+01:00
sunset_azimuth                                         234.15
solarnoon_time                         2025-01-21T12:33+01:00
solarnoon_disc_centre_elevation                         14.01
solarnoon_visible                                        True
solarmidnight_time                     2025-01-21T00:33+01:00
solarmidnight_disc_centre_elevation                     -53.7
solarmidnight_visible                                   False
Name: 0, dtype: object

We’re not just interested in a single day, we’d like to get historical data as well, so we can simply put our custom request_to_pd() function in a loop starting on the 1st of January 2024 to today. We’ll do some simple parsing of date columns and then calculate the day length by substracting the sunrise time from the sunset time and converting it to seconds to get a nice numerical value. Obviously, we’ll store this one in the database as well.

df_sun = pd.DataFrame()
for day in pd.date_range("2024-01-01", pd.to_datetime("now").strftime("%Y-%m-%d")):
  df_tmp = request_to_pd(
    f"https://api.met.no/weatherapi/sunrise/3.0/sun?lat=56.2639&lon=9.5018&date={day.strftime('%Y-%m-%d')}&offset=+01:00",
    "properties",
    specify_headers=True
  )
  df_sun = pd.concat([df_sun, df_tmp])

df_sun["date"] = pd.to_datetime(df_sun["sunrise_time"]).dt.date
df_sun["day_length"] = pd.to_datetime(df_sun["sunset_time"]) - pd.to_datetime(df_sun["sunrise_time"])
df_sun["day_length"] = df_sun["day_length"].dt.total_seconds()

df_sun.to_sql(
  name="day_length",con=conn,
  if_exists="replace", index=False
)

We’ve done some plots now, so instead let’s just do a simple linear model (yes, with all the data and features we have available now we can do a lot more exciting stuff but I did the thing where I was too enthousiastic and wrote a blogpost that was too long already, so let’s do the exiciting stuff next time). Let’s select the relevant columns from the database and merge it with the dataset we already collected earlier. Since the length of day data is at a different granularity than the original data we need to aggregate by day and then we can run the model of solar energy production by cloud cover with day length as a covariate. We’ll get an estimate for each predictor.

data_sun <- DBI::dbGetQuery(
  conn,
  "SELECT date, day_length FROM day_length"
) |>
  mutate(
    date = as.Date(date)
  )

data_solar_day_length <- data |>
  mutate(
    measurement_day = floor_date(as.Date(hourutc), unit = "days"),
  ) |>
  group_by(measurement_day) |>
  summarise(
    mean_solarmwh = mean(solarmwh, na.rm = TRUE),
    mean_cloud_cover = mean(cloud_cover, na.rm = TRUE)
  ) |>
  drop_na() |>
  inner_join(data_sun, by = c("measurement_day" = "date"))

data_solar_day_length |>
  lm("mean_solarmwh ~ mean_cloud_cover + day_length", data = _) |>
  summary()
Call:
lm(formula = "mean_solarmwh ~ mean_cloud_cover + day_length", 
    data = data_solar_day_length)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1706 -0.7375  0.1798  0.9025  3.4142 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -2.094e+00  4.232e-01  -4.948 1.16e-06 ***
mean_cloud_cover -2.424e-02  3.473e-03  -6.978 1.50e-11 ***
day_length        1.606e-04  6.509e-06  24.668  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.442 on 351 degrees of freedom
Multiple R-squared:  0.757, Adjusted R-squared:  0.7556 
F-statistic: 546.8 on 2 and 351 DF,  p-value: < 2.2e-16

Looks like both of the features are very strongly associated with the solar power production. Which again is not very surprising given what we know about solar power and how it works.

Energy storage and prices

Note that this is supplier energy prices, not consumer prices

We’re not done yet collecting data. One more feature we could download is the energy prices. This is particularly interesting when we want to make inferences about how energy prices are related to the climate and energy. We could sign up for an account so we can download the individual Excel reports, but it’s probably easier just to parse the data from the website directly. For this I wrote a simple script that loops through the days we’re interested in and get the average price across the day for the relevant areas (DK1 and DK2). The data currence is DKK per MWh, but can be specified. Again, we’ll store this data in the database too. Since it fairly long, I’ll collapse it by default.

See the Python script for downloading the price data
"""
Get the average price per day
"""

import os
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
import sqlite3


def get_driver():
    """
    Get the driver
    """

    chrome_options = Options()
    chrome_options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=chrome_options)

    return driver


def get_averages_prices(driver, date):
    """
    Get the average price per day
    """

    url = f"https://data.nordpoolgroup.com/auction/day-ahead/prices?deliveryDate={date}&currency=DKK&aggregation=DeliveryPeriod&deliveryAreas=DK1,DK2"
    driver.get(url)

    time.sleep(1)
    html_day = driver.page_source

    soup = BeautifulSoup(html_day, "html.parser")
    summary_divs = soup.findAll(
        "div", {"class": "dx-datagrid-summary-item dx-datagrid-text-content"}
    )

    avg_dk1 = float(summary_divs[-4].text.replace(",", ".").replace("\xa0", ""))
    avg_dk2 = float(summary_divs[-1].text.replace(",", ".").replace("\xa0", ""))

    result_dict = {"date": date, "avg_dk1": avg_dk1, "avg_dk2": avg_dk2}

    return result_dict


def get_list_of_prices(driver, quiet=False):
    """
    Get a data frame of average energy prices
    """
    dates_oi = pd.date_range(
        start=pd.to_datetime("now") - pd.Timedelta("1d") - pd.Timedelta("61d"),
        end=pd.to_datetime("now") - pd.Timedelta("1d"),
    )
    dates_oi_str = [x.strftime("%Y-%m-%d") for x in dates_oi]

    df_prices = pd.DataFrame()
    for doi in dates_oi_str:
        if not quiet:
            print(f"Collecting for date: {doi}")
        df_tmp = pd.DataFrame([get_averages_prices(driver, date=doi)])
        df_prices = pd.concat([df_prices, df_tmp if not df_tmp.empty else None])

    return df_prices


def main():

    driver = get_driver()

    df_prices = get_list_of_prices(driver)

    conn = sqlite3.connect(os.path.join("data", "energy_data.db"))

    df_prices.to_sql(name="energy_prices", con=conn, if_exists="replace", index=False)


if __name__ == "__main__":
    main()

Now that we have this, we can also download yet another dataset from the Energi Data Service about gas storage use. Gas storage in Western Europe is typically used to maintain grid stability when energy production from other sources is limited. Gas storage facilities in Europe are mandated to be at least 90% full on November 1st of each year and gas can be tapped throughout the winter months when energy demands are higher and production is lower. Typically when gas storage is tapped it’s when there’s high demand and low production, meaning the energy prices are most likely higher in that period too. We can download this data the same way we did before.

df_storage = request_to_pd(
  "https://api.energidataservice.dk/dataset/StorageUtilization?limit=5000000",
  "records"
)

df_storage.to_sql(
  name="gas_storage", con=conn,
  if_exists="replace", index=False
)

So let’s combine the price, weather, length of day, and production data we have now. This time, let’s do the fun stuff in SQL instead and get an overview of all the relevant variables in one table for the dates we have it available. We still need to manually convert the character date string from SQL in R to get it as a date type.

data_prices_production <- DBI::dbGetQuery(
  conn,
  "
  WITH avg_production AS (
    SELECT
    strftime('%Y-%m-%d', measurement_time) AS date,
    AVG(onshorewindmwh) AS avg_wind,
    AVG(solarmwh) AS avg_solar
    FROM production_weather
    GROUP BY date
  )
  SELECT
    ep.date,
    (ep.avg_dk1 + ep.avg_dk2) / 2 AS price,
    ap.avg_wind AS wind_production,
    ap.avg_solar AS solar_production,
    gs.withdrawedtotal / 1e3 AS gas_withdrawal_kwh,
    dl.day_length
  FROM energy_prices AS ep
  INNER JOIN avg_production AS ap ON ep.date = ap.date
  INNER JOIN gas_storage AS gs ON
    ep.date = strftime('%Y-%m-%d', gs.gasday)
  INNER JOIN day_length AS dl ON ep.date = dl.date
  "
) |>
  mutate(
    date = as.Date(date)
  )

Now we can look at the association between the energy prices and the energy sources (wind, solar, and gas storage). I want to create a plot but the variables are in different units (solar and wind are in MWh, gas storage is in kWh, and energy price in DKK). So we’ll commit yet another no-no and use a double y-axis plot (all rules can be broken if you know how to, fight me). We’ll display the energy sources on the left y-axis (in MWh or kWh for visiblity reasons) and price on the right (in DKK). This way we can look at the overlap between the different lines. I’ve highlighted a few timeperiods that stood out to me that seemed interesting using annotate().

data_prices_production |>
  pivot_longer(
    -c(date, price, day_length),
    names_to = "energy_source"
  ) |>
  add_row(energy_source = "price") |>
  mutate(
    energy_source = case_when(
      energy_source == "price" ~ "Energy price (DKK)",
      str_detect(energy_source, "solar") ~ "Solar power (MWh)",
      str_detect(energy_source, "wind") ~ "Wind power (MWh)",
      str_detect(energy_source, "withdrawal") ~ "Gas storage (kWh)"
    )
  ) |>
  ggplot(aes(x = date, color = energy_source)) +
  annotate(
    "rect",
    xmin = c(as.Date("2024-11-22"), as.Date("2024-12-08")),
    xmax = c(as.Date("2024-11-27"), as.Date("2024-12-15")),
    ymin = rep(-Inf, 2), ymax = rep(Inf, 2), alpha = 0.2
  ) +
  geom_hline(yintercept = 0, linewidth = 1) +
  geom_path(aes(y = value), linewidth = 1.5, key_glyph = "point") +
  geom_path(
    aes(y = price / 10),
    linewidth = 1.5,
    color = ggthemes::tableau_color_pal()(1),
  ) +
  labs(
    title = "Energy price and energy sources",
    color = NULL,
    x = NULL
  ) +
  scale_y_continuous(
    name = "Energy generated/released",
    expand = expansion(add = c(5, 20)),
    sec.axis = sec_axis(
      transform = ~ . * 10,
      name = "Price per MWh",
      labels = scales::label_currency(
        prefix = "", suffix = " DKK", big.mark = " "
      )
    )
  ) +
  ggthemes::scale_color_tableau(
    guide = guide_legend(override.aes = list(size = 4))
  ) +
  theme_minimal() +
  theme(
    plot.title.position = "plot",
    plot.title = element_markdown(face = "bold", size = 18),
    legend.position.inside = TRUE,
    legend.position = c(0.75, 0.8),
    legend.background = element_rect(
      fill = "white", linewidth = 0.2
    ),
    axis.title.y.left = element_markdown(
      size = 14, margin = margin(r = 5)
    ),
    axis.title.y.right = element_markdown(
      size = 14, margin = margin(l = 5)
    )
  )

It seems that even though gas storage use is only a fraction of wind and solar production (kWh vs. MWh) there seems to be a clear association between the price and gas storage use. This makes sense because the grid operators only want to access gas storage when it’s really needed, so high demand (which means high prices) means a higher likelihood that the tap is opened at the gas storage facilities.

We can confirm this again using a simple linear model like we did above (once again, the more advanced and fun stuff will have to wait for later, this blogpost is way to long already) but this topic is fun.

data_prices_production |>
  lm(
    "price ~ gas_withdrawal_kwh + wind_production +
    solar_production*day_length",
    data = _
  ) |>
  summary()
Call:
lm(formula = "price ~ gas_withdrawal_kwh + wind_production +\n    solar_production*day_length", 
    data = data_prices_production)

Residuals:
    Min      1Q  Median      3Q     Max 
-514.28 -108.97   -8.86   94.68 1320.63 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  4.381e+03  1.942e+03   2.256   0.0287 *  
gas_withdrawal_kwh           7.937e+00  1.575e+00   5.039 7.07e-06 ***
wind_production             -2.952e+01  4.036e+00  -7.314 2.43e-09 ***
solar_production            -2.461e+03  3.225e+03  -0.763   0.4491    
day_length                  -1.351e-01  7.633e-02  -1.770   0.0830 .  
solar_production:day_length  8.940e-02  1.229e-01   0.727   0.4706    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 265.6 on 48 degrees of freedom
Multiple R-squared:  0.7275,    Adjusted R-squared:  0.6991 
F-statistic: 25.63 on 5 and 48 DF,  p-value: 1.698e-12
I only now realise I need to update quite a few sections here if I ever need to regenerate this dataset in the future, luckily I maintain backups

Given that this dataset was (initially) collected in winter the solar energy production was very low, and it seemed to confirm that it didn’t affect energy prices so much in winter, which seems logical. I’d love to rerun this analysis in summer to see if it changes then. Too bad the energy price data I had available only goes back like 60 days. But I’ve set up a scheme to automatically extract this data daily using a CRON job on my home server, so maybe I’ll recreate this analysis in the future once I’ve collected data with a longer time window (yet another promise of more fun things to talk about in a future post).

Over the course of this analysis above we’ve downloaded quite a bit of data from various sources. While main goal here was to do some fun analysis on a topic I’m interested in. It was also a fun exercise in creating reusable functions for automated data extraction using various APIs and storing the data efficiently even though they are at different granularities. I will most likely also re-use the database we created here in future teaching sessions and workshops for SQL databases since it’s a database with a few useful links but not too large to be overwhelming to beginner users. Always good to be able to reuse analysis in different contexts. Just to summarise, here’s the tables we currently have in the database.

SELECT name, type
FROM sqlite_master
ORDER BY name;
nametype
cloud_covertable
day_lengthtable
energy_pricestable
gas_storagetable
municipalitytable
productiontable
production_weatherview
wind_speedtable

8 records

If you’ve gotten to part reading through all the text and code above, my hat is off to you. I maintain this blog mainly because I think it’s fun to do, and thank you for sharing my enthousiasm for this topic. I hope it has provided you with whatever you came here for and perhaps you learned something as well. I certainly did, and had fun at the same time. I made a few promises about things I’d write about in future blogposts (more advanced data engineering, better modeling, etc.), I hope I can find the time to write those posts soon! Thank you!