Exploratory Data Analysis

Exploratory data analysis (EDA) is an initial data investigation, that helps to inspect data and figure out what can (and cannot!) be done with data at hand, what trends can be observed and what statistical tests could be used (see “Data” section for information on data that we used).

Therefore, this section is dedicated to data exploration! Moreover, in this section we used A LOT of data visualisations. Data visualisation can benefit EDA by injecting rigour as well as support inferences. Actually, some argue that the sharp distinction between exploratory and explanatory data analyses has started blurring as the size of datasets keeps increasing.

What is in our data?

We have discussed the benefits of EDA, but what is in the datasets we have chosen?

Our data can be broadly defined as providing country-level data on climate and individual-level data on attitudes towards climate action and demographic characteristics. Therefore, for the clarity purposes, we will stick to this division for now :)

Country-level EDA

To answer data questions we created an interactive RShiny dashboard! Our goal is to ensure that not only we understand the data but also those who listen to our data stories. Indeed, interactive data visualisation is a great way to make data (and data science!) more accessible to everyone.

Where do countries stand in relation to each other?

In the previous subsection we explored climate-related data over time per country. It can show interesting trends, such as decreasing consumption of renewable energy in Kenya. But it might also be worthwhile to explore countries in relation to others. For example, the R Shiny Dashboard below (Figure 2; link in the image label) shows CO2 emissions in 2019. China and the United States emerge as two countries releasing the most CO2 emissions with China clearly leading the way. If you look at the CO2 emissions data from 2000 (co2_2000), China and the unites States still stand out but, back them, the US was the main producer of CO2 emissions. What happened during this time? Perhaps this change is a result of China’s economic development, but does this mean that it is also is the leading country for the consumption of renewable energy in 2019? Alas, no. Interestingly, African countries consume the most renewable energy (see ren_2019 variable).

On the other hand, we can see that the United States has remained, more or less, a leading country by the total number of renewable energy generators, while China is yet to catch up on this (Figure 3). Actually, the consumption of renewable energy has dropped in China (look at China in the EDA by Country). This is perhaps a result of increased economic activity, resulting in increased production of CO2 emissions but not expanding its renewable energy infrastructure.

However, analysis of OSM data needs to be considered carefully as year-by-year change does not necessarily indicate how many new generators have been built. It might be a case that it used not to be a popular mapping object! For example, in 2007 there’s no data but it does not mean generators were not there. Finally, it needs to be mentioned that only about 50 countries are on the map for the pragmatic reasons – data for each country had to be downloaded seperately, thus making it a time-intensive activity. Therefore, we focused on countries that are also a part of WVS datasets, thus enabling some to join datasets to generate insights.

# This is a Shiny web application. You can run the application by clicking
# the 'Run App' button above.
#
# Find out more about building applications with Shiny here:
#
#    http://shiny.rstudio.com/
#
library(shinydashboard)
library(geojsonio)
library(sf)
library(dashboardthemes)
library(tmaptools)
library(ggplot2)
library(shiny)
library(tidyverse)
library(shinythemes)
library(viridis)
#library(RColorBrewer)
#library(fields)
#library(ggsci)
library(paletteer)
library(RColorBrewer)
library(fields)
library(ggsci)
library(sf)
library(geojsonio)
library(tmap)
# load data
sfdf <- geojson_read("climate_action_data.geojson",what="sp") %>% st_as_sf()
# read excel file for OSM data
OSM = readxl::read_excel("osm.xlsx") 
world = spData::world
OSM$timestamp <- substr(OSM$timestamp,0,4)
#%>% rename("count_ren"='0')
# pivot_wider()
OSM1 <- OSM %>% pivot_wider(names_from = timestamp)
# join with world data 
OSM_sf <- world %>% select(name_long) %>%  left_join(OSM1,by=c("name_long"="Country")) 
# load data for WVS
wvs <- read_csv("wvs.csv")
wv <- wvs %>% select(country_5) %>% left_join((sfdf %>% st_drop_geometry() %>% select(name_long,dis_ISO3)),by=(c("country_5"="name_long")))
#sfdf20 <- sfdf[c(sample(1:nrow(sfdf),20)),]
# Define UI for application that draws a histogram
ui <- dashboardPage(
    dashboardHeader(title =tags$a("UNBigDataHackathon2022", href="https://gretatimaite.github.io/campr/",target="_blank"),
                    titleWidth=250),
    dashboardSidebar(
      width=250,
      sidebarMenu(
        
        
        menuItem("EDA by country",tabName = "widgets_together"),
        menuItem("EDA world map",tabName = "radioButtons_tmap"),
        menuItem("EDA WVS",tabName = "selectInput"),
        menuItem("EDA world renewables",tabName = "OSM")
    )
    ),
    dashboardBody(
      shinyDashboardThemes(
        theme = "blue_gradient"),
      tabItems(
        tabItem(tabName = "selectInput",
                h1("World View Survey (WVS) data plotted per country"),
                h3("Is environmental protection more important than economic growth?"),
                fluidRow (
                  selectInput("name1",
                              "Select Country",
                              choices=(unique(wvs %>% select(country_4))),selected = "ARG"),
                  
                  
                  fluidRow(plotOutput(outputId = "wvs")))
                #mod_selectInput_ui("selectInput_1")
                ),
        tabItem(tabName = "widgets_together",
              h1("Plotted values for selected country of selected variables"),
              h3("Data is plotted for the 2000-2019 period (minus NA values that are country specific)"),
              p("Legend: ren (% of renewable energy), temp (average yearly temperature),
                gdp(Total Gross Domestic product), dis (number of disasters), co2 (total CO2 emissions)."),
               #mod_widgets_together_ui("widgets_together_1")
               fluidRow (
               selectInput("name",
                            "Select Country",
                            choices=(unique(sfdf %>% st_drop_geometry() %>% select(name_long))),selected = "Kenya"),
                selectInput(inputId="var",
                            label = "Select y-axis variable",
                            choices=c("ren","co2","temp","dis","gdp"),selected = "ren")
        ,
        fluidRow(plotOutput(outputId = "regression_model")))
                ),
        tabItem(tabName = "radioButtons_tmap",
                h2("World map of selected variable"),
                tags$p("Legend: ren (% of renewable energy), temp (average yearly temperature),
                gdp(Total Gross Domestic product), dis (number of disasters), co2 (total CO2 emissions)."),
                
                radioButtons(inputId="vars",label= NULL,inline=TRUE,
                             choices = colnames(sfdf %>% as.data.frame() %>% select(starts_with(c("ren","co2","gdp","temp","dis"))))),
                fluidRow(
                 h3("Choropleth map of variables"),tmapOutput(outputId = "map")),
                  
            
                #mod_radioButtons_tmap_ui("radioButtons_tmap_1")
                ),
        tabItem(tabName = "OSM",
                h2("World map of count of renewable energy generators"),
                tags$p("Count of generators per country per year."),
                
                radioButtons(inputId="year",label= NULL,inline=TRUE,
                             choices = colnames(OSM_sf %>% as.data.frame() %>% select(-c(name_long,geom)))),
                fluidRow(
                  h3("Choropleth map for selected year"),
                  tmapOutput(outputId = "mapOSM")),
                
                
                #mod_radioButtons_tmap_ui("radioButtons_tmap_1")
        )
      )))
# Define server logic required to draw a histogram
server <- function(input, output) {
  # mod_selectInput_server("selectInput_1")
  # mod_widgets_together_server("widgets_together_1")
  # mod_radioButtons_tmap_server("radioButtons_tmap_1")
# regression models
  output$regression_model <- renderPlot({
    
    # t.c <- df_leeds %>% pull(sym!!(input$res_var))
    # t.t <- df_leeds %>% pull(sym!!(input$lot_var))
    #modeldensity_leeds <- data.frame(sfdf %>% pull(!!sym(input$res_var)),sfdf %>% pull(!!sym(input$lot_var)))
    # fit regressions
    
    # sfdf20 %>% st_drop_geometry() %>% ggplot(aes(x=sfdf20 %>% pull(!!sym(input$res_var)),y=sfdf20 %>% pull(!!sym(input$lot_var)))) +
    #   geom_smooth(method='lm') +
    #   geom_point(alpha=0.5) +
    #   xlab(input$res_var) +
    #   ylab(input$lot_var) +
    #   labs(title=paste0("Regression of ",input$res_var," and ",input$lot_var))
    coln <- sfdf %>% filter(name_long==input$name) %>%  st_drop_geometry() %>% select(-c(dis_Country,dis_ISO3,co2_Country.Name,gdpPercap)) %>%
      select(starts_with(c(input$var))) %>% names()
    coln1 <- as.numeric(substr(coln,nchar(coln)-4+1,nchar(coln)))
    values <- sfdf %>% st_drop_geometry() %>%filter(name_long==input$name) %>%  select(-c(dis_Country,dis_ISO3,co2_Country.Name,gdpPercap)) %>%
      select(starts_with(c(input$var))) %>% unname() %>%  as_vector() 
    plot(x=coln1,y=values)
    # library(stringr)
    df <- data.frame(coln1,values)
    ggplot(df) + geom_line(aes(x=coln1,y=values))  + geom_point(aes(x=coln1,y=values),lwd=2) +
      xlab(input$var) +
      
      theme_minimal() 
  })
  
  output$map <- renderTmap({
    tmap_options(basemaps = "OpenStreetMap")
  
    tm_shape(sfdf) +
      tm_polygons(col= input$vars,palette=viridis(n=7),alpha = 0.5)
  }) # end of renderTmap
  
  output$mapOSM <- renderTmap({
    tmap_options(basemaps = "OpenStreetMap")
    
    tm_shape(OSM_sf) +
      tm_polygons(col= input$year,palette=viridis(n=7),alpha = 0.5)
  }) # end of renderTmap
  # make tmap
  
  output$wvs <- renderPlot({
    theme_set(theme_light())
    # Explore how responses have changed over time
    env_count <- wvs %>% 
      # Select evn columns
      mutate("country_5"=as_vector( wv[,2])) %>% 
      filter(country_5==input$name1 |country_4==input$name1 |country_6==input$name1 |country_7==input$name1  ) %>% 
      select(contains("env")) %>% 
      #filter(if_any(everything(), is.na)) %>%  
      # Reshape data for easy analysis
      pivot_longer(everything(), names_to = "env", values_to = "opinion") %>% 
      #mutate(across(everything()))
      # Drop missing values
      drop_na() %>% 
      mutate(opinion = factor(opinion)) %>% 
      # Count number of respondents in each category
      count(env, opinion) %>% 
      group_by(env) %>% 
      mutate(total = sum(n)) %>% 
      ungroup() %>% 
      mutate(pct = n/total) %>% 
      # Rename rows
      mutate(env = case_when(
        env == "env_4_num" ~ "wave_4",
        env == "env_5_num" ~ "wave_5",
        env == "env_6_num" ~ "wave_6",
        env == "env_7_num" ~ "wave_7"
      ))
    
    # Visualize this
    env_count %>% 
      ggplot(mapping = aes(x = env, y = pct*100)) +
      geom_col(aes(fill = opinion), position = "dodge", alpha = 0.8) +
      paletteer::scale_fill_paletteer_d("ggthemes::Tableau_10",
                                        labels=c("protect environment", " Economic growth", "Other")) +
      ggtitle("Protecting environment vs Economic growth") +
      labs(x = "survey period",
           y = "% of respondents in survey") +
      theme(plot.title = element_text(hjust = 0.5))
  })
    
}

# Run the application 
shinyApp(ui = ui, server = server)

Individual-level EDA

This subsecion will focus on World Values Survey data, which, as we mentioned in the “Data” section, provides a representative sample for a number of countries.

# Load awesomeness
library(tidyverse)
library(sf)
library(here)
library(paletteer)

# Load awesome data 
# If you loaded this data while going through "Data" section then you don't have to load it again!
# We are just aiming to make each section able to stand on its own 
wvs <- read_csv("data/wvs.csv")
geo_df <- st_read("data/climate_action_data.geojson")
osm <- read_csv("result.csv")

Is climate action more important than economic growth?

In the “Introduction” section we mentioned that people tend to agree that climate change is an emergency. We wanted to examine this too. Therefore, we used a question from WVS asking if environmental protection should be prioritised over economic growth as a proxy for peoples attitudes towards climate action. If an individual thinks that environment matters more, then there’s a positive attitude, otherwise a negative attitude is held.

Visualisation in Figure 4 supports existing research and, indeed, in the last 20 years people (all countries considered) are more supportive towards climate change. The increase is really evident in the latest survey (Wave 7) which was carried out in 2017-2022.

# Define cleaning function
sel_mut <- function(df, col, new_col){
  df %>% 
    select(contains(col)) %>% 
    mutate(wave = new_col) %>% 
    rename_with(~str_replace(.x, "[:digit:]", "") %>% str_remove("_"))
}
theme_set(theme_light())
# Explore how responses have changed over time
env_count <- wvs %>% 
  # Select evn columns
  select(contains("env")) %>% 
  #filter(if_any(everything(), is.na)) %>% 
  # Reshape data for easy analysis
  pivot_longer(everything(), names_to = "env", values_to = "opinion") %>% 
#mutate(across(everything()))
  # Drop missing values
  drop_na() %>% 
  mutate(opinion = factor(opinion)) %>% 
  # Count number of respondents in each category
  count(env, opinion) %>% 
  group_by(env) %>% 
  mutate(total = sum(n)) %>% 
  ungroup() %>% 
  mutate(pct = n/total) %>% 
  # Rename rows
  mutate(env = case_when(
    env == "env_4_num" ~ "wave_4",
    env == "env_5_num" ~ "wave_5",
    env == "env_6_num" ~ "wave_6",
    env == "env_7_num" ~ "wave_7"
  ))

# Visualize this
env_count %>% 
  ggplot(mapping = aes(x = env, y = pct*100)) +
  geom_col(aes(fill = opinion), position = "dodge", alpha = 0.8) +
  paletteer::scale_fill_paletteer_d("ggthemes::Tableau_10",
                                    labels=c("protect environment", " Economic growth", "Other")) +
  ggtitle("Protecting environment vs Economic growth") +
  labs(x = "survey period",
       y = "% of respondents in survey") +
  theme(plot.title = element_text(hjust = 0.5))

P.S. In the R Shiny dashboard you can also explore this quention on a country-level rather than aggregate data per wave.

Age and climate change attitudes

What demographic characteristics might shape attitudes towards the importance to protect environment rather than boost economic growth? We initially hypothesized that younger individuals are more supportive of climate action rather than older generation. Indeed, in Figure 5 we can observe similar trend – younger individuals prioritise economic protection over economic growth, yet this tends to vary from survey to sruvey, with a notable example of wave 6 conducted right after economic recession of 2007-2009.

# Reshape data for easier analysis
df <- wvs %>%
  sel_mut(col = "4", new_col = "wave_4") %>% 
  bind_rows(
    sel_mut(wvs, col = "5", new_col = "wave_5")
  ) %>%
  bind_rows(
    sel_mut(wvs, col = "6", new_col = "wave_6")
  ) %>% 
  bind_rows(
    sel_mut(wvs, col = "7", new_col = "wave_7")
  ) %>% 
  rename(env_opinion = env_num) %>% 
  mutate(across(everything(), factor))
df %>% 
  count(wave, env_opinion, age_num) %>% 
  drop_na() %>%
  group_by(age_num, wave) %>% 
  mutate(total = sum(n), pct = n/total) %>%
  ungroup() %>% 
  mutate(age_num = case_when(
    age_num == "1" ~ "16-24",
    age_num == "2" ~ "25-34",
    age_num == "3" ~ "35-44"
  )) %>% 
  ggplot(mapping = aes(x = age_num, y = pct*100)) +
  geom_col(aes(fill = env_opinion), position = "dodge", alpha = 0.8) +
  paletteer::scale_fill_paletteer_d("ggthemes::Tableau_10",
                                    labels=c("protect environment", " Economic growth", "Other")) +
  ggtitle("Protecting environment vs Economic growth") +
  labs(x = "age",
       y = "% of respondents in survey") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(vars(wave), scales = "free_x")

Income and climate change attitudes

Our previous graph (Figure 5) hinted that perhaps personal and global economic situation has more effects on climate action views than age. Indeed, this is further supported by Figure 6 which shows respondent’s income level against economic growth vs environmental support variable. Indeed, in all waves people are more supportive of environmental protection with an exception of respondents with low income in wave 6, which, as it was mentioned, was conducted right after the economic recession.

# Regroup income levels
df %>% 
  drop_na() %>% 
  mutate(income_num = as.numeric(income_num)) %>% 
  mutate(income = case_when(
    income_num < 4 ~ "low",
    income_num > 7 ~ "high",
    TRUE ~ "middle"
  ),
  
  
  # Account for changes in wave7 encoding
  income = case_when(
    wave == "wave_7" & income_num == 1 ~ "low", 
    wave == "wave_7" & income_num == 2 ~ "middle",
    wave == "wave_7" & income_num == 3 ~ "high",
    TRUE ~ income
  ),
  
  
  income = factor(income, levels = c("high", "middle", "low"))) %>%
  # Count people in a wave sharing same income level and opinion
  count(wave, env_opinion, income) %>% 
  # For people in a wave sharing the same income, what's their total
  group_by(income, wave) %>% 
  # Find how opinion varies among folks sharing the same income 
  mutate(total = sum(n), pct = n/total) %>%
  ungroup() %>% 
  ggplot(mapping = aes(x = income, y = pct*100)) +
  geom_col(aes(fill = env_opinion), position = "dodge", alpha = 0.8) +
  paletteer::scale_fill_paletteer_d("ggthemes::Tableau_10",
                                    labels=c("protect environment", " Economic growth", "Other")) +
  ggtitle("Protecting environment vs Economic growth") +
  labs(x = "income_levels",
       y = "% of respondents in survey") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(vars(wave), scales = "free_x")

Education and climate change attitudes

Finally, we looked into education. In the Peoples’ Climate Vote it is reported that education is the most significant factor behind views towards climate change – more educated individuals tend to recognise climate change as a global problem regardless of their country of origin. The results of the report align with the insights provided by Figure 7. It is clear that individuals with higher education (e.g. with Bachelor’s degree) are more supportive towards environmental protection.

df %>% 
  mutate(education_num = as.numeric(education_num)) %>% 
  mutate(education = case_when(
    education_num < 3 ~ "lower",
    education_num > 4 ~ "higher",
    TRUE ~ "middle"
  ),
  
  # Account for changes in wave7 encoding
  education = case_when(
    wave == "wave_7" & education_num == 2 ~ "middle", 
    wave == "wave_7" & education_num == 3 ~ "higher",
    TRUE ~ education
  ),
  
  
  education = factor(education, levels = c("higher", "middle", "lower"))) %>%
  # Count people in a wave sharing same education level and opinion
  count(wave, env_opinion, education) %>%
  drop_na() %>% 
  # For people in a wave sharing the same education, what's their total
  group_by(education, wave) %>% 
  # Find how opinion varies among folks sharing the same education level
  mutate(total = sum(n), pct = n/total) %>%
  ungroup() %>% 
  ggplot(mapping = aes(x = education, y = pct*100)) +
  geom_col(aes(fill = env_opinion), position = "dodge", alpha = 0.8) +
  paletteer::scale_fill_paletteer_d("ggthemes::Tableau_10",
                                    labels=c("protect environment", " Economic growth", "Other")) +
  ggtitle("Protecting environment vs Economic growth") +
  labs(x = "education_levels",
       y = "% of respondents in survey") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(vars(wave), scales = "free_x")

Final comments

In this section we used interactive and static visualisations to explore the data and form some hypothesis of what might data might be telling us. For instance, we observed China emerging as a leading CO2 emissions producer over the last 20 years, yet it remains mediocre in terms of the number of renewable energy generators present in the country. Indeed, if you look, the total consumption of renewable energy has dropped in China. On the population level we noticed that there is a clear trend of more education people to support environmental protection. On the other hand, age and income provide less evident affects.