Forests in Transition: Visualizing Global Deforestation

Proposal

Data visualization
TidyTuesday

Packages Setup

Installed Packages
### GETTING THE LIBRARIES
if (!require(pacman))
  install.packages(pacman)

pacman::p_load(tidyverse,
               tidytuesdayR,
               dlookr,
               formattable)

Data-set

# Getting the Data using the tidytuesdayR package 
deforestation_data <- tidytuesdayR::tt_load(2021, week = 15)

# Getting all the underlying data in the dataset
forest        <- deforestation_data$forest
forest_area   <- deforestation_data$forest_area
brazil_loss   <- deforestation_data$brazil_loss
soybean_use   <- deforestation_data$soybean_use
vegetable_oil <- deforestation_data$vegetable_oil

#Data is read to deforestation_by_source from a raw csv file which is in github , as it is not being downloaded from the tidytuesdayR package.
deforestation_by_source <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/deforestation_by_source.csv')

The Global Deforestation data-set is published by Hannah Ritchie and Max Roser (2021) in the “Our World in Data” journal. This data-set contains comprehensive information on global forest cover, deforestation, and related factors. This data-set includes multiple data attributes:

  • forest : Data on net forest conversion and change in forest cover by country, over time.
  • forest_area : Information on the change in global forest area as a percent of the global forest area.
  • brazil_loss : Details on the loss of Brazilian forest due to various factors.
  • soybean_use : Data on soybean production and use for the years and countries.
  • vegetable_oil : Vegetable oil production by crop type and year.

Forest Data

Forest Data Diagnosis Code
# Get basic information about the dataset
# The properties of the data
forest |>
  diagnose() |>
  formattable()
variables types missing_count missing_percent unique_count unique_rate
entity character 0 0.000000 132 0.277894737
code character 8 1.684211 131 0.275789474
year numeric 0 0.000000 4 0.008421053
net_forest_conversion numeric 0 0.000000 322 0.677894737

The Forest data-set contains information on the change in forest area every 5 years. With 475 observations, the data-set contains 4 variables: entity (country), code, year, and the net_forest_conversion in hectares. The column code has some missing values, around 1%, and we won’t be using that column since we have entity column.

Forest Area Data

Forest Area Data Diagnosis Code
# Get basic information about the dataset
# The properties of the data   
forest_area |>
  diagnose() |>
  formattable()
variables types missing_count missing_percent unique_count unique_rate
entity character 0 0.00000 260 0.033137905
code character 1072 13.66301 225 0.028677033
year numeric 0 0.00000 31 0.003951058
forest_area numeric 0 0.00000 7532 0.959979607

The Forest area data-set looks at the change in global forest area as a percent of global forest area amongst a sample of 7846 observations. The data collected consists of 4 variables: entity, code, year, and forest_area (percentage of forest). The column code has some missing values around 13% and this column will not be used as we have entity column.

Brazil Loss Data

Brazil Loss Data Diagnosis Code
# Get basic information about the dataset
# The properties of the data
brazil_loss |>
  diagnose() |>
  formattable()
variables types missing_count missing_percent unique_count unique_rate
entity character 0 0 1 0.07692308
code character 0 0 1 0.07692308
year numeric 0 0 13 1.00000000
commercial_crops numeric 0 0 12 0.92307692
flooding_due_to_dams numeric 0 0 5 0.38461538
natural_disturbances numeric 0 0 10 0.76923077
pasture numeric 0 0 13 1.00000000
selective_logging numeric 0 0 10 0.76923077
fire numeric 0 0 9 0.69230769
mining numeric 0 0 4 0.30769231
other_infrastructure numeric 0 0 5 0.38461538
roads numeric 0 0 8 0.61538462
tree_plantations_including_palm numeric 0 0 8 0.61538462
small_scale_clearing numeric 0 0 12 0.92307692

The Brazil loss data-set compares loss of Brazilian forest across different types of forest disturbances.13 observations are analyzed amongst 14 variables which include entity, code, year, commercial_crops, flooding_due_to_dam, natural_disturbance, pasture for livestock, selective_logging for lumber, fire loss, mining, other_infrastructure, roads, tree_plantation, and small_scale_clearing. There are no missing values in the data after running diagnosis() function.

Soybean Usage Data

Soybean Usage Data Diagnosis Code
# Get basic information about the dataset
# The properties of the data
soybean_use |>
  diagnose() |>
  formattable()
variables types missing_count missing_percent unique_count unique_rate
entity character 0 0.000000 207 0.020915429
code character 1734 17.520461 174 0.017581085
year numeric 0 0.000000 53 0.005355158
human_food numeric 215 2.172375 712 0.071940992
animal_feed numeric 5538 55.956350 705 0.071233707
processed numeric 3644 36.819238 1989 0.200969991

The “Soybean_use” data-set consists of information relating to soybean consumption and use by year and country. 9897 observations were analyzed across 6 variables: entity, code, year, use for human food (e.g., tempeh, tofu), used for animal food, and processed into vegetable oil/bio-fuel/processed animal feed. The columns animal_feed and processed are having significant missing values.

Vegetable oil Data

Vegetable Oil Data Diagnosis Code
# Get basic information about the dataset
# The properties of the data
vegetable_oil |>
  diagnose() |>
  formattable()
variables types missing_count missing_percent unique_count unique_rate
entity character 0 0.00000 232 1.612993e-03
code character 22633 15.73572 199 1.383559e-03
year numeric 0 0.00000 54 3.754380e-04
crop_oil character 0 0.00000 13 9.038322e-05
production numeric 85635 59.53821 26443 1.838464e-01

In the vegetable oil data-set, 143,832 observations were used to analyze the vegetable oil production by crop type and year. The variables consists of entity, code, year, crop oil and production which contains the production vegetable oil, and oil production in tons. The column production has significant missing values.

Deforestation by Source

Deforestation Source Data Diagnosis Code
# The properties of the data
deforestation_by_source |>
  diagnose() |>
  formattable()
variables types missing_count missing_percent unique_count unique_rate
Entity character 0 0 10 1.0
Code logical 10 100 1 0.1
Year numeric 0 0 1 0.1
Forest loss (ha) numeric 0 0 10 1.0

There are 10 observations in the deforestation_by_source data-set which compares different farming entities to forest loss by year. The data-set includes four variables: entity, country, year, and forest loss. The column code has some missing values around 100% and we won’t be using that column, since we have entity column.

Why we chose this data-set?

The selection of data-sets for this project is driven by both technical and analytical considerations, particularly related to data visualization. Moreover, deforestation is a serious environmental problem with far-reaching effects. Understanding its trends, drivers, and impacts is critical for informed decision-making and environmental protection. The data-set from Our World in Data provides a reliable data with less junk data and also contains comprehensive set of variables related to deforestation in this scenario.

The data-set is well-structured, making them suitable for editing and visualizing data. It contains various tables like forest, brazil_loss, soybean_use, forest_area, vegetable_oil and deforestation_by_source that allows analysis to target specific aspects of deforestation. Different variables and data dimensions allow us to create diverse visualizations to consider different aspects of deforestation and investigate them.

The richness of the data-set provides many opportunities to create effective data visualizations. From time series graphs to choropleth maps to scatter plots, there are many visualization techniques that can effectively convey complex information. In summary, the selection of this data-set is driven by its technical suitability for analysis and its relevance to the critical issue of deforestation.

Questions

The two questions to be answered are:

Question 1: What does the global forest area look like over past decades, highlighting the trends of forest area conversion?

Question 2: How has the production of Soybean in Brazil changed over time, and how does it impact the afforestation or deforestation rates?

Analysis plan

The following are the approaches we will be using for each question.

Approach for question 1

We want to visualize how the global area under forests has changed over the years. To represent the available data best, we will be creating a choropleth map of the world that displays the net forest conversion across the world. The comparison of the plot will be done through 1990-2015 for each decade. We are planning to clean it further to only get certain parts that are required using dplyr. For instance, we will focus on examining forest data to address our first research question by determining which columns are relevant to our analysis, which include net_forest_conversion, year and entity. Additionally, data cleaning and preparation will be performed in order to account for missing data. Since we will be analyzing temporal trends, it will prove vital to filter out missing values for the variable year.

And also the data-set doesn’t contain any geographical information to use them to create a map plot. But it has country information in entity variable, which can be used to get the geographical information. We will be using the maps package as an external data source to get the relevant information using map_data() function and then merge the obtained data with our data using the country variable. This creates new variables latitude and longitudes of the respective countries.

Then we will us the obtained data-set to plot the map plot using geom_polygon() from ggplot. Also, in the data visualization we will attempt to make the plot interactive using plotly package allowing users to enhance their understanding.

Approach for question 2

We intend to illustrate the trend in the production of soybean over the years in Brazil in this plot. To calculate the entire production, three different soybean consumption are combined together. To perform this task, we are going to use the TidyTuesday data-set which was sourced from Our World in Data and perform data manipulation to obtain a new column showing the total production of soybean. We will be primarily focusing on the soyabean_use data, which is comprised of the columns (variables) human_food ,animal_feed and processed. Utilizing dplyr, we will filter data in the column entity to exclude data from outside the continent of Brazil as the variable includes various countries and continents. Similar to our approach for Question 1, data cleaning and preparation will be performed in order to account for missing data.

The rate of change in production will then be visualized using ggplot’s geom_line() and geom_point() methods to construct a time series plot (line and point graph). To evaluate the overall trend in both, we will also compare and correlate changes in soybean production with rates of deforestation and afforestation. The 1990–2013 period will be used for this data comparison as the data-set provides an abundance of useful insights. We also might use other data present in the data-set to get some correlation. In order to correlate the output of soyabean and forest area, we may also utilize bubble plot to observe how the forest area changed specifically in Brazil.

Variables of focus for both questions:

Variable Description Source Data-set
entity Country forest, forest_area and soybean
code Country Code forest, forest_area and soybean
year Year forest, forest_area and soybean
net_forest_conversion Net forest conversion in hectares forest
forest_area Percent of global forest area forest_area
human_food Use for human food (tempeh, tofu, etc) soybean
animal_feed Used for animal food soybean
processed Processed into vegetable oil, biofuel, processed animal feed soybean
Note:

These are the planned approaches, and we intend to explore and solve the problem statement which we came up with. Parts of our approach might change in the final project.