Forests in Transition: Visualizing Global Deforestation

Proposal

Data visualization

TidyTuesday

Packages Setup

Installed Packages

### GETTING THE LIBRARIES
if (!require(pacman))
  install.packages(pacman)

pacman::p_load(tidyverse,
               tidytuesdayR,
               dlookr,
               formattable)

Data-set

# Getting the Data using the tidytuesdayR package 
deforestation_data <- tidytuesdayR::tt_load(2021, week = 15)

# Getting all the underlying data in the dataset
forest        <- deforestation_data$forest
forest_area   <- deforestation_data$forest_area
brazil_loss   <- deforestation_data$brazil_loss
soybean_use   <- deforestation_data$soybean_use
vegetable_oil <- deforestation_data$vegetable_oil

#Data is read to deforestation_by_source from a raw csv file which is in github , as it is not being downloaded from the tidytuesdayR package.
deforestation_by_source <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/deforestation_by_source.csv')

The Global Deforestation data-set is published by Hannah Ritchie and Max Roser (2021) in the “Our World in Data” journal. This data-set contains comprehensive information on global forest cover, deforestation, and related factors. This data-set includes multiple data attributes:

forest : Data on net forest conversion and change in forest cover by country, over time.
forest_area : Information on the change in global forest area as a percent of the global forest area.
brazil_loss : Details on the loss of Brazilian forest due to various factors.
soybean_use : Data on soybean production and use for the years and countries.
vegetable_oil : Vegetable oil production by crop type and year.

Forest Data

Forest Data Diagnosis Code

# Get basic information about the dataset
# The properties of the data
forest |>
  diagnose() |>
  formattable()

variables	types	missing_count	missing_percent	unique_count	unique_rate
entity	character	0	0.000000	132	0.277894737
code	character	8	1.684211	131	0.275789474
year	numeric	0	0.000000	4	0.008421053
net_forest_conversion	numeric	0	0.000000	322	0.677894737

The Forest data-set contains information on the change in forest area every 5 years. With 475 observations, the data-set contains 4 variables: entity (country), code, year, and the net_forest_conversion in hectares. The column code has some missing values, around 1%, and we won’t be using that column since we have entity column.

Forest Area Data

Forest Area Data Diagnosis Code

# Get basic information about the dataset
# The properties of the data   
forest_area |>
  diagnose() |>
  formattable()

variables	types	missing_count	missing_percent	unique_count	unique_rate
entity	character	0	0.00000	260	0.033137905
code	character	1072	13.66301	225	0.028677033
year	numeric	0	0.00000	31	0.003951058
forest_area	numeric	0	0.00000	7532	0.959979607

The Forest area data-set looks at the change in global forest area as a percent of global forest area amongst a sample of 7846 observations. The data collected consists of 4 variables: entity, code, year, and forest_area (percentage of forest). The column code has some missing values around 13% and this column will not be used as we have entity column.

Brazil Loss Data

Brazil Loss Data Diagnosis Code

# Get basic information about the dataset
# The properties of the data
brazil_loss |>
  diagnose() |>
  formattable()

variables	types	unique_count	unique_rate
entity	character	1	0.07692308
code	character	1	0.07692308
year	numeric	13	1.00000000
commercial_crops	numeric	12	0.92307692
flooding_due_to_dams	numeric	5	0.38461538
natural_disturbances	numeric	10	0.76923077
pasture	numeric	13	1.00000000
selective_logging	numeric	10	0.76923077
fire	numeric	9	0.69230769
mining	numeric	4	0.30769231
other_infrastructure	numeric	5	0.38461538
roads	numeric	8	0.61538462
tree_plantations_including_palm	numeric	8	0.61538462
small_scale_clearing	numeric	12	0.92307692

The Brazil loss data-set compares loss of Brazilian forest across different types of forest disturbances.13 observations are analyzed amongst 14 variables which include entity, code, year, commercial_crops, flooding_due_to_dam, natural_disturbance, pasture for livestock, selective_logging for lumber, fire loss, mining, other_infrastructure, roads, tree_plantation, and small_scale_clearing. There are no missing values in the data after running diagnosis() function.

Soybean Usage Data

Soybean Usage Data Diagnosis Code

# Get basic information about the dataset
# The properties of the data
soybean_use |>
  diagnose() |>
  formattable()

variables	types	missing_count	missing_percent	unique_count	unique_rate
entity	character	0	0.000000	207	0.020915429
code	character	1734	17.520461	174	0.017581085
year	numeric	0	0.000000	53	0.005355158
human_food	numeric	215	2.172375	712	0.071940992
animal_feed	numeric	5538	55.956350	705	0.071233707
processed	numeric	3644	36.819238	1989	0.200969991

The “Soybean_use” data-set consists of information relating to soybean consumption and use by year and country. 9897 observations were analyzed across 6 variables: entity, code, year, use for human food (e.g., tempeh, tofu), used for animal food, and processed into vegetable oil/bio-fuel/processed animal feed. The columns animal_feed and processed are having significant missing values.

Vegetable oil Data

Vegetable Oil Data Diagnosis Code

# Get basic information about the dataset
# The properties of the data
vegetable_oil |>
  diagnose() |>
  formattable()

variables	types	missing_count	missing_percent	unique_count	unique_rate
entity	character	0	0.00000	232	1.612993e-03
code	character	22633	15.73572	199	1.383559e-03
year	numeric	0	0.00000	54	3.754380e-04
crop_oil	character	0	0.00000	13	9.038322e-05
production	numeric	85635	59.53821	26443	1.838464e-01

In the vegetable oil data-set, 143,832 observations were used to analyze the vegetable oil production by crop type and year. The variables consists of entity, code, year, crop oil and production which contains the production vegetable oil, and oil production in tons. The column production has significant missing values.

Deforestation by Source

Deforestation Source Data Diagnosis Code

# The properties of the data
deforestation_by_source |>
  diagnose() |>
  formattable()

variables	types	missing_count	missing_percent	unique_count	unique_rate
Entity	character	0	0	10	1.0
Code	logical	10	100	1	0.1
Year	numeric	0	0	1	0.1
Forest loss (ha)	numeric	0	0	10	1.0

There are 10 observations in the deforestation_by_source data-set which compares different farming entities to forest loss by year. The data-set includes four variables: entity, country, year, and forest loss. The column code has some missing values around 100% and we won’t be using that column, since we have entity column.

Why we chose this data-set?

The selection of data-sets for this project is driven by both technical and analytical considerations, particularly related to data visualization. Moreover, deforestation is a serious environmental problem with far-reaching effects. Understanding its trends, drivers, and impacts is critical for informed decision-making and environmental protection. The data-set from Our World in Data provides a reliable data with less junk data and also contains comprehensive set of variables related to deforestation in this scenario.

The data-set is well-structured, making them suitable for editing and visualizing data. It contains various tables like forest, brazil_loss, soybean_use, forest_area, vegetable_oil and deforestation_by_source that allows analysis to target specific aspects of deforestation. Different variables and data dimensions allow us to create diverse visualizations to consider different aspects of deforestation and investigate them.

The richness of the data-set provides many opportunities to create effective data visualizations. From time series graphs to choropleth maps to scatter plots, there are many visualization techniques that can effectively convey complex information. In summary, the selection of this data-set is driven by its technical suitability for analysis and its relevance to the critical issue of deforestation.

Questions

The two questions to be answered are:

Question 1: What does the global forest area look like over past decades, highlighting the trends of forest area conversion?

Question 2: How has the production of Soybean in Brazil changed over time, and how does it impact the afforestation or deforestation rates?

Analysis plan

The following are the approaches we will be using for each question.

Approach for question 1

We want to visualize how the global area under forests has changed over the years. To represent the available data best, we will be creating a choropleth map of the world that displays the net forest conversion across the world. The comparison of the plot will be done through 1990-2015 for each decade. We are planning to clean it further to only get certain parts that are required using dplyr. For instance, we will focus on examining forest data to address our first research question by determining which columns are relevant to our analysis, which include net_forest_conversion, year and entity. Additionally, data cleaning and preparation will be performed in order to account for missing data. Since we will be analyzing temporal trends, it will prove vital to filter out missing values for the variable year.

And also the data-set doesn’t contain any geographical information to use them to create a map plot. But it has country information in entity variable, which can be used to get the geographical information. We will be using the maps package as an external data source to get the relevant information using map_data() function and then merge the obtained data with our data using the country variable. This creates new variables latitude and longitudes of the respective countries.

Then we will us the obtained data-set to plot the map plot using geom_polygon() from ggplot. Also, in the data visualization we will attempt to make the plot interactive using plotly package allowing users to enhance their understanding.

Approach for question 2

We intend to illustrate the trend in the production of soybean over the years in Brazil in this plot. To calculate the entire production, three different soybean consumption are combined together. To perform this task, we are going to use the TidyTuesday data-set which was sourced from Our World in Data and perform data manipulation to obtain a new column showing the total production of soybean. We will be primarily focusing on the soyabean_use data, which is comprised of the columns (variables) human_food ,animal_feed and processed. Utilizing dplyr, we will filter data in the column entity to exclude data from outside the continent of Brazil as the variable includes various countries and continents. Similar to our approach for Question 1, data cleaning and preparation will be performed in order to account for missing data.

The rate of change in production will then be visualized using ggplot’s geom_line() and geom_point() methods to construct a time series plot (line and point graph). To evaluate the overall trend in both, we will also compare and correlate changes in soybean production with rates of deforestation and afforestation. The 1990–2013 period will be used for this data comparison as the data-set provides an abundance of useful insights. We also might use other data present in the data-set to get some correlation. In order to correlate the output of soyabean and forest area, we may also utilize bubble plot to observe how the forest area changed specifically in Brazil.

Variables of focus for both questions:

Variable	Description	Source Data-set
entity	Country	forest, forest_area and soybean
code	Country Code	forest, forest_area and soybean
year	Year	forest, forest_area and soybean
net_forest_conversion	Net forest conversion in hectares	forest
forest_area	Percent of global forest area	forest_area
human_food	Use for human food (tempeh, tofu, etc)	soybean
animal_feed	Used for animal food	soybean
processed	Processed into vegetable oil, biofuel, processed animal feed	soybean

Note:

These are the planned approaches, and we intend to explore and solve the problem statement which we came up with. Parts of our approach might change in the final project.