Forests in Transition: Visualizing Global Deforestation
Proposal
Packages Setup
Data-set
# Getting the Data using the tidytuesdayR package
deforestation_data <- tidytuesdayR::tt_load(2021, week = 15)
# Getting all the underlying data in the dataset
forest <- deforestation_data$forest
forest_area <- deforestation_data$forest_area
brazil_loss <- deforestation_data$brazil_loss
soybean_use <- deforestation_data$soybean_use
vegetable_oil <- deforestation_data$vegetable_oil
# deforestation_by_source is not returned by tt_load(), so read it directly from the raw CSV file on GitHub
deforestation_by_source <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/deforestation_by_source.csv')
The Global Deforestation data-set was published by Hannah Ritchie and Max Roser (2021) on Our World in Data. It contains comprehensive information on global forest cover, deforestation, and related factors, and is split into several tables:
`forest`
: Data on net forest conversion and change in forest cover by country, over time.

`forest_area`
: Information on the change in global forest area as a percent of the global forest area.

`brazil_loss`
: Details on the loss of Brazilian forest due to various factors.

`soybean_use`
: Data on soybean production and use by year and country.

`vegetable_oil`
: Vegetable oil production by crop type and year.
Forest Data
Forest Data Diagnosis Code
variables | types | missing_count | missing_percent | unique_count | unique_rate |
---|---|---|---|---|---|
entity | character | 0 | 0.000000 | 132 | 0.277894737 |
code | character | 8 | 1.684211 | 131 | 0.275789474 |
year | numeric | 0 | 0.000000 | 4 | 0.008421053 |
net_forest_conversion | numeric | 0 | 0.000000 | 322 | 0.677894737 |
The Forest data-set records the net change in forest area for each entity in four reporting years. It has 475 observations and 4 variables: `entity` (country), `code`, `year`, and `net_forest_conversion` in hectares. The `code` column has about 1.7% missing values; we will not use it, since the `entity` column already identifies each country.
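The diagnosis tables in this section can be approximated in base R. The following is a minimal sketch, not the code used to produce the tables: `diagnose_sketch()` and the `toy` data frame are hypothetical names introduced here for illustration, and the sketch assumes that a missing value counts as one distinct value in `unique_count`.

```r
# Hypothetical helper: per-column missingness and uniqueness summary.
diagnose_sketch <- function(df) {
  n <- nrow(df)
  out <- data.frame(
    variables       = names(df),
    types           = vapply(df, function(x) class(x)[1], character(1)),
    missing_count   = vapply(df, function(x) sum(is.na(x)), integer(1)),
    missing_percent = vapply(df, function(x) mean(is.na(x)) * 100, numeric(1)),
    # Assumption: NA counts as one distinct value, so an all-NA column gets 1
    unique_count    = vapply(df, function(x) length(unique(x)), integer(1)),
    unique_rate     = vapply(df, function(x) length(unique(x)) / n, numeric(1)),
    stringsAsFactors = FALSE
  )
  rownames(out) <- NULL
  out
}

# Toy data (made up) with one missing country code
toy <- data.frame(
  entity = c("A", "A", "B", "C"),
  code   = c("AAA", "AAA", NA, "CCC"),
  year   = c(1990, 2000, 1990, 1990),
  stringsAsFactors = FALSE
)

diag_toy <- diagnose_sketch(toy)
print(diag_toy)
```

Under this definition, `unique_rate` is simply `unique_count` divided by the number of rows, which is consistent with the figures reported above (for example, 132 / 475 for `entity`).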
Forest Area Data
Forest Area Data Diagnosis Code
variables | types | missing_count | missing_percent | unique_count | unique_rate |
---|---|---|---|---|---|
entity | character | 0 | 0.00000 | 260 | 0.033137905 |
code | character | 1072 | 13.66301 | 225 | 0.028677033 |
year | numeric | 0 | 0.00000 | 31 | 0.003951058 |
forest_area | numeric | 0 | 0.00000 | 7532 | 0.959979607 |
The Forest area data-set looks at the change in global forest area as a percent of global forest area, across 7,846 observations. It has 4 variables: `entity`, `code`, `year`, and `forest_area` (percentage of forest). The `code` column has about 13.7% missing values and will not be used, since we have the `entity` column.
Brazil Loss Data
Brazil Loss Data Diagnosis Code
variables | types | missing_count | missing_percent | unique_count | unique_rate |
---|---|---|---|---|---|
entity | character | 0 | 0 | 1 | 0.07692308 |
code | character | 0 | 0 | 1 | 0.07692308 |
year | numeric | 0 | 0 | 13 | 1.00000000 |
commercial_crops | numeric | 0 | 0 | 12 | 0.92307692 |
flooding_due_to_dams | numeric | 0 | 0 | 5 | 0.38461538 |
natural_disturbances | numeric | 0 | 0 | 10 | 0.76923077 |
pasture | numeric | 0 | 0 | 13 | 1.00000000 |
selective_logging | numeric | 0 | 0 | 10 | 0.76923077 |
fire | numeric | 0 | 0 | 9 | 0.69230769 |
mining | numeric | 0 | 0 | 4 | 0.30769231 |
other_infrastructure | numeric | 0 | 0 | 5 | 0.38461538 |
roads | numeric | 0 | 0 | 8 | 0.61538462 |
tree_plantations_including_palm | numeric | 0 | 0 | 8 | 0.61538462 |
small_scale_clearing | numeric | 0 | 0 | 12 | 0.92307692 |
The Brazil loss data-set compares the loss of Brazilian forest across different types of forest disturbance. It has 13 observations and 14 variables: `entity`, `code`, `year`, `commercial_crops`, `flooding_due_to_dams`, `natural_disturbances`, `pasture` for livestock, `selective_logging` for lumber, `fire` loss, `mining`, `other_infrastructure`, `roads`, `tree_plantations_including_palm`, and `small_scale_clearing`. Running the `diagnose()` function shows no missing values in this data.
Soybean Usage Data
Soybean Usage Data Diagnosis Code
variables | types | missing_count | missing_percent | unique_count | unique_rate |
---|---|---|---|---|---|
entity | character | 0 | 0.000000 | 207 | 0.020915429 |
code | character | 1734 | 17.520461 | 174 | 0.017581085 |
year | numeric | 0 | 0.000000 | 53 | 0.005355158 |
human_food | numeric | 215 | 2.172375 | 712 | 0.071940992 |
animal_feed | numeric | 5538 | 55.956350 | 705 | 0.071233707 |
processed | numeric | 3644 | 36.819238 | 1989 | 0.200969991 |
The `soybean_use` data-set contains information on soybean consumption and use by year and country. It has 9,897 observations and 6 variables: `entity`, `code`, `year`, `human_food` (use for human food, e.g., tempeh, tofu), `animal_feed` (use for animal feed), and `processed` (processed into vegetable oil, bio-fuel, or processed animal feed). The columns `animal_feed` and `processed` have substantial missing values (about 56% and 37%, respectively).
Vegetable oil Data
Vegetable Oil Data Diagnosis Code
variables | types | missing_count | missing_percent | unique_count | unique_rate |
---|---|---|---|---|---|
entity | character | 0 | 0.00000 | 232 | 1.612993e-03 |
code | character | 22633 | 15.73572 | 199 | 1.383559e-03 |
year | numeric | 0 | 0.00000 | 54 | 3.754380e-04 |
crop_oil | character | 0 | 0.00000 | 13 | 9.038322e-05 |
production | numeric | 85635 | 59.53821 | 26443 | 1.838464e-01 |
The vegetable oil data-set has 143,832 observations describing vegetable oil production by crop type and year. Its variables are `entity`, `code`, `year`, `crop_oil` (the crop type), and `production` (oil production in tons). The `production` column has substantial missing values (about 60%).
Deforestation by Source
Deforestation Source Data Diagnosis Code
variables | types | missing_count | missing_percent | unique_count | unique_rate |
---|---|---|---|---|---|
Entity | character | 0 | 0 | 10 | 1.0 |
Code | logical | 10 | 100 | 1 | 0.1 |
Year | numeric | 0 | 0 | 1 | 0.1 |
Forest loss (ha) | numeric | 0 | 0 | 10 | 1.0 |
The `deforestation_by_source` data-set has 10 observations, which compare forest loss across different farming entities for a single year. It includes four variables: `Entity`, `Code`, `Year`, and `Forest loss (ha)`. The `Code` column is entirely missing (100%), so we will not use it, since we have the `Entity` column.
Why did we choose this data-set?
The selection of data-sets for this project is driven by both technical and analytical considerations, particularly those related to data visualization. Deforestation is a serious environmental problem with far-reaching effects, and understanding its trends, drivers, and impacts is critical for informed decision-making and environmental protection. The data-set from Our World in Data provides reliable data with little junk and contains a comprehensive set of variables related to deforestation.
The data-set is well-structured, which makes it suitable for data wrangling and visualization. It contains several tables (`forest`, `brazil_loss`, `soybean_use`, `forest_area`, `vegetable_oil` and `deforestation_by_source`) that allow the analysis to target specific aspects of deforestation. The range of variables and data dimensions lets us create diverse visualizations that examine deforestation from different angles.
The richness of the data-set provides many opportunities to create effective data visualizations. From time series graphs to choropleth maps to scatter plots, there are many visualization techniques that can effectively convey complex information. In summary, the selection of this data-set is driven by its technical suitability for analysis and its relevance to the critical issue of deforestation.
Questions
The two questions to be answered are:
Question 1: How has global forest area changed over the past decades, and what trends does net forest conversion show?
Question 2: How has the production of soybean in Brazil changed over time, and how does it impact afforestation or deforestation rates?
Analysis plan
The following are the approaches we will be using for each question.
Approach for question 1
We want to visualize how the global area under forests has changed over the years. To represent the available data best, we will create a choropleth map of the world that displays net forest conversion by country. We will compare maps across the 1990-2015 period, roughly one per decade. Using `dplyr`, we will reduce the data to only the parts we need. For our first research question we will work with the `forest` data and keep the columns relevant to the analysis: `net_forest_conversion`, `year` and `entity`. Data cleaning and preparation will also be performed to account for missing data. Since we will be analyzing temporal trends, it is vital to filter out rows where the variable `year` is missing.
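The column selection and missing-value filtering described above could look roughly like the following. This is a sketch under assumptions: `forest_toy` is a made-up stand-in for the real `forest` table, and its values are invented for illustration only.

```r
library(dplyr)

# Toy stand-in for the `forest` table (values are made up)
forest_toy <- data.frame(
  entity = c("Brazil", "Brazil", "China", "World"),
  code   = c("BRA", "BRA", "CHN", NA),
  year   = c(1990, 2000, 1990, NA),
  net_forest_conversion = c(-100, -200, 150, -50),
  stringsAsFactors = FALSE
)

forest_clean <- forest_toy %>%
  # keep only the columns needed for Question 1
  select(entity, year, net_forest_conversion) %>%
  # drop rows that cannot be placed in time or have no value to plot
  filter(!is.na(year), !is.na(net_forest_conversion))

print(forest_clean)
```

Dropping `code` at this stage is harmless for the map, because the join to geographic data is planned on the country name rather than the country code.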
The data-set itself contains no geographical information for drawing a map, but the `entity` variable holds country names, which we can use to look that information up. We will use the `maps` package as an external data source, retrieving country outlines through ggplot2's `map_data()` function and merging them with our data on the country name. The merge adds the longitude and latitude coordinates of each country's outline as new variables.
We will then use the merged data-set to draw the map with `geom_polygon()` from ggplot2. We will also attempt to make the plot interactive using the `plotly` package, so that users can explore the values for individual countries.
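A minimal sketch of the planned choropleth follows. It assumes the `ggplot2`, `dplyr` and `maps` packages are installed; the `forest_year` values are made up, and the colour scale and labels are illustrative choices rather than the final design.

```r
library(ggplot2)
library(dplyr)

# Country outlines; map_data() needs the maps package to be installed
world <- map_data("world")

# Made-up values standing in for one year of cleaned `forest` data
forest_year <- data.frame(
  region = c("Brazil", "China", "India"),
  net_forest_conversion = c(-1500000, 1900000, 260000),
  stringsAsFactors = FALSE
)

# left_join preserves the row order of `world`, which polygons rely on
map_forest <- left_join(world, forest_year, by = "region")

p <- ggplot(map_forest,
            aes(x = long, y = lat, group = group,
                fill = net_forest_conversion)) +
  geom_polygon(colour = "grey40") +
  # diverging scale: losses and gains on either side of zero
  scale_fill_gradient2(low = "brown", mid = "white", high = "darkgreen",
                       midpoint = 0, na.value = "grey90") +
  coord_quickmap() +
  labs(fill = "Net forest conversion (ha)")

# Optional interactivity, if plotly is installed:
# plotly::ggplotly(p)
```

A diverging fill scale centred on zero suits this variable because net forest conversion can be negative (loss) or positive (gain). Using `left_join()` rather than base `merge()` avoids reordering the polygon vertices, which would otherwise distort the outlines.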
Approach for question 2
In this plot we intend to illustrate the trend in soybean production in Brazil over the years. Total production is calculated by adding together the three categories of soybean use. To do this, we will use the TidyTuesday data-set, which was sourced from Our World in Data, and derive a new column showing the total production of soybean. We will focus primarily on the `soybean_use` data, which comprises the columns (variables) `human_food`, `animal_feed` and `processed`. Using `dplyr`, we will filter the `entity` column to keep only Brazil, since the variable includes various countries and continents. As with Question 1, data cleaning and preparation will be performed to account for missing data.
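The filtering and totalling step might be sketched as follows. `soy_toy` is a made-up stand-in for `soybean_use`, `total_use` is a hypothetical column name, and treating missing components as zero is an assumption that would need to be checked against the real data.

```r
library(dplyr)

# Toy stand-in for `soybean_use` (values are made up)
soy_toy <- data.frame(
  entity      = c("Brazil", "Brazil", "Argentina", "South America"),
  year        = c(1990, 1991, 1990, 1990),
  human_food  = c(10, 12, 5, 30),
  animal_feed = c(NA, 3, 1, 8),
  processed   = c(100, NA, 40, 300),
  stringsAsFactors = FALSE
)

brazil_soy <- soy_toy %>%
  filter(entity == "Brazil") %>%
  # Assumption: a missing component contributes 0 to the total
  mutate(total_use = rowSums(cbind(human_food, animal_feed, processed),
                             na.rm = TRUE))

print(brazil_soy)
```

Because `animal_feed` and `processed` have many missing values, summing with `na.rm = TRUE` can understate the total in some years; an alternative is to drop years in which any component is missing.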
The change in production will then be visualized as a time series plot (line and point graph) built with ggplot2's `geom_line()` and `geom_point()` functions. To evaluate the overall trend in both, we will also compare and correlate changes in soybean production with rates of deforestation and afforestation. This comparison will use the 1990–2013 period, which the data-set covers well. We might also use other tables in the data-set to look for further correlations. To relate soybean output to forest area, we may also use a bubble plot to observe how the forest area changed specifically in Brazil.
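As a rough illustration of the time series plot and the correlation step, consider the sketch below. The `brazil_toy` values are invented, and pairing soybean totals with `net_forest_conversion` by year is only one possible way to set up the comparison.

```r
library(ggplot2)

# Made-up yearly values for Brazil, for illustration only
brazil_toy <- data.frame(
  year = c(1990, 2000, 2010),
  total_use = c(15, 30, 60),
  net_forest_conversion = c(-300, -400, -150)
)

# Line and point time series of total soybean use
p_soy <- ggplot(brazil_toy, aes(x = year, y = total_use)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Total soybean use",
       title = "Soybean use in Brazil (toy data)")

# Pearson correlation between soybean use and net forest conversion
soy_forest_cor <- cor(brazil_toy$total_use,
                      brazil_toy$net_forest_conversion)
```

Since the `forest` table has only a few reporting years, any correlation computed this way rests on very few points and is best read as descriptive rather than as evidence of a causal effect.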
Variables of focus for both questions:
Variable | Description | Source Data-set |
---|---|---|
entity | Country | forest, forest_area and soybean_use |
code | Country code | forest, forest_area and soybean_use |
year | Year | forest, forest_area and soybean_use |
net_forest_conversion | Net forest conversion in hectares | forest |
forest_area | Percent of global forest area | forest_area |
human_food | Use for human food (tempeh, tofu, etc.) | soybean_use |
animal_feed | Use for animal feed | soybean_use |
processed | Processed into vegetable oil, biofuel, processed animal feed | soybean_use |
These are our planned approaches for exploring and answering the questions we have posed. Parts of the approach might change in the final project.