
HW 01 - Global plastic waste

Plastic pollution is a major and growing problem, negatively affecting oceans and wildlife health. Our World in Data has a lot of great data at various levels including globally, per country, and over time. For this individual homework we focus on data from 2010.

Additionally, National Geographic ran a data visualization communication contest on plastic waste as seen here.

Learning goals

Visualizing numerical and categorical data and interpreting visualizations
Recreating visualizations
Getting more practice with R, RStudio, Git, and GitHub

Getting started

Go to the course GitHub organization and locate your assignment repo, which should be named hw1-plastic-waste-YOUR_GITHUB_USERNAME.

Grab the URL of the repo, and clone it in RStudio. Refer to Happy Git and GitHub for the useR if you would like to see step-by-step instructions for cloning a repo into an RStudio project.

First, open the R Markdown document hw1.Rmd and Knit it. Make sure it compiles without errors. The output will be an HTML (.html) file with the same name.

Packages

We’ll use the tidyverse package for this analysis. Run the following code in the Console to load this package.

library(tidyverse)

Data

The dataset for this assignment can be found as a csv file in the data folder of your repository. You can read it in using the following.

plastic_waste <- read_csv("data/plastic-waste.csv")

The variable descriptions are as follows:

code: 3-letter country code
entity: Country name
continent: Continent name
year: Year
gdp_per_cap: GDP per capita constant 2011 international $, rate
plastic_waste_per_cap: Amount of plastic waste per capita in kg/day
mismanaged_plastic_waste_per_cap: Amount of mismanaged plastic waste per capita in kg/day
mismanaged_plastic_waste: Tonnes of mismanaged plastic waste
coastal_pop: Number of individuals living on/near coast
total_pop: Total population according to Gapminder

Warm up

Recall that RStudio is divided into four panes. Without looking, can you name them all and briefly describe their purpose?
Verify that the dataset has loaded into the Environment. How many observations are in the dataset? Clicking on the dataset in the Environment will allow you to inspect it more carefully. Alternatively, you can type View(plastic_waste) into the Console to do this.

Have a quick look at the data and notice that there are cells taking the value NA. What does this mean?

Hint: If you’re not sure, run the command ?NA, which will lead you to the documentation.
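As a minimal sketch of how NA behaves in R, the snippet below uses a small made-up vector rather than the homework data. The name waste and its values are hypothetical and only serve the illustration.

```r
# Hypothetical values for illustration only; not taken from plastic-waste.csv
waste <- c(0.10, NA, 0.25, 0.40, NA)

# is.na() flags missing entries; summing the flags counts them
n_missing <- sum(is.na(waste))

# Most summaries return NA when any input is missing ...
mean_with_na <- mean(waste)

# ... unless the missing values are dropped explicitly with na.rm = TRUE
mean_without_na <- mean(waste, na.rm = TRUE)

print(n_missing)
print(mean_with_na)
print(mean_without_na)
```

On the real data, something like colSums(is.na(plastic_waste)) should give the number of missing cells in each column.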

Exercises

Let’s start by taking a look at the distribution of plastic waste per capita in 2010.

ggplot(data = plastic_waste, aes(x = plastic_waste_per_cap)) +
  geom_histogram(binwidth = 0.2)
## Warning: Removed 51 rows containing non-finite values (stat_bin).

One country stands out as an unusual observation at the top of the distribution. One way of identifying this country is to filter the data for countries where plastic waste per capita is greater than 3.5 kg per person per day.

plastic_waste %>%
  filter(plastic_waste_per_cap > 3.5)
## # A tibble: 1 x 10
##   code  entity   continent   year gdp_per_cap plastic_waste_p~ mismanaged_plast~
##   <chr> <chr>    <chr>      <dbl>       <dbl>            <dbl>             <dbl>
## 1 TTO   Trinida~ North Ame~  2010      31261.              3.6              0.19
## # ... with 3 more variables: mismanaged_plastic_waste <dbl>, coastal_pop <dbl>,
## #   total_pop <dbl>

Did you expect this result? You might consider doing some research on Trinidad and Tobago to see why plastic waste per capita is so high there, or whether this is a data error.
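Filtering on a threshold requires reading the cutoff off the histogram first. A possible alternative, sketched here on made-up rows (toy_waste and its values are hypothetical, not the real dataset), is to ask dplyr directly for the row with the largest value using slice_max().

```r
library(dplyr)  # loaded automatically by library(tidyverse)

# Made-up rows for illustration; not the real dataset
toy_waste <- tibble(
  entity = c("A", "B", "C", "D"),
  plastic_waste_per_cap = c(0.12, 0.30, NA, 1.50)
)

# slice_max() keeps the n rows with the largest values of a column;
# rows with NA are only used if there are too few non-missing rows
top_country <- toy_waste %>%
  slice_max(plastic_waste_per_cap, n = 1)

print(top_country)
```

The same call on plastic_waste should return the country at the top of the distribution without hard-coding 3.5.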

Q1. Plot, using histograms, the distribution of plastic waste per capita faceted by continent. What can you say about how the continents compare to each other in terms of their plastic waste per capita?

NOTE: From this point onwards the plots and the output of the code are not displayed in the homework instructions, but you can and should run the code and view the results yourself.

# Your code below:
ggplot(data = plastic_waste, aes(x = plastic_waste_per_cap)) +
  geom_histogram(binwidth = 0.2) + 
  facet_wrap(~continent)
## Warning: Removed 51 rows containing non-finite values (stat_bin).

*The histograms look broadly similar across continents, but there are some differences. North America appears to generate a lot of plastic waste per capita, while Oceania and South America generate relatively little.
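A visual comparison of faceted histograms can be backed up with a numeric summary per continent. The sketch below assumes a toy tibble with made-up continents "X" and "Y"; the column names match the homework data, but the values do not come from it.

```r
library(dplyr)  # loaded automatically by library(tidyverse)

# Made-up values for illustration; "X" and "Y" are placeholder continents
toy_waste <- tibble(
  continent = c("X", "X", "Y", "Y", "Y"),
  plastic_waste_per_cap = c(0.10, 0.30, 0.20, NA, 0.60)
)

# One row per continent: number of non-missing countries and their median
continent_summary <- toy_waste %>%
  group_by(continent) %>%
  summarise(
    n_countries = sum(!is.na(plastic_waste_per_cap)),
    median_waste = median(plastic_waste_per_cap, na.rm = TRUE)
  )

print(continent_summary)
```

Replacing toy_waste with plastic_waste should give medians that can be checked against what the histograms suggest.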

Another way of visualizing numerical data is using density plots.

ggplot(data = plastic_waste, aes(x = plastic_waste_per_cap)) +
  geom_density()
## Warning: Removed 51 rows containing non-finite values (stat_density).

We can also compare distributions across continents by coloring the density curves by continent.

ggplot(
  data = plastic_waste,
  mapping = aes(
    x = plastic_waste_per_cap,
    color = continent
  )
) +
  geom_density()
## Warning: Removed 51 rows containing non-finite values (stat_density).

The resulting plot may be a little difficult to read, so let’s also fill the curves in with colors as well.

ggplot(
  data = plastic_waste,
  mapping = aes(
    x = plastic_waste_per_cap,
    color = continent,
    fill = continent
  )
) +
  geom_density()
## Warning: Removed 51 rows containing non-finite values (stat_density).

The overlapping colors make it difficult to tell what’s happening with the distributions in continents plotted first, and hence covered by continents plotted over them. We can change the transparency level of the fill color to help with this. The alpha argument takes values between 0 and 1: 0 is completely transparent and 1 is completely opaque. There is no way to tell what value will work best, so you just need to try a few.

ggplot(
  data = plastic_waste,
  mapping = aes(
    x = plastic_waste_per_cap,
    color = continent,
    fill = continent
  )
) +
  geom_density(alpha = 0.7)
## Warning: Removed 51 rows containing non-finite values (stat_density).

This still doesn’t look great…

Q2. Recreate the density plots above using a different (lower) alpha level that works better for displaying the density curves for all continents.

# Your code below:
ggplot(
  data = plastic_waste,
  mapping = aes(
    x = plastic_waste_per_cap,
    color = continent,
    fill = continent
  )
) +
  geom_density(alpha = 0.2)
## Warning: Removed 51 rows containing non-finite values (stat_density).

*After lowering alpha, which makes the fills more transparent, we can see the density curves of all continents more clearly than before.

Q3. Describe why we defined the color and fill of the curves by mapping aesthetics of the plot but we defined the alpha level as a characteristic of the plotting geom.

[Color and fill are mapped to the continent variable, so they vary with the data and belong in the aesthetic mapping. Alpha, by contrast, is a fixed transparency setting applied to every curve, so we set it separately as a characteristic of the plotting geom.]
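The difference between mapping and setting can be inspected in the data that ggplot2 computes for a layer. This is a rough sketch on made-up values (toy_waste and the continents "X" and "Y" are hypothetical); it relies on layer_data(), which returns the computed data for one layer of a plot.

```r
library(ggplot2)  # loaded automatically by library(tidyverse)

# Made-up values for illustration; not the real dataset
toy_waste <- data.frame(
  continent = rep(c("X", "Y"), each = 4),
  plastic_waste_per_cap = c(0.10, 0.15, 0.20, 0.30, 0.25, 0.35, 0.40, 0.60)
)

# fill is MAPPED inside aes(), so it depends on the continent variable;
# alpha is SET outside aes(), so it is one constant for the whole layer
p <- ggplot(toy_waste, aes(x = plastic_waste_per_cap, fill = continent)) +
  geom_density(alpha = 0.3)

# Computed data for the first (and only) layer
built <- layer_data(p, 1)

# The mapped aesthetic should take one value per continent ...
n_fill_colors <- length(unique(built$fill))

# ... while the set aesthetic should be the same everywhere
alpha_values <- unique(built$alpha)

print(n_fill_colors)
print(alpha_values)
```

Writing aes(alpha = 0.3) instead would treat 0.3 as data to be scaled and add a legend for it, which is usually not what is wanted.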

🧶 ✅ ⬆️ Now is a good time to knit your document and commit and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

And yet another way to visualize this relationship is using side-by-side box plots.

ggplot(
  data = plastic_waste,
  mapping = aes(
    x = continent,
    y = plastic_waste_per_cap
  )
) +
  geom_boxplot()
## Warning: Removed 51 rows containing non-finite values (stat_boxplot).

Q4. Convert your side-by-side box plots from the previous task to violin plots. What do the violin plots reveal that box plots do not? What features are apparent in the box plots but not in the violin plots?

Remember: We use geom_point() to make scatterplots.

# Your code below:
ggplot(
  data = plastic_waste,
  mapping = aes(
    x = continent,
    y = plastic_waste_per_cap
  )
) +
  geom_violin()
## Warning: Removed 51 rows containing non-finite values (stat_ydensity).

*The violin plot shows the continuous shape of each distribution, which the box plot does not, so it is easy to see where the density is concentrated. The box plot, on the other hand, clearly shows the outliers, medians, ranges, and variability.
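Since the two plot types show complementary features, one option is to draw a narrow box plot on top of each violin. The sketch below uses made-up values (toy_waste, "X", and "Y" are hypothetical) and is only one way to combine the layers.

```r
library(ggplot2)  # loaded automatically by library(tidyverse)

# Made-up values for illustration; continent "X" has one unusually large value
toy_waste <- data.frame(
  continent = rep(c("X", "Y"), each = 6),
  plastic_waste_per_cap = c(
    0.05, 0.10, 0.12, 0.15, 0.20, 0.90,
    0.20, 0.25, 0.30, 0.35, 0.40, 0.45
  )
)

# Violin for the shape of the distribution, narrow box plot for
# the median, quartiles, and outliers
violin_box <- ggplot(
  data = toy_waste,
  mapping = aes(x = continent, y = plastic_waste_per_cap)
) +
  geom_violin() +
  geom_boxplot(width = 0.1, outlier.colour = "red")

# Box plot statistics computed for the second layer, one row per continent
box_stats <- layer_data(violin_box, 2)
print(nrow(box_stats))
```

Printing violin_box in RStudio draws the combined plot; the width argument keeps the box from hiding the violin.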

Q5. Visualize the relationship between plastic waste per capita and mismanaged plastic waste per capita using a scatterplot. Describe the relationship.

# Your R code below:
ggplot(
  data = plastic_waste,
  mapping = aes(
    x = mismanaged_plastic_waste_per_cap,
    y = plastic_waste_per_cap
  )
) +
  geom_point()
## Warning: Removed 51 rows containing missing values (geom_point).

*As the scatterplot shows, plastic waste per capita and mismanaged plastic waste per capita have similar distributions.

Q6. Color the points in the scatterplot by continent. Does there seem to be any clear distinctions between continents with respect to how plastic waste per capita and mismanaged plastic waste per capita are associated?

# Your R code below:
ggplot(
  data = plastic_waste,
  mapping = aes(
    x = mismanaged_plastic_waste_per_cap,
    y = plastic_waste_per_cap,
    color = continent
  )
) +
  geom_point()
## Warning: Removed 51 rows containing missing values (geom_point).

*No continent stands out as particularly distinct; plastic waste per capita and mismanaged plastic waste per capita show similar distributions.

Q7. Visualize the relationship between plastic waste per capita and total population as well as plastic waste per capita and coastal population. You will need to make two separate plots. Do either of these pairs of variables appear to be more strongly linearly associated?

# Your R code for the first plot below:
ggplot(
  data = plastic_waste,
  mapping = aes(
    x = plastic_waste_per_cap,
    y = total_pop
  )
) +
  geom_point()
## Warning: Removed 61 rows containing missing values (geom_point).

# Your R code for the second plot below:
ggplot(
  data = plastic_waste,
  mapping = aes(
    x = plastic_waste_per_cap,
    y = coastal_pop
  )
) +
  geom_point()
## Warning: Removed 51 rows containing missing values (geom_point).

*The linear association looks about the same in the two plots, but the points are more scattered in the plot of plastic waste per capita against coastal population.
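Judging the strength of a linear association by eye is hard, so a correlation coefficient can serve as a rough check. The sketch below uses a made-up data frame (toy and its values are hypothetical); use = "complete.obs" drops rows with missing values before computing the correlation.

```r
# Made-up values for illustration; not the real dataset
toy <- data.frame(
  plastic_waste_per_cap = c(0.1, 0.2, 0.3, 0.4, NA),
  total_pop = c(5e6, 2e7, 1e6, 8e7, 3e6),
  coastal_pop = c(1e6, 4e6, 9e5, NA, 2e6)
)

# Pearson correlation, ignoring rows where either variable is NA
cor_total <- cor(
  toy$plastic_waste_per_cap, toy$total_pop,
  use = "complete.obs"
)
cor_coastal <- cor(
  toy$plastic_waste_per_cap, toy$coastal_pop,
  use = "complete.obs"
)

print(cor_total)
print(cor_coastal)
```

A value closer to 1 or -1 indicates a stronger linear association. The coefficient only captures linear patterns, so it complements the scatterplots rather than replacing them.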

🧶 ✅ ⬆️ Now is another good time to knit your document and commit and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Q8. Recreate the following plot, and interpret what you see in context of the data.

Hint: The x-axis is a calculated variable. One country with plastic waste per capita over 3 kg/day has been filtered out. And the data are not only represented with points on the plot but also a smooth curve. The term “smooth” should help you pick which geom to use.

Smoothed graph.

# Your code below:
# Drop the one country with plastic waste per capita over 3 kg/day
plastic_waste_recreate <- plastic_waste %>%
  filter(plastic_waste_per_cap < 3)

# x and y are shared by both layers; color applies to the points only
ggplot(
  data = plastic_waste_recreate,
  mapping = aes(
    x = coastal_pop / total_pop,
    y = plastic_waste_per_cap
  )
) +
  geom_point(mapping = aes(color = continent)) +
  geom_smooth() +
  labs(
    title = "Plastic waste vs. Coastal population proportion",
    subtitle = "by continent",
    x = "Coastal population proportion (coastal_pop / total_pop)",
    y = "Plastic waste per capita"
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).

*As the graph shows, there is a roughly linear association between the coastal population proportion (coastal_pop / total_pop) and plastic waste per capita.
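The calculated x-axis variable can also be stored as its own column before plotting, which keeps the plotting code shorter. This is a minimal sketch on made-up rows; toy_waste, its values, and the column name coastal_pop_prop are hypothetical.

```r
library(dplyr)  # loaded automatically by library(tidyverse)

# Made-up rows for illustration; country "B" plays the role of the outlier
toy_waste <- tibble(
  entity = c("A", "B", "C"),
  coastal_pop = c(50, 200, 30),
  total_pop = c(100, 400, 120),
  plastic_waste_per_cap = c(0.2, 3.6, 0.1)
)

# Remove the outlier, then add the proportion as a new column
toy_prop <- toy_waste %>%
  filter(plastic_waste_per_cap < 3) %>%
  mutate(coastal_pop_prop = coastal_pop / total_pop)

print(toy_prop)
```

With such a column in place, the plot could map x = coastal_pop_prop directly instead of repeating the division inside aes().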

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards and review the md document on GitHub to make sure you’re happy with the final state of your work.

Once you’re done, check to make sure your latest changes are on GitHub and that you have a green indicator for the automated check for your R Markdown document knitting.

IS 407 HW 01