Hot Ones Episodes

Proposal

if (!require("pacman"))
  install.packages("pacman")
pacman::p_load(tidyverse, tidytuesdayR)

Dataset

episodes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/episodes.csv')
sauces <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/sauces.csv')
seasons <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/seasons.csv')

Dataset Name - Hot Ones Episodes

Hot Ones is a popular YouTube series which was started on 12th March 2015. The concept behind the show was to serve the guests with hot sauces which have varying heat and spiciness numbers on the scoville scale. The show often features rare and unique hot sauces from around the world, the spiciness of these sauces is measured by the scoville scale.

Best Last Dab Reactions of All Time Video - In case you are unfamiliar with the show Hot Ones

https://youtu.be/-at3oShfXH8?si=iYmT2YOsBiBPwFap

Hot Ones Episodes is a data set comes from the TidyTuesday repository. The Data Set that we choose contains three tables or three sets of data.

  1. The first datset gives detailed information about the episodes

    1. It contains the data about each episode until July 2023. The data set contains 300 observations.

    2. The fields and the explanation of the episode data set are as follows

    episodes.csv
    Variable Name Description
    season The season number.
    episode_overall The overall count of episode, from 1-300
    episode_season The count of this episode within this season
    title The title of the episode
    original_release The date on which the episode was originally available on YouTube
    guest The name of the guest
    guest_appearance_number The number of appearances by this guest so far as of this date
    finished Whether the guest finished trying all of the sauces
  2. The second data set contains the data related to the sauces

    1. Mainly it gives the description about the seasons in which the sauces were used and the Scoville scale reading (https://en.wikipedia.org/wiki/Scoville_scale)
    2. It contains 210 observations
    3. The variables used and the interpretation of the data table is as follows
    sauces.csv
    Variable Name Description
    season The season number
    sauce_number The number of this sauce, from 1 (least hot) to 10 (hottest)
    sauce_name The name of the sauce
    scoville The rating of the sauce in Scoville heat units.
  3. The Third data set contains the data related to seasons

    1. It mainly shows the episodes per season, release date and the end date of the season. The data set contains record of 21 observations.
    2. The variables used and their descriptions are as follows
    seasons.csv
    Variable Name Description
    season The season number
    episodes The count of episodes in this season
    note Notes about this season
    original_release The date of the first episode in this season
    last_release The date of the last episode of this season (if that episode has aired at the time of scraping)

Reason to choose the Dataset

The dataset represents an intriguing and distinctive topic. It is neither too small for meaningful analysis nor excessively large to hinder us from drawing conclusions. It pertains to a subject that not only our team members are familiar with but is also widely recognized. Having watched the episodes and gaining insights into the data collection process, we found it easier to connect with the dataset and analyze it more effectively.

Questions

Q1 ) How does the average scoville score of the season affect the number of guests who did not finish trying all the sauces?

Q2 ) Are there any sauces that are used in more than one season? Visualize which sauces and in which seasons?

Analysis plan

  • For the first question, we plan on creating the variable “unfinished_count” which represents the number of guests that failed to try all the sauces for each season. Then we create another variable “average_scoville” which is the average of the variable “scoville” for each season. Then we can make a scatter plot in the first layer with “average_scoville” on the x-axis and “unfinished_count” on the y-axis. We will also add a geom_smooth in the second layer to analyse the trend. Here we calculate the “unfinished_count” variable from “episodes.csv” and the “average_scoville” variable from “sauces.csv”. We are planning to find out the season which has the most unfinished contestants and analyse the sauce score for that. This way, we will be able to tell if the average scoville score have any impact on the number of contestants that failed to try all the sauces.

  • For the second question, we plan on extracting the data of the sauces that appeared more than once in the series and plot a line chart to visualize which season has more number of repetitive sauces. In other words, this will show the sauces that were used multiple times throughout the seasons. We want to do this because every season they change most of the lineup of hot sauces used and we would like to see which hot sauces remained in use in multiple seasons. By doing this, we can see if the scoville of the hot sauces reused was higher or lower than the average scoville of the season.