Fuel economy data

1.0.1 Exercises

library(dplyr)
library(ggplot2)

List five functions that you could use to get more information about the mpg dataset.

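summary, ncol and is.na can be applied to the whole data frame; table, mean, sd, var, quantile, max, min and range are applied to individual columns (e.g. mean(mpg$hwy)) …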
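As a rough illustration, the calls below are a few standard ways to inspect a data frame such as mpg. Which five you pick is a matter of taste; this sketch only assumes that ggplot2 is installed.

```r
library(ggplot2)  # provides the mpg dataset

dim(mpg)          # number of rows and columns
names(mpg)        # column names
str(mpg)          # compact structure: type and first values of each column
head(mpg)         # first six rows
summary(mpg)      # per-column summaries
summary(mpg$hwy)  # column-level functions take a single column
# ?mpg opens the help page describing each variable
```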

How can you find out what other datasets are included with ggplot2?

data(package = "ggplot2")

Data sets in package ‘ggplot2’:

diamonds                    Prices of over 50,000 round cut diamonds
economics                   US economic time series
economics_long              US economic time series
faithfuld                   2d density estimate of Old Faithful data
luv_colours                 'colors()' in Luv space
midwest                     Midwest demographics
mpg                         Fuel economy data from 1999 to 2008 for 38 popular
                            models of cars
msleep                      An updated and expanded version of the mammals sleep
                            dataset
presidential                Terms of 12 presidents from Eisenhower to Trump
seals                       Vector field of seal movements
txhousing                   Housing sales in TX
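If the list is needed programmatically rather than printed, the value returned by data() can be inspected directly. A minimal sketch, assuming ggplot2 is installed (the variable name ds is only illustrative):

```r
# data() returns an object whose `results` component is a character matrix
# with the columns "Package", "LibPath", "Item" and "Title"
ds <- data(package = "ggplot2")$results

ds[, "Item"]             # dataset names only
ds[, c("Item", "Title")] # names with their descriptions
"mpg" %in% ds[, "Item"]  # check whether a given dataset is included
```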

Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance travelled with fixed amount of fuel). How could you convert cty and hwy into the European standard of l/100km?

# 1 US mpg corresponds to about 235.215 l/100km, so l/100km = 235.215 / mpg
mpg %>% mutate(cty_l100km = 235.215 / cty, hwy_l100km = 235.215 / hwy)
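The constant of roughly 235 in the conversion above follows from the unit definitions. The sketch below derives it, assuming the figures are in US gallons; the helper name mpg_to_l100km is hypothetical and not part of any package.

```r
# Exact unit definitions
litres_per_us_gallon <- 3.785411784
km_per_mile <- 1.609344

# l/100km = 100 * (litres per gallon) / (km per mile * miles per gallon)
mpg_to_l100km <- function(miles_per_gallon) {
  100 * litres_per_us_gallon / (km_per_mile * miles_per_gallon)
}

mpg_to_l100km(1)          # the conversion constant, about 235.21
mpg_to_l100km(c(18, 29))  # vectorised, so it also works inside mutate()
```

Because the relationship is a reciprocal, a higher mpg value always maps to a lower l/100km value.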

Which manufacturer has the most models in this dataset?

mpg %>% count(manufacturer) %>% arrange(desc(n))

## # A tibble: 15 × 2
##    manufacturer     n
##    <chr>        <int>
##  1 dodge           37
##  2 toyota          34
##  3 volkswagen      27
##  4 ford            25
##  5 chevrolet       19
##  6 audi            18
##  7 hyundai         14
##  8 subaru          14
##  9 nissan          13
## 10 honda            9
## 11 jeep             8
## 12 pontiac          5
## 13 land rover       4
## 14 mercury          4
## 15 lincoln          3

mpg |>
  group_by(manufacturer) |>
  summarise(n = n()) |> 
  filter(n == max(n))

## # A tibble: 1 × 2
##   manufacturer     n
##   <chr>        <int>
## 1 dodge           37

tapply(mpg$manufacturer, INDEX = mpg$manufacturer, FUN = length) |> 
  # which.max()
  sort() |> 
  tail(1)

## dodge 
##    37
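The commented-out which.max() hints at a shorter base R variant. A possible sketch, assuming ggplot2 is installed:

```r
library(ggplot2)  # provides the mpg dataset

# Count the rows per manufacturer, then pick the largest count
counts <- table(mpg$manufacturer)
counts[which.max(counts)]
```

Note that which.max() returns only the first maximum, so a tie between manufacturers would go unnoticed; the filter(n == max(n)) approach keeps every tied row.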

Which model has the most variations?

mpg %>% count(model) %>% arrange(desc(n))

## # A tibble: 38 × 2
##    model                   n
##    <chr>               <int>
##  1 caravan 2wd            11
##  2 ram 1500 pickup 4wd    10
##  3 civic                   9
##  4 dakota pickup 4wd       9
##  5 jetta                   9
##  6 mustang                 9
##  7 a4 quattro              8
##  8 grand cherokee 4wd      8
##  9 impreza awd             8
## 10 a4                      7
## # ℹ 28 more rows

Does your answer change if you remove the redundant specification of drive train (e.g. “pathfinder 4wd”, “a4 quattro”) from the model name?

mpg %>%
  mutate(model = sub(" 4wd", "", model),
         model = sub(" awd", "", model),
         model = sub(" 2wd", "", model),
         model = sub(" quattro", "", model)
  ) %>% count(model) %>% arrange(desc(n))

## # A tibble: 37 × 2
##    model               n
##    <chr>           <int>
##  1 a4                 15
##  2 caravan            11
##  3 ram 1500 pickup    10
##  4 civic               9
##  5 dakota pickup       9
##  6 jetta               9
##  7 mustang             9
##  8 grand cherokee      8
##  9 impreza             8
## 10 camry               7
## # ℹ 27 more rows