9  Exploratory Data Analysis

Author: Amiya Ranjan Bhowmick
Affiliation: Institute of Chemical Technology, Mumbai

9.1 Introduction

This section is primarily devoted to the exploration of various datasets using R. In the previous few lectures, we primarily studied various probability distributions and their properties, which are theoretical in nature. Here, students explore different datasets available in the datasets package in R, giving them exposure to real-world data: they discuss the datasets among themselves and read the help files to gain detailed information about each one. Some examples of datasets explored by the students are provided in this section, and some testing problems are also discussed.

9.2 The ChickWeight dataset

The \(\texttt{datasets}\) package is part of base R. Let us get into a real data set: we consider the ChickWeight data set available in R. The body weights of the chicks were measured at birth and every second day thereafter until day 20; they were also measured on day 21. There were four groups of chicks on different protein diets.

# datasets::ChickWeight
head(ChickWeight)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
nrow(ChickWeight)
[1] 578
# help("ChickWeight")
class(ChickWeight)
[1] "nfnGroupedData" "nfGroupedData"  "groupedData"    "data.frame"    
names(ChickWeight)
[1] "weight" "Time"   "Chick"  "Diet"  
summary(ChickWeight)
     weight           Time           Chick     Diet   
 Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
 1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
 Median :103.0   Median :10.00   20     : 12   3:120  
 Mean   :121.8   Mean   :10.72   10     : 12   4:118  
 3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
 Max.   :373.0   Max.   :21.00   19     : 12          
                                 (Other):506          

The function class is important for understanding different data types. In data analysis, it is often useful for checking that the correct input type is passed to specific functions in R.

class(ChickWeight$weight)
[1] "numeric"
class(ChickWeight$Time)
[1] "numeric"
class(ChickWeight$Chick)
[1] "ordered" "factor" 
class(ChickWeight$Diet)
[1] "factor"

9.2.1 Checking for missing values

head(is.na(ChickWeight))   # check for missing values; is.na() works element-wise on a data.frame
  weight  Time Chick  Diet
1  FALSE FALSE FALSE FALSE
2  FALSE FALSE FALSE FALSE
3  FALSE FALSE FALSE FALSE
4  FALSE FALSE FALSE FALSE
5  FALSE FALSE FALSE FALSE
6  FALSE FALSE FALSE FALSE
sum(is.na(ChickWeight))    # total number of missing values
[1] 0

9.2.2 The \(\texttt{is.na}\) function

x = c(1, 2, 3, 4, NA, 5, 4:1000)   # a vector with one missing value at position 5
print(x)
is.na(x)            # logical vector: TRUE wherever x is NA
sum(is.na(x))       # number of missing values (here 1)
which(is.na(x))     # positions of the missing values (here 5)

9.2.3 Reading columns in a data.frame

ChickWeight[ ,"Chick"]
  [1] 1  1  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  3 
 [26] 3  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4  5  5 
 [51] 5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6  6  6  6  7  7  7 
 [76] 7  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9 
[101] 9  9  9  9  9  9  9  10 10 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11
[126] 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13
[151] 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 14 14 15 15 15 15 15 15 15 15
[176] 16 16 16 16 16 16 16 17 17 17 17 17 17 17 17 17 17 17 17 18 18 19 19 19 19
[201] 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21
[226] 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23
[251] 23 23 23 23 23 23 24 24 24 24 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25
[276] 25 25 25 25 25 26 26 26 26 26 26 26 26 26 26 26 26 27 27 27 27 27 27 27 27
[301] 27 27 27 27 28 28 28 28 28 28 28 28 28 28 28 28 29 29 29 29 29 29 29 29 29
[326] 29 29 29 30 30 30 30 30 30 30 30 30 30 30 30 31 31 31 31 31 31 31 31 31 31
[351] 31 31 32 32 32 32 32 32 32 32 32 32 32 32 33 33 33 33 33 33 33 33 33 33 33
[376] 33 34 34 34 34 34 34 34 34 34 34 34 34 35 35 35 35 35 35 35 35 35 35 35 35
[401] 36 36 36 36 36 36 36 36 36 36 36 36 37 37 37 37 37 37 37 37 37 37 37 37 38
[426] 38 38 38 38 38 38 38 38 38 38 38 39 39 39 39 39 39 39 39 39 39 39 39 40 40
[451] 40 40 40 40 40 40 40 40 40 40 41 41 41 41 41 41 41 41 41 41 41 41 42 42 42
[476] 42 42 42 42 42 42 42 42 42 43 43 43 43 43 43 43 43 43 43 43 43 44 44 44 44
[501] 44 44 44 44 44 44 45 45 45 45 45 45 45 45 45 45 45 45 46 46 46 46 46 46 46
[526] 46 46 46 46 46 47 47 47 47 47 47 47 47 47 47 47 47 48 48 48 48 48 48 48 48
[551] 48 48 48 48 49 49 49 49 49 49 49 49 49 49 49 49 50 50 50 50 50 50 50 50 50
[576] 50 50 50
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < ... < 48
ChickWeight[ ,"Time"]
  [1]  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0
 [26]  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2
 [51]  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4
 [76]  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20  0  2  4  6  8
[101] 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10
[126] 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12
[151] 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14
[176]  0  2  4  6  8 10 12  0  2  4  6  8 10 12 14 16 18 20 21  0  2  0  2  4  6
[201]  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8
[226] 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10
[251] 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12
[276] 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14
[301] 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16
[326] 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18
[351] 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20
[376] 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21
[401]  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0
[426]  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2
[451]  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4
[476]  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6
[501]  8 10 12 14 16 18  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12
[526] 14 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14
[551] 16 18 20 21  0  2  4  6  8 10 12 14 16 18 20 21  0  2  4  6  8 10 12 14 16
[576] 18 20 21
ChickWeight[ ,"weight"]
  [1]  42  51  59  64  76  93 106 125 149 171 199 205  40  49  58  72  84 103
 [19] 122 138 162 187 209 215  43  39  55  67  84  99 115 138 163 187 198 202
 [37]  42  49  56  67  74  87 102 108 136 154 160 157  41  42  48  60  79 106
 [55] 141 164 197 199 220 223  41  49  59  74  97 124 141 148 155 160 160 157
 [73]  41  49  57  71  89 112 146 174 218 250 288 305  42  50  61  71  84  93
 [91] 110 116 126 134 125  42  51  59  68  85  96  90  92  93 100 100  98  41
[109]  44  52  63  74  81  89  96 101 112 120 124  43  51  63  84 112 139 168
[127] 177 182 184 181 175  41  49  56  62  72  88 119 135 162 185 195 205  41
[145]  48  53  60  65  67  71  70  71  81  91  96  41  49  62  79 101 128 164
[163] 192 227 248 259 266  41  49  56  64  68  68  67  68  41  45  49  51  57
[181]  51  54  42  51  61  72  83  89  98 103 113 123 133 142  39  35  43  48
[199]  55  62  65  71  82  88 106 120 144 157  41  47  54  58  65  73  77  89
[217]  98 107 115 117  40  50  62  86 125 163 217 240 275 307 318 331  41  55
[235]  64  77  90  95 108 111 131 148 164 167  43  52  61  73  90 103 127 135
[253] 145 163 170 175  42  52  58  74  66  68  70  71  72  72  76  74  40  49
[271]  62  78 102 124 146 164 197 231 259 265  42  48  57  74  93 114 136 147
[289] 169 205 236 251  39  46  58  73  87 100 115 123 144 163 185 192  39  46
[307]  58  73  92 114 145 156 184 207 212 233  39  48  59  74  87 106 134 150
[325] 187 230 279 309  42  48  59  72  85  98 115 122 143 151 157 150  42  53
[343]  62  73  85 102 123 138 170 204 235 256  41  49  65  82 107 129 159 179
[361] 221 263 291 305  39  50  63  77  96 111 137 144 151 146 156 147  41  49
[379]  63  85 107 134 164 186 235 294 327 341  41  53  64  87 123 158 201 238
[397] 287 332 361 373  39  48  61  76  98 116 145 166 198 227 225 220  41  48
[415]  56  68  80  83 103 112 135 157 169 178  41  49  61  74  98 109 128 154
[433] 192 232 280 290  42  50  61  78  89 109 130 146 170 214 250 272  41  55
[451]  66  79 101 120 154 182 215 262 295 321  42  51  66  85 103 124 155 153
[469] 175 184 199 204  42  49  63  84 103 126 160 174 204 234 269 281  42  55
[487]  69  96 131 157 184 188 197 198 199 200  42  51  65  86 103 118 127 138
[505] 145 146  41  50  61  78  98 117 135 141 147 174 197 196  40  52  62  82
[523] 101 120 144 156 173 210 231 238  41  53  66  79 100 123 148 157 168 185
[541] 210 205  39  50  62  80 104 125 154 170 222 261 303 322  40  53  64  85
[559] 108 128 152 166 184 203 233 237  41  54  67  84 105 122 155 175 205 234
[577] 264 264
ChickWeight[ ,"Diet"]
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
[223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[334] 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[371] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[408] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[445] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[482] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[519] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[556] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4

9.2.4 Subsetting of data

  • We first collect the weight profile of the chick coded as 1, which was given diet type 1. The following code extracts the data and plots the profile; an index-free alternative is sketched after the plot.
w = ChickWeight[1:12 ,"weight"]
t = ChickWeight[1:12 ,"Time"]
plot(t,w, col = "red", pch = 2,
     xlab = "Time", ylab = "Weight (gm)",
     main = "Diet - I", cex = 1.3, lwd = 3)

We can also separate the data for the different diet types. The function \(\texttt{subset}\) is useful for doing so.

diet_1 = subset(ChickWeight, Diet == 1, select = c(weight, Time, Chick))
diet_2 = subset(ChickWeight, Diet == 2, select = c(weight, Time, Chick))
diet_3 = subset(ChickWeight, Diet == 3, select = c(weight, Time, Chick))
diet_4 = subset(ChickWeight, Diet == 4, select = c(weight, Time, Chick))
dim(diet_1)
[1] 220   3
dim(diet_2)
[1] 120   3
dim(diet_3)
[1] 120   3
dim(diet_4)
[1] 118   3

Understanding the summary of the data is an important part of any exploratory data analysis.

summary(ChickWeight)
     weight           Time           Chick     Diet   
 Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
 1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
 Median :103.0   Median :10.00   20     : 12   3:120  
 Mean   :121.8   Mean   :10.72   10     : 12   4:118  
 3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
 Max.   :373.0   Max.   :21.00   19     : 12          
                                 (Other):506          
summary(diet_1)
     weight            Time           Chick    
 Min.   : 35.00   Min.   : 0.00   13     : 12  
 1st Qu.: 57.75   1st Qu.: 4.00   9      : 12  
 Median : 88.00   Median :10.00   20     : 12  
 Mean   :102.65   Mean   :10.48   10     : 12  
 3rd Qu.:136.50   3rd Qu.:16.00   17     : 12  
 Max.   :305.00   Max.   :21.00   19     : 12  
                                  (Other):148  
summary(diet_2)
     weight           Time           Chick   
 Min.   : 39.0   Min.   : 0.00   24     :12  
 1st Qu.: 65.5   1st Qu.: 5.50   30     :12  
 Median :104.5   Median :11.00   22     :12  
 Mean   :122.6   Mean   :10.92   23     :12  
 3rd Qu.:163.0   3rd Qu.:16.50   27     :12  
 Max.   :331.0   Max.   :21.00   28     :12  
                                 (Other):48  
summary(diet_3)
     weight           Time           Chick   
 Min.   : 39.0   Min.   : 0.00   33     :12  
 1st Qu.: 67.5   1st Qu.: 5.50   37     :12  
 Median :125.5   Median :11.00   36     :12  
 Mean   :142.9   Mean   :10.92   31     :12  
 3rd Qu.:198.8   3rd Qu.:16.50   39     :12  
 Max.   :373.0   Max.   :21.00   38     :12  
                                 (Other):48  
summary(diet_4)
     weight            Time           Chick   
 Min.   : 39.00   Min.   : 0.00   45     :12  
 1st Qu.: 71.25   1st Qu.: 4.50   43     :12  
 Median :129.50   Median :10.00   41     :12  
 Mean   :135.26   Mean   :10.75   47     :12  
 3rd Qu.:184.75   3rd Qu.:16.00   49     :12  
 Max.   :322.00   Max.   :21.00   46     :12  
                                  (Other):46  

We can also look at the first few or last few observations of the data set.

head(diet_1, n =13)
   weight Time Chick
1      42    0     1
2      51    2     1
3      59    4     1
4      64    6     1
5      76    8     1
6      93   10     1
7     106   12     1
8     125   14     1
9     149   16     1
10    171   18     1
11    199   20     1
12    205   21     1
13     40    0     2
tail(diet_1, n = 10)
    weight Time Chick
211     54    4    20
212     58    6    20
213     65    8    20
214     73   10    20
215     77   12    20
216     89   14    20
217     98   16    20
218    107   18    20
219    115   20    20
220    117   21    20

9.2.5 Basic plotting functions

plot(weight ~ Time, data = diet_1, main = "Diet = 1")

head(diet_1$Chick)
[1] 1 1 1 1 1 1
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < ... < 48

In the above plot, the growth profiles of different chicks cannot be identified separately. We can instead draw the growth profile of each chick, with multiple profiles in the same plot: we create a single growth profile first (for chick 1) and add the others using the lines or points commands.

par(mfrow = c(1,1))
plot(weight ~ Time, data = subset(diet_1, Chick == 1), 
     col = 1, lwd = 2, type = "b", 
     ylim = c(min(diet_1$weight), max(diet_1$weight)),
     main = "Diet - I")
for(i in 2:20){
  lines(weight ~ Time, data = subset(diet_1, Chick == i), col = i,
        lwd = 2, type = "b")
}

It is evident from the picture that the variance associated with the weight measurements increases as time increases. Using the box plot, we can get a clearer understanding of the exact shape of the spread. Using par(mfrow = c(2,2)) we can obtain four plots in a single plot window.

par(mfrow = c(1,1))
boxplot(weight ~ Time, data = diet_1,
        main = "Diet = I", col = diet_1$Time,
        lwd = 2)

We can execute the same task for the other diet types as well; a combined four-panel display is sketched after these plots.

par(mfrow = c(1,3))
boxplot(weight ~ Time, data = diet_2,
        main = "Diet = 2", col = "grey",
        lwd = 1)
boxplot(weight ~ Time, data = diet_3,
        main = "Diet = 3", col = "grey",
        lwd = 1)
boxplot(weight ~ Time, data = diet_4,
        main = "Diet = 4", col = "grey",
        lwd = 1)
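As mentioned above, par(mfrow = c(2,2)) places four plots in a single window; a minimal sketch showing all four diet groups together:

par(mfrow = c(2,2))
boxplot(weight ~ Time, data = diet_1, main = "Diet = 1", col = "grey")
boxplot(weight ~ Time, data = diet_2, main = "Diet = 2", col = "grey")
boxplot(weight ~ Time, data = diet_3, main = "Diet = 3", col = "grey")
boxplot(weight ~ Time, data = diet_4, main = "Diet = 4", col = "grey")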

There are multiple ways to visualize the distribution of the data. The violin plot is also commonly used for this purpose; it is available by loading the package vioplot.

library(vioplot)
Loading required package: sm
Package 'sm', version 2.2-6.0: type help(sm) for summary information
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
par(mfrow = c(1,2))
vioplot(weight ~ Time, data = diet_1,
        main = "Diet = I", col = diet_1$Time,
        lwd = 1)
vioplot(weight ~ Time, data = diet_2,
        main = "Diet = 2", col = diet_2$Time,
        lwd = 1)

par(mfrow = c(1,2))
vioplot(weight ~ Time, data = diet_3,
        main = "Diet = 3", col = diet_3$Time,
        lwd = 1)
vioplot(weight ~ Time, data = diet_4,
        main = "Diet = 4", col = diet_4$Time,
        lwd = 1)

Shapes of data distributions

9.2.6 Comparing the final growth

diet_1_T21 = subset(diet_1, Time==21, select = c(weight))
head(diet_1_T21)
   weight
12    205
24    215
36    202
48    157
60    223
72    157
diet_2_T21 = subset(diet_2, Time==21, select = c(weight))
diet_3_T21 = subset(diet_3, Time==21, select = c(weight))
diet_4_T21 = subset(diet_4, Time==21, select = c(weight))

Let us create a new data set with only the final weight of the chicks. Note that the subset function returns a data.frame, not a vector; a one-step construction is sketched after the code.

weight = c(diet_1_T21$weight, diet_2_T21$weight,
           diet_3_T21$weight, diet_4_T21$weight)
Diet = c(rep(1, nrow(diet_1_T21)), rep(2, nrow(diet_2_T21)), rep(3, nrow(diet_3_T21)), rep(4, nrow(diet_4_T21)))
Diet = as.factor(Diet)         # store as factor
ChickWeight_T21 = data.frame(weight, Diet)
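The same data set can also be built in a single step by subsetting the full data at Time == 21; a small sketch (the name ChickWeight_T21_alt is ours):

# one-step alternative: keep only the day-21 rows of the full data
ChickWeight_T21_alt = subset(ChickWeight, Time == 21,
                             select = c(weight, Diet))
dim(ChickWeight_T21_alt)    # should match the 45 rows constructed above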

Let us make some informative plots of the distribution of weights on day 21 for the different dietary treatments.

par(mfrow = c(1,2))
boxplot(weight ~ Diet, data = ChickWeight_T21,
        lwd = 2, main = "Day = 21", col = 2:5)
vioplot(weight ~ Diet, data = ChickWeight_T21,
        lwd = 2, main = "Day = 21",
        col = 2:5)

Let \(\mu_{21}^{(j)}\), \(1\le j\le 4\), denote the mean weight on day 21 of chicks under dietary type \(j\in\{1,2,3,4\}\). We are interested in testing the null hypothesis \[H_0 \colon \mu_{21}^{(1)}=\mu_{21}^{(2)}=\mu_{21}^{(3)}=\mu_{21}^{(4)}\] against the alternative \(H_1\), which specifies that at least two of the means are not equal. This falls under the statistical exercise known as the analysis of variance.

out = aov(weight ~ Diet, data = ChickWeight_T21)
summary(out)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         3  57164   19055   4.655 0.00686 **
Residuals   41 167839    4094                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The small \(p\)-value \((0.00686)\) indicates that the null hypothesis is rejected at the \(5\%\) level of significance.
par(mfrow = c(2,2))
plot(out)

  • A natural question arises: which pairs of diets are different?
out_HSD = TukeyHSD(out)    # Tukey's honest significant differences
plot(out_HSD)

9.2.7 Role of assumptions

We now check the normality assumption for the weights of chicks under each diet. We can use the function shapiro.test() to check whether the data distribution is normal. The hypotheses are \[\begin{eqnarray} H_0\colon &~&\mbox{ the data are normally distributed}\\ H_1\colon &~& \mbox{ the data are not normally distributed} \end{eqnarray}\]

shapiro.test(ChickWeight_T21$weight[Diet == 1])

    Shapiro-Wilk normality test

data:  ChickWeight_T21$weight[Diet == 1]
W = 0.95602, p-value = 0.5905
shapiro.test(ChickWeight_T21$weight[Diet == 2])

    Shapiro-Wilk normality test

data:  ChickWeight_T21$weight[Diet == 2]
W = 0.97725, p-value = 0.9488
shapiro.test(ChickWeight_T21$weight[Diet == 3])

    Shapiro-Wilk normality test

data:  ChickWeight_T21$weight[Diet == 3]
W = 0.97045, p-value = 0.895
shapiro.test(ChickWeight_T21$weight[Diet == 4])

    Shapiro-Wilk normality test

data:  ChickWeight_T21$weight[Diet == 4]
W = 0.88694, p-value = 0.1855

A large \(p\)-value indicates that the data are consistent with a normal distribution. Here the null hypothesis is that the data distribution is normal; therefore, a large \(p\)-value means we fail to reject the null hypothesis. There are also other visual representations to check the normality assumption.

par(mfrow = c(2,2))
qqnorm(ChickWeight_T21$weight[Diet == 1], col = "red", cex = 1.3, lwd = 2)
qqline(ChickWeight_T21$weight[Diet == 1])
qqnorm(ChickWeight_T21$weight[Diet == 2], col = "red", cex = 1.3, lwd = 2)
qqline(ChickWeight_T21$weight[Diet == 2])
qqnorm(ChickWeight_T21$weight[Diet == 3], col = "red", cex = 1.3, lwd = 2)
qqline(ChickWeight_T21$weight[Diet == 3])
qqnorm(ChickWeight_T21$weight[Diet == 4], col = "red", cex = 1.3, lwd = 2)
qqline(ChickWeight_T21$weight[Diet == 4])

9.2.8 Some more options for boxplots using ggplot2

The package ggplot2 offers several options for beautiful graphics in R. In the following, we demonstrate how to draw side-by-side boxplots of a continuous variable with respect to a factor variable in the data.

library(ggplot2)
ggplot2::ggplot(data = ChickWeight_T21,
                aes(Diet, weight, fill = Diet)) +
  geom_boxplot() + 
  scale_y_continuous("Chick weight (gm)",
                     breaks = seq(50, 400, by = 50)) + 
  labs(title = "Distribution of weight at Day 21", x = "Diet")

library(ggplot2)
ggplot2::ggplot(data = ChickWeight_T21,
                aes(Diet, weight, fill = Diet)) +
  geom_boxplot() + 
  scale_y_continuous("Chick weight (gm)",
                     breaks = seq(50, 400, by = 50)) + 
  labs(title = "Distribution of weight at Day 21", x = "Diet") +
  geom_dotplot(binaxis='y', stackdir='center', dotsize=0.8)
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

To visualize the shape of the distribution with respect to different categories, we can also use the density plot.

library(ggplot2)
ggplot(ChickWeight_T21, aes(x = weight, color = Diet)) +
  geom_density(linewidth = 1)

We can also add vertical lines at the mean value of each diet category. First we need to compute the mean of each group.

Diet_means = c(mean(ChickWeight_T21$weight[Diet==1]),
               mean(ChickWeight_T21$weight[Diet==2]),
               mean(ChickWeight_T21$weight[Diet==3]),
               mean(ChickWeight_T21$weight[Diet==4]))
library(ggplot2)
ggplot(ChickWeight_T21, aes(x = weight, fill = Diet)) +
  geom_density(linewidth = 1) +
  geom_vline(xintercept = Diet_means, linetype = "dashed")  # vertical lines at the group means

9.2.9 Testing for equality of variances

var.test(ChickWeight_T21$weight[Diet == 1],
         ChickWeight_T21$weight[Diet == 2])

    F test to compare two variances

data:  ChickWeight_T21$weight[Diet == 1] and ChickWeight_T21$weight[Diet == 2]
F = 0.56439, num df = 15, denom df = 9, p-value = 0.3146
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1497316 1.7624337
sample estimates:
ratio of variances 
         0.5643921 

The Levene test (Levene 1960) can be used to check the equality of variances for multiple groups.

library(car)
Loading required package: carData
car::leveneTest(weight ~ Diet, data = ChickWeight_T21, center = mean)
Levene's Test for Homogeneity of Variance (center = mean)
      Df F value Pr(>F)
group  3  1.2412 0.3072
      41               
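As an aside, base R provides bartlett.test() for the same purpose; note that Bartlett's test is more sensitive to departures from normality than Levene's test. A minimal sketch:

# Bartlett's test for homogeneity of variances across the four diets
bartlett.test(weight ~ Diet, data = ChickWeight_T21)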

9.3 The Joyner–Boore Attenuation Data

The attenu data set, known as the Joyner–Boore attenuation data, is available in the datasets package in R. It gives the peak accelerations measured at various observation stations for 23 earthquakes in California. The data have been used by various workers to estimate the attenuating effect of distance on ground acceleration. Type help("attenu") in the console for more information; interested readers can also check the article by Boore and Joyner (1982).

library(datasets)
# datasets::attenu
head(attenu)
  event mag station dist accel
1     1 7.0     117   12 0.359
2     2 7.4    1083  148 0.014
3     2 7.4    1095   42 0.196
4     2 7.4     283   85 0.135
5     2 7.4     135  107 0.062
6     2 7.4     475  109 0.054
names(attenu)                 # variable names
[1] "event"   "mag"     "station" "dist"    "accel"  
dim(attenu)                   # dimension of the data
[1] 182   5
summary(attenu)               # basic summary of the data
     event            mag           station         dist       
 Min.   : 1.00   Min.   :5.000   117    :  5   Min.   :  0.50  
 1st Qu.: 9.00   1st Qu.:5.300   1028   :  4   1st Qu.: 11.32  
 Median :18.00   Median :6.100   113    :  4   Median : 23.40  
 Mean   :14.74   Mean   :6.084   112    :  3   Mean   : 45.60  
 3rd Qu.:20.00   3rd Qu.:6.600   135    :  3   3rd Qu.: 47.55  
 Max.   :23.00   Max.   :7.700   (Other):147   Max.   :370.00  
                                 NA's   : 16                   
     accel        
 Min.   :0.00300  
 1st Qu.:0.04425  
 Median :0.11300  
 Mean   :0.15422  
 3rd Qu.:0.21925  
 Max.   :0.81000  
                  
  • There are 16 missing observations, denoted by \(\texttt{NA}\), in the station column of the data.
  • The other variables are continuous in nature; their basic numerical summaries are provided above, and they do not have any missing observations.
# suppose we remove all stations whose ID is missing
complete.cases(attenu)    # which rows are complete
  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
 [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
 [97]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[109]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
[121]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[145]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[157]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[169]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[181]  TRUE  TRUE
attenu[81, ]
   event mag station dist accel
81    17 7.6    <NA> 32.9 0.064
# indices of rows with at least one missing observation
miss_row = which(!complete.cases(attenu))
print(miss_row)
 [1]  79  81  94  96  99 107 108 114 116 118 123 126 128 155 156 160
# Creating a new data after removing the missing observations
attenu_new = attenu[ -miss_row, ] 
attenu_new$station
  [1] 117  1083 1095 283  135  475  113  1008 1028 2001 117  1117 1438 1083 1013
 [16] 1014 1015 1016 1095 1011 1028 270  280  116  266  117  113  112  130  475 
 [31] 269  135  1093 1093 111  116  290  112  113  128  126  127  141  266  110 
 [46] 1027 111  125  135  475  262  269  1052 411  290  130  272  1096 1102 112 
 [61] 113  1028 2714 2708 2715 3501 655  272  1032 1377 1028 1250 1051 1293 1291
 [76] 1292 283  885  2734 2728 1413 1445 1408 1411 1410 1409 1377 1492 1251 1422
 [91] 1376 286  5028 942  5054 958  952  5165 117  955  5055 5060 412  5053 5058
[106] 5057 5051 5115 931  5056 5059 5061 5062 5052 724  5066 5050 2316 5055 942 
[121] 5028 5165 952  958  955  117  412  5053 5054 5058 5057 5115 5056 5060 1030
[136] 1418 1383 1308 1298 1299 1219 1030 1418 1383 1299 1308 1219 1456 5045 5044
[151] 5160 5043 5047 c168 5068 c118 5042 5067 5049 c204 5070 c266 c203 5069 5073
[166] 5072
117 Levels: 1008 1011 1013 1014 1015 1016 1027 1028 1030 1032 1051 1052 ... c266
dim(attenu_new)
[1] 166   5
summary(attenu_new)
     event            mag           station         dist       
 Min.   : 1.00   Min.   :5.000   117    :  5   Min.   :  0.60  
 1st Qu.: 9.00   1st Qu.:5.300   1028   :  4   1st Qu.: 12.03  
 Median :18.00   Median :6.100   113    :  4   Median : 24.40  
 Mean   :14.31   Mean   :6.064   112    :  3   Mean   : 48.39  
 3rd Qu.:20.00   3rd Qu.:6.600   135    :  3   3rd Qu.: 49.80  
 Max.   :23.00   Max.   :7.700   475    :  3   Max.   :370.00  
                                 (Other):144                   
     accel       
 Min.   :0.0030  
 1st Qu.:0.0400  
 Median :0.1100  
 Mean   :0.1462  
 3rd Qu.:0.2000  
 Max.   :0.8100  
                 
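The same cleaned data set can be obtained directly with na.omit(), which drops every row containing at least one NA; a quick sketch (attenu_new2 is our name):

attenu_new2 = na.omit(attenu)   # drops all rows with any missing value
dim(attenu_new2)                # should also be 166 x 5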

9.3.1 Exploring individual variables

par(mfrow  = c(1,2))
w = table(attenu_new$event)
barplot(w, xlab = "event ID", width = 1,
        col = "grey")
barplot(w, ylab = "event ID", width = 1,
        col = "blue", horiz = TRUE)

par(mfrow = c(1,2))
attenu_new$mag
  [1] 7.0 7.4 7.4 7.4 7.4 7.4 7.4 7.4 7.4 7.4 7.4 5.3 6.1 6.1 6.1 6.1 6.1 6.1
 [19] 6.1 6.1 6.1 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 5.6 5.7 5.3 5.3
 [37] 5.3 5.3 5.3 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.6
 [55] 6.6 6.6 6.6 6.6 6.6 6.6 6.6 5.3 7.7 7.7 7.7 6.2 5.6 5.6 5.2 5.2 5.2 5.2
 [73] 6.0 6.0 6.0 6.0 5.1 5.1 7.6 7.6 5.8 5.8 5.8 5.8 5.8 5.8 5.8 5.8 5.8 5.8
 [91] 5.8 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5
[109] 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
[127] 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.8 5.8 5.8 5.8 5.8 5.8 5.8 5.5 5.5 5.5
[145] 5.5 5.5 5.5 5.5 5.3 5.3 5.3 5.3 5.3 5.3 5.3 5.3 5.3 5.3 5.3 5.3 5.3 5.3
[163] 5.3 5.3 5.3 5.3
hist(attenu_new$mag, xlab = "Moment Magnitude", main = " ")

hist(attenu_new$mag, xlab = "Moment Magnitude", main = " ",
     probability = TRUE)

par(mfrow = c(1,2))
hist(attenu_new$mag, xlab = "Moment Magnitude", main = " ",
     probability = TRUE)
boxplot(attenu_new$mag, horizontal = TRUE, lwd = 2,
        xlab = "Moment Magnitude")

par(mfrow = c(1,2))
hist(attenu_new$dist, probability = TRUE,
     xlab = "Station-hypocenter distance (km)", main = "")
boxplot(attenu_new$dist, xlab = "Station-hypocenter distance (km)",
        horizontal = TRUE, lwd = 2, col = "darkgrey")

par(mfrow = c(1,2))
hist(attenu_new$accel, probability = TRUE,
     xlab = "Peak acceleration (g)", main = "")
boxplot(attenu_new$accel, xlab = "Peak acceleration (g)",
        horizontal = TRUE, lwd = 2, col = "darkgrey")

Important functions:

  • complete.cases, dim, which
  • summary
  • boxplot, hist, barplot

9.4 The cars dataset

The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s (Ezekiel 1930).

# datasets::cars
head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
tail(cars)
   speed dist
45    23   54
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85
class(cars)
[1] "data.frame"
dim(cars)
[1] 50  2
names(cars)
[1] "speed" "dist" 
complete.cases(cars)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE TRUE TRUE TRUE

The function complete.cases() returns FALSE for any row in which the value of at least one variable is missing. Also try using !complete.cases().
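For instance, the complementary form directly lists the incomplete rows; a small sketch:

which(!complete.cases(cars))   # row indices with missing values (none here)
sum(!complete.cases(cars))     # number of incomplete rows; 0 for cars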

9.4.1 Basic data exploration

Looking at individual variables is always suggested as part of the EDA.

par(mfrow= c(2,3))
boxplot(cars$speed, col = "grey", lwd = 2,
        xlab = "speed (mph)")
hist(cars$speed, probability = TRUE, col = "grey", 
        xlab = "speed (mph)", main = "")
plot(density(cars$speed), main = "speed (mph)")
rug(jitter(cars$speed))
boxplot(cars$dist, col = "grey", lwd = 2, 
        xlab = "Stopping distance (ft)")
hist(cars$dist, probability = TRUE, col = "grey", 
        xlab = "Stopping distance (ft)", main = "")
plot(density(cars$dist), main = "Stopping distance (ft)")
rug(jitter(cars$dist))

We can use the shapiro.test() function to test whether the data are normally distributed.

shapiro.test(cars$speed)

    Shapiro-Wilk normality test

data:  cars$speed
W = 0.97765, p-value = 0.4576
shapiro.test(cars$dist)

    Shapiro-Wilk normality test

data:  cars$dist
W = 0.95144, p-value = 0.0391

From the output, at the 5% level of significance, we reject the null hypothesis that the variable Stopping distance (ft) is normally distributed, using the Shapiro–Wilk test (Royston 1995). However, it is important to note that the \(p\)-value is 0.0391; therefore, we cannot reject the null hypothesis at the \(1\%\) level of significance.

A natural question arises: how does Stopping distance (ft) vary as a function of speed (mph)? First let us explore the scatter plot of these two variables.

par(mfrow = c(1,1))
plot(cars$speed, cars$dist, type = "p", pch = 19,
     col = "darkgrey", ylab = "Stopping distance (ft)",
     xlab = "speed (mph)")

To explore a possible linear relationship between Stopping distance and Speed, we first check whether there is a significant correlation between them. Therefore, we test the following hypotheses for the correlation coefficient \(\rho = \mbox{Corr}(\mbox{dist, speed})\)

\[H_0\colon \rho = 0~~~~\mbox{against}~~~~H_1\colon \rho\ne 0.\] The test is performed using the function cor.test().

cor(cars$speed, cars$dist)
[1] 0.8068949
cor.test(cars$speed, cars$dist)

    Pearson's product-moment correlation

data:  cars$speed and cars$dist
t = 9.464, df = 48, p-value = 1.49e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816422 0.8862036
sample estimates:
      cor 
0.8068949 
fit = lm(dist ~ speed, data = cars)
summary(fit)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
plot(cars$speed, cars$dist, type = "p", pch = 19,
     col = "darkgrey", ylab = "Stopping distance (ft)",
     xlab = "speed (mph)")
abline(fit, col = "red", lwd = 3, lty = 2)

Fitting the simple linear regression model:
  • The beta-coefficient of speed is statistically significant, which means the stopping distance depends on speed.
  • The Multiple R-squared = 0.6511, which means that approximately \(65\%\) of the variability in dist can be explained by the variable speed through a linear function.
coefficients(fit)
(Intercept)       speed 
 -17.579095    3.932409 

In this problem, we have assumed a linear relationship between dist and speed. However, we do not know whether the linearity assumption is a legitimate approximation to the true (unknown) relationship. We perform some diagnostic checks to see whether the linearity assumption is tenable.

par(mfrow = c(2,2))
plot(fit)

Regression diagnostic plots for the simple linear regression. The top panel indicates the existence of outlying observations; in particular, rows 23, 35 and 49 have been identified as outlying.
par(mfrow = c(1,3))
error = residuals(fit)           # residuals
hist(error, probability = TRUE, xlab = expression(widehat(epsilon)),
     main = "")
boxplot(error, xlab = expression(widehat(epsilon)),
        lwd = 2, col = "darkgrey")
library(car)
qqPlot(error, ylab = expression(widehat(epsilon)))

The car package has been used to draw the quantile plot, which plots the empirical quantiles of a variable against the theoretical quantiles of a comparison distribution (here, the normal) (Fox 2016).
[1] 49 23
  • There may be some nonlinearity present.
  • The regression diagnostics identified some outlying observations.

You can either remove those outliers and refit the linear regression, or go for a quadratic regression.

9.4.2 Quadratic fitting

We fit the following quadratic polynomial equation

\[\mbox{dist} = \beta_0 + \beta_1 \times \mbox{speed} + \beta_2 \times \mbox{speed}^2.\] Using the following R code, we can obtain the least squares estimates of the coefficients \(\beta_0, \beta_1, \beta_2\). The symbol \(\texttt{I}(\cdot)\) is used as a wrapper for polynomial terms when creating the formula for the lm() function.

par(mfrow = c(1,1))
fit2 = lm(dist ~ speed + I(speed^2), data = cars)
summary(fit2)

Call:
lm(formula = dist ~ speed + I(speed^2), data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.720  -9.184  -3.188   4.628  45.152 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.47014   14.81716   0.167    0.868
speed        0.91329    2.03422   0.449    0.656
I(speed^2)   0.09996    0.06597   1.515    0.136

Residual standard error: 15.18 on 47 degrees of freedom
Multiple R-squared:  0.6673,    Adjusted R-squared:  0.6532 
F-statistic: 47.14 on 2 and 47 DF,  p-value: 5.852e-12
coef(fit2)
(Intercept)       speed  I(speed^2) 
  2.4701378   0.9132876   0.0999593 
dist_hat = coef(fit2)[1] + coef(fit2)[2]*cars$speed + coef(fit2)[3]*cars$speed^2

plot(cars$speed, cars$dist, type = "p", pch = 19,
     col = "darkgrey", ylab = "Stopping distance (ft)",
     xlab = "speed (mph)")
abline(fit, col = "red", lwd = 3, lty = 2)
lines(cars$speed, dist_hat, type = "l", col= "blue",
      lwd = 3)
legend("topleft", legend = c("linear", "quadratic"),
       col = c("red", "blue"), lwd = c(3,3),
       bty = "n", lty = c(2,1))

9.4.3 Regression diagnostic for quadratic regression

par(mfrow = c(1,3))
error = residuals(fit2)
hist(error, probability = TRUE, xlab = expression(widehat(epsilon)),
     main = "")
boxplot(error, xlab = expression(widehat(epsilon)),
        lwd = 2, col = "darkgrey")
library(car)
qqPlot(error, ylab = expression(widehat(epsilon)))

[1] 23 49

We can obtain the diagnostic plot using the plot() function.

par(mfrow = c(2,2))
plot(fit2)

Regression diagnostics for the quadratic fit. In fact, the Q-Q residuals plot flags a larger number of outliers, and to some extent we get the feeling that quadratic regression might not be a good option, as there is a deviation from the normality of the residuals. In addition, the residuals versus fitted values plot provides a slight indication of heteroscedasticity.

There seems to be a slight improvement using a quadratic equation rather than a linear one. However, we can check statistically whether this improvement is significant. The function anova() is used to compare the fits of the two nested models.

anova(fit, fit2)
Analysis of Variance Table

Model 1: dist ~ speed
Model 2: dist ~ speed + I(speed^2)
  Res.Df   RSS Df Sum of Sq     F Pr(>F)
1     48 11354                          
2     47 10825  1    528.81 2.296 0.1364

The qqPlot() function from the car package flagged rows 23 and 49 as outliers, and the earlier regression diagnostics additionally flagged row 35, all contributing to the deviation from normality of the residuals. Let us remove these three observations.

fit_new = lm(dist ~ speed, data = cars[-c(23, 35, 49),])
summary(fit_new)

Call:
lm(formula = dist ~ speed, data = cars[-c(23, 35, 49), ])

Residuals:
    Min      1Q  Median      3Q     Max 
-25.032  -7.686  -1.032   6.576  26.185 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -15.1371     5.3053  -2.853  0.00652 ** 
speed         3.6085     0.3302  10.928    3e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.84 on 45 degrees of freedom
Multiple R-squared:  0.7263,    Adjusted R-squared:  0.7202 
F-statistic: 119.4 on 1 and 45 DF,  p-value: 3.003e-14
par(mfrow = c(1,1))
plot(cars$speed, cars$dist, type = "p", pch = 19,
     col = "darkgrey", ylab = "Stopping distance (ft)",
     xlab = "speed (mph)")
abline(fit, col = "red", lwd = 3, lty = 2)
abline(fit_new, col = "magenta", lwd = 3, lty = 2)
points(cars[c(23,35,49),], col = "red", cex = 2,
       lwd = 3, pch = 4)

The dotted magenta line indicates the fitted linear regression line after removing the three outlying observations as identified by the regression diagnostics.
par(mfrow = c(2,2))
plot(fit_new)

Regression diagnostic for the linear regression model after removing three outlying observations.

Although removing the designated outliers identified by the regression diagnostic plots gave a noticeable improvement in the \(R^2\) value (approximately \(73\%\)), the diagnostic plots of the new regression fit give stronger evidence of a nonlinear relationship.

fit3 = lm(dist ~ speed + I(speed^2) + I(speed^3) , data = cars)
summary(fit3)

Call:
lm(formula = dist ~ speed + I(speed^2) + I(speed^3), data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.670  -9.601  -2.231   7.075  44.691 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.50505   28.40530  -0.687    0.496
speed         6.80111    6.80113   1.000    0.323
I(speed^2)   -0.34966    0.49988  -0.699    0.488
I(speed^3)    0.01025    0.01130   0.907    0.369

Residual standard error: 15.2 on 46 degrees of freedom
Multiple R-squared:  0.6732,    Adjusted R-squared:  0.6519 
F-statistic: 31.58 on 3 and 46 DF,  p-value: 3.074e-11
dist_hat = predict(fit3)
print(dist_hat)
        1         2         3         4         5         6         7         8 
 2.760981  2.760981 14.485912 14.485912 17.774747 20.856365 23.792277 23.792277 
        9        10        11        12        13        14        15        16 
23.792277 26.643997 26.643997 29.473036 29.473036 29.473036 29.473036 32.340907 
       17        18        19        20        21        22        23        24 
32.340907 32.340907 32.340907 35.309122 35.309122 35.309122 35.309122 38.439194 
       25        26        27        28        29        30        31        32 
38.439194 38.439194 41.792634 41.792634 45.430956 45.430956 45.430956 49.415670 
       33        34        35        36        37        38        39        40 
49.415670 49.415670 49.415670 53.808290 53.808290 53.808290 58.670328 58.670328 
       41        42        43        44        45        46        47        48 
58.670328 58.670328 58.670328 70.048706 76.688071 84.042903 84.042903 84.042903 
       49        50 
84.042903 92.174715 
par(mfrow = c(1,1))
plot(cars$speed, cars$dist, type = "p", pch = 19,
     col = "darkgrey", ylab = "Stopping distance (ft)",
     xlab = "speed (mph)")
abline(fit, col = "red", lwd = 2)
lines(cars$speed, predict(fit2), col = "magenta", lwd = 2)
lines(cars$speed, predict(fit3), col = "blue", lwd = 2)
legend("topleft", legend = c("linear", "quadratic", "cubic"),
       col = c("red","magenta", "blue"), lwd = c(2,2,2),
       bty = "n")

Let us check how the \(R^2\) value changes as the degree of the polynomial increases.

fit4 = lm(dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4) , 
          data = cars)
summary(fit4)

Call:
lm(formula = dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4), 
    data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.701  -8.766  -2.861   7.158  42.186 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  45.845412  60.849115   0.753    0.455
speed       -18.962244  22.296088  -0.850    0.400
I(speed^2)    2.892190   2.719103   1.064    0.293
I(speed^3)   -0.151951   0.134225  -1.132    0.264
I(speed^4)    0.002799   0.002308   1.213    0.232

Residual standard error: 15.13 on 45 degrees of freedom
Multiple R-squared:  0.6835,    Adjusted R-squared:  0.6554 
F-statistic:  24.3 on 4 and 45 DF,  p-value: 9.375e-11
fit5 = lm(dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4) + I(speed^5) , 
          data = cars)
summary(fit5)

Call:
lm(formula = dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4) + 
    I(speed^5), data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.370  -8.165  -2.395   7.294  42.342 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.650e+00  1.401e+02  -0.019    0.985
speed        5.484e+00  6.736e+01   0.081    0.935
I(speed^2)  -1.426e+00  1.155e+01  -0.124    0.902
I(speed^3)   1.940e-01  9.087e-01   0.214    0.832
I(speed^4)  -1.004e-02  3.342e-02  -0.300    0.765
I(speed^5)   1.790e-04  4.648e-04   0.385    0.702

Residual standard error: 15.27 on 44 degrees of freedom
Multiple R-squared:  0.6846,    Adjusted R-squared:  0.6487 
F-statistic:  19.1 on 5 and 44 DF,  p-value: 4.65e-10
R2_fit = summary(fit)$r.squared
R2_fit2 = summary(fit2)$r.squared
R2_fit3 = summary(fit3)$r.squared
R2_fit4 = summary(fit4)$r.squared
R2_fit5 = summary(fit5)$r.squared

adjR2_fit = summary(fit)$adj.r.squared
adjR2_fit2 = summary(fit2)$adj.r.squared
adjR2_fit3 = summary(fit3)$adj.r.squared
adjR2_fit4 = summary(fit4)$adj.r.squared
adjR2_fit5 = summary(fit5)$adj.r.squared

R2_vals = c(R2_fit, R2_fit2, R2_fit3, R2_fit4, R2_fit5)
adjR2_vals = c(adjR2_fit, adjR2_fit2, adjR2_fit3, 
               adjR2_fit4, adjR2_fit5)
plot(1:5, R2_vals, type = "b", pch = 19, cex = 1.3,
      lwd = 2, col = "darkgrey", ylab = expression(R^2),
     xlab = "degree (p)", ylim = c(0.6, 0.7))
lines(1:5, adjR2_vals, col = "black", lwd = 2, type = "b", pch = 19)
legend("topleft", c("R squared", "adj R squared"),
       col = c("darkgrey", "black"), lwd = c(2,2), bty = "n")

anova(fit, fit2, fit3, fit4, fit5)
Analysis of Variance Table

Model 1: dist ~ speed
Model 2: dist ~ speed + I(speed^2)
Model 3: dist ~ speed + I(speed^2) + I(speed^3)
Model 4: dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4)
Model 5: dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4) + I(speed^5)
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1     48 11354                           
2     47 10825  1    528.81 2.2671 0.1393
3     46 10634  1    190.35 0.8161 0.3712
4     45 10298  1    336.55 1.4428 0.2361
5     44 10263  1     34.59 0.1483 0.7020

9.5 Leverage and Influence Plot
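The Cook's distance of an observation measures how much the fitted values change when that observation is deleted, while the dfbeta plots (dfbetaPlots() from the car package) show the corresponding change in each estimated coefficient.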

par(mfrow = c(1,2))
plot(cooks.distance(fit), col = "darkgrey", pch = 19, cex = 1.1)
dfbetaPlots(fit, pch = 19, col = "grey", cex = 1.4)

9.5.1 A cross-verification from the numerical methods lectures

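In matrix notation, with \(X\) the design matrix and \(Y\) the response vector, the least squares estimate is

\[\widehat{\beta} = (X^\top X)^{-1} X^\top Y,\]

which the following code evaluates directly.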
X = cbind(rep(1,nrow(cars)), cars$speed)
print(X)
      [,1] [,2]
 [1,]    1    4
 [2,]    1    4
 [3,]    1    7
 [4,]    1    7
 [5,]    1    8
 [6,]    1    9
 [7,]    1   10
 [8,]    1   10
 [9,]    1   10
[10,]    1   11
[11,]    1   11
[12,]    1   12
[13,]    1   12
[14,]    1   12
[15,]    1   12
[16,]    1   13
[17,]    1   13
[18,]    1   13
[19,]    1   13
[20,]    1   14
[21,]    1   14
[22,]    1   14
[23,]    1   14
[24,]    1   15
[25,]    1   15
[26,]    1   15
[27,]    1   16
[28,]    1   16
[29,]    1   17
[30,]    1   17
[31,]    1   17
[32,]    1   18
[33,]    1   18
[34,]    1   18
[35,]    1   18
[36,]    1   19
[37,]    1   19
[38,]    1   19
[39,]    1   20
[40,]    1   20
[41,]    1   20
[42,]    1   20
[43,]    1   20
[44,]    1   22
[45,]    1   23
[46,]    1   24
[47,]    1   24
[48,]    1   24
[49,]    1   24
[50,]    1   25
Y = cars$dist
beta_star = (solve(t(X)%*%X))%*%t(X)%*%Y
print(beta_star)
           [,1]
[1,] -17.579095
[2,]   3.932409
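As a numerical aside, explicitly inverting \(X^\top X\) is usually avoided in practice; solving the normal equations is preferred. A minimal sketch giving the same estimates:

# numerically preferable: solve the normal equations instead of inverting
beta_star2 = solve(crossprod(X), crossprod(X, Y))
print(beta_star2)   # same estimates as beta_star (and coefficients(fit))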

9.6 Questions

  • Using the airquality dataset, answer the following descriptive statistics questions:
    • Using the help(airquality) command, obtain a basic description of the data and also check the source of the data set.
    • Calculate the mean, median, standard deviation, and range of the Ozone variable.
    • How many missing values are there in the Ozone variable?
    • Draw boxplots and histograms of each variable present in the airquality data. Using the boxplots in R, identify the outliers.
    • Calculate the average Temp for each month (May through September).
    • Which month recorded the highest average Temp?
    • What is the correlation between Ozone and Solar.R? Interpret the result.
    • Create a scatter plot of Wind vs. Temp. Does the plot suggest a relationship?
    • Plot a histogram of the Wind variable. Comment on the shape of the distribution.
    • What is the interquartile range (IQR) of Wind?
    • Identify the percentage of rows with missing values in the dataset.
    • Provide one strategy to handle missing values in the Ozone variable and explain its implications.
  • The rivers dataset contains the lengths (in miles) of 141 major rivers in North America.
    • What is the mean, median, and standard deviation of the river lengths?
    • What is the range (minimum and maximum) of the river lengths?
    • Create a histogram of the river lengths. What is the shape of the distribution (e.g., symmetric, skewed)?
    • How many rivers are longer than 1,000 miles?
    • Create a boxplot of the river lengths. Are there any outliers?
    • What is the interquartile range (IQR) of the river lengths?
    • What percentage of rivers fall within one standard deviation of the mean?
    • Apply a log transformation to the river lengths. How does the histogram of the transformed data compare to the original?
    • Why might a log transformation be useful for this dataset?
    • What is the mean length of the 10 shortest rivers?
    • What is the mean length of the 10 longest rivers?
    • How does the variability (standard deviation) compare between the shortest and longest rivers?
    • Create a density plot of the river lengths. How does it compare to the histogram?
    • Overlay a rug plot on the histogram. Does it provide additional insights about the data distribution?
    • Compare the distribution of river lengths to a normal distribution using a Q-Q plot. What do you observe?
    • Test the normality of the river length data using the Shapiro-Wilk test. What is the result?
    • Identify any regions or patterns in the data that suggest natural groupings or clusters of river lengths.
  • The EuStockMarkets dataset contains daily closing prices of major European stock indices from 1991 to 1998.
    • What are the mean, median, standard deviation, and range of closing prices for each stock index (DAX, SMI, CAC, and FTSE)?
    • Which stock index has the highest average closing price over the entire dataset?
    • What is the minimum and maximum closing price for each index?
    • Create histograms for the closing prices of each stock index.
    • Which indices show skewed distributions?
    • Plot density plots for the indices. Do they suggest unimodal or multimodal distributions?
    • Use boxplots to compare the distributions of the four stock indices. Are there any outliers?
    • (Temporal trends) Plot the time series of the closing prices for each index. What trends do you observe?
    • Identify the periods of significant increases or decreases for each stock index.
    • Which index shows the highest volatility (largest fluctuations) over time?
    • (Correlation analysis) Compute the pairwise correlations between the indices. Which pair of indices has the strongest correlation?
    • Create a scatterplot matrix of the indices. Do the relationships appear linear?
    • Conduct a hypothesis test to determine if the correlation between DAX and CAC is significantly different from zero.
    • (Volatility) Calculate the daily returns \((P_t - P_{t-1})/P_{t-1}\) for each index. Which index has the highest average daily return?
    • Plot the time series of daily returns. Which index shows the most frequent large spikes?
    • Identify the days with the largest positive and negative returns for each index.
    • Use Q-Q plots to assess whether the closing prices of each index follow a normal distribution. Which indices deviate most from normality?
    • Suggest real-world economic events during the dataset period (1991–1998) that might have influenced the trends in these indices.
  • The CO2 dataset contains data on carbon dioxide uptake in grass plants under varying environmental conditions. The CO2 uptake of six plants from Quebec and six plants from Mississippi was measured at several levels of ambient CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted.
    • What is the mean, median, standard deviation, and range of the variable uptake?
    • What is the mean and standard deviation of conc across all plants?
    • Which plant has the highest carbon dioxide uptake (uptake)?
    • (Comparison among groups) What is the average uptake for each Type of plant (Quebec and Mississippi)?
    • Compare the mean and median uptake for plants under different Treatment conditions (nonchilled vs chilled).
    • Calculate the mean uptake for each combination of Type and Treatment.
    • Which group has the highest mean uptake?
    • (Understanding individual distributions) Create histograms of uptake for the two Type groups (Quebec and Mississippi).
    • How do the distributions differ?
    • Plot a boxplot of uptake by Treatment. Are there any outliers?
    • Create density plots of uptake for the two Type groups. What do you observe about their shapes?
    • Create a scatter plot of uptake vs. conc. What kind of relationship do you observe?
    • Fit a linear model of uptake as a function of conc. What does the slope of the line indicate?
    • Does the relationship between uptake and conc differ across Type or Treatment? Visualize and explain.
    • (Advanced Visualizations) Create a faceted scatterplot of uptake vs. conc by Type and Treatment. What patterns emerge?
    • Create a heatmap of the average uptake for each combination of Type, Treatment, and conc ranges. What does this visualization reveal?
    • Plot boxplots of uptake grouped by individual Plant. Which plants have the highest median and variability?
    • (Transformation) Apply a log transformation to uptake. How does the distribution change?
    • Group conc into intervals (e.g., 0–200, 201–400, etc.) and calculate the average uptake for each interval. What trend do you observe?
DQ = CO2[1:42, ]
DM = CO2[43:84, ]
par(mfrow = c(1,2))
plot(DQ$conc, DQ$uptake, type = "p", 
     col = ifelse(DQ$Treatment =="nonchilled", "red", "magenta"), 
     pch = 19, cex = 1.2, main = "Quebec",
     xlab = "Conc", ylab = "Uptake")
legend("bottomright", legend = c("nonchilled", "chilled"),
       col = c("red", "magenta"), pch = c(19, 19),
       cex = c(1.2, 1.2), bty = "n")
plot(DM$conc, DM$uptake, type = "p", 
     col = ifelse(DM$Treatment == "nonchilled", "red", "magenta"),   # note: DM, not DQ
     pch = 19, cex = 1.2, main = "Mississippi",
     xlab = "Conc", ylab = "Uptake")
legend("bottomright", legend = c("nonchilled", "chilled"),
       col = c("red", "magenta"), pch = c(19, 19),
       cex = c(1.2, 1.2), bty = "n")
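As a side note, the two groups can also be extracted without relying on the row order; a small sketch:

# equivalent subsetting by the Type factor instead of row positions
DQ = subset(CO2, Type == "Quebec")
DM = subset(CO2, Type == "Mississippi")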