class: center, middle, inverse, title-slide

# Summarizing Data Part 2
## DATA 606 - Statistics & Probability for Data Analytics
### Jason Bryer, Ph.D. and Angela Lui, Ph.D.
### September 13, 2023

---
# Announcements

Due to a scheduling conflict, next week's meetup will be on Tuesday, September 19th, at 7:00pm.

---
# One Minute Paper Results

.pull-left[
**What was the most important thing you learned during this class?**
<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]
.pull-right[
**What important question remains unanswered for you?**
<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
class: middle

# Grammar of Graphics

.center[
<img src="images/ggplot2_masterpiece.png" height="550" />
]

---
# Data Visualizations with ggplot2 <img src="images/hex/ggplot2.png" class="title-hex">

* `ggplot2` is an R package that provides an alternative to base R graphics, based upon Wilkinson’s (2005) Grammar of Graphics.
* `ggplot2` is, in general, more flexible for creating "prettier" and more complex plots.
* It works by creating layers of different types of objects/geometries (e.g., bars, points, lines, polygons).

`ggplot2` has at least three ways of creating plots:

1. `qplot`
2. `ggplot(...) + geom_XXX(...) + ...`
3. `ggplot(...) + layer(...)`

* We will focus only on the second.
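---
# A Layered Plot (Sketch)

To make the second form concrete, here is a minimal sketch of a plot built from two layers. It assumes the built-in `mtcars` data frame, which is not used elsewhere in these slides and stands in for any data frame; the object name `p` is also just for illustration.

```r
library(ggplot2)

# mtcars ships with R; it is an assumption here, not the course data.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # data and aesthetic mappings
  geom_point() +                              # first layer: one point per car
  geom_smooth(method = "lm", se = FALSE)      # second layer: a fitted straight line

p  # printing the object draws the plot
```

Each `+` adds another layer or option to the same plot object, which is why the second form scales well to more complex figures.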
---
# Parts of a `ggplot2` Statement <img src="images/hex/ggplot2.png" class="title-hex">

* Data: `ggplot(myDataFrame, aes(x=x, y=y))`
* Layers: `geom_point()`, `geom_histogram()`
* Facets: `facet_wrap(~ cut)`, `facet_grid(~ cut)`
* Scales: `scale_y_log10()`
* Other options: `ggtitle('my title')`, `ylim(c(0, 10000))`, `xlab('x-axis label')`

---
# Lots of geoms <img src="images/hex/ggplot2.png" class="title-hex">

```r
ls('package:ggplot2')[grep('^geom_', ls('package:ggplot2'))]
```

```
##  [1] "geom_abline"            "geom_area"              "geom_bar"
##  [4] "geom_bin_2d"            "geom_bin2d"             "geom_blank"
##  [7] "geom_boxplot"           "geom_col"               "geom_contour"
## [10] "geom_contour_filled"    "geom_count"             "geom_crossbar"
## [13] "geom_curve"             "geom_density"           "geom_density_2d"
## [16] "geom_density_2d_filled" "geom_density2d"         "geom_density2d_filled"
## [19] "geom_dotplot"           "geom_errorbar"          "geom_errorbarh"
## [22] "geom_freqpoly"          "geom_function"          "geom_hex"
## [25] "geom_histogram"         "geom_hline"             "geom_jitter"
## [28] "geom_label"             "geom_line"              "geom_linerange"
## [31] "geom_map"               "geom_path"              "geom_point"
## [34] "geom_pointrange"        "geom_polygon"           "geom_qq"
## [37] "geom_qq_line"           "geom_quantile"          "geom_raster"
## [40] "geom_rect"              "geom_ribbon"            "geom_rug"
## [43] "geom_segment"           "geom_sf"                "geom_sf_label"
## [46] "geom_sf_text"           "geom_smooth"            "geom_spoke"
## [49] "geom_step"              "geom_text"              "geom_tile"
## [52] "geom_violin"            "geom_vline"
```

---
# Data Visualization Cheat Sheet <img src="images/hex/ggplot2.png" class="title-hex">

.center[
<a href='https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf'><img src='images/data-visualization-2.1.png' width='700' /></a>
]

---
# Scatterplot <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=pieces, y=US_retailPrice)) + geom_point()
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---
# Scatterplot (cont.)
<img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=pieces, y=US_retailPrice, color=availability)) + geom_point()
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---
# Scatterplot (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs, color=availability)) + geom_point()
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
# Scatterplot (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs)) + geom_point() + facet_wrap(~ availability)
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---
# Boxplots <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x='Lego', y=US_retailPrice)) + geom_boxplot()
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
# Boxplots (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot()
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
# Boxplots (cont.)
<img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot() + coord_flip()
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
# Histograms <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram()
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---
# Histograms (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + scale_x_log10()
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---
# Histograms (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + facet_wrap(~ availability)
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---
# Density Plots <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x = US_retailPrice, color = availability)) + geom_density()
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---
# `ggplot2` aesthetics <img src="images/hex/ggplot2.png" class="title-hex">

.center[
<a href='images/ggplot_aesthetics_cheatsheet.png' target='_new'>
<img src='images/ggplot_aesthetics_cheatsheet.png' height='550' /></a>
]

---
# Likert Scales <img src="images/hex/likert.png" class="title-hex">

Likert scales are a questionnaire response format in which respondents rate items on a scale, usually with four to seven levels (e.g., strongly disagree to strongly agree).
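As a rough illustration of what such a scale looks like as data, the sketch below stores answers to a single four-level item as an ordered factor in base R. The responses and the names `levels4` and `responses` are made up for illustration; they are not taken from the `pisaitems` data.

```r
# Hypothetical four-level response scale, ordered from lowest to highest.
levels4 <- c("Strongly disagree", "Disagree", "Agree", "Strongly agree")

# Made-up answers from five respondents to one item.
responses <- factor(
  c("Agree", "Disagree", "Strongly agree", "Agree", "Strongly disagree"),
  levels = levels4,
  ordered = TRUE
)

table(responses)       # counts per level, in scale order
as.integer(responses)  # integer codes 1-4 following the level order
```

Keeping the levels in an explicit order is what lets summaries and plots present the categories from "strongly disagree" to "strongly agree" rather than alphabetically.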
```r
library(likert)
library(reshape)
data(pisaitems)
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
items24 <- rename(items24, c(
    ST24Q01="I read only if I have to.",
    ST24Q02="Reading is one of my favorite hobbies.",
    ST24Q03="I like talking about books with other people.",
    ST24Q04="I find it hard to finish books.",
    ST24Q05="I feel happy if I receive a book as a present.",
    ST24Q06="For me, reading is a waste of time.",
    ST24Q07="I enjoy going to a bookstore or a library.",
    ST24Q08="I read only to get information that I need.",
    ST24Q09="I cannot sit still and read for more than a few minutes.",
    ST24Q10="I like to express my opinions about books I have read.",
    ST24Q11="I like to exchange books with my friends."))
```

---
# `likert` R Package <img src="images/hex/likert.png" class="title-hex">

```r
l24 <- likert(items24)
summary(l24)
```

```
##                                                        Item      low neutral
## 10   I like to express my opinions about books I have read. 41.07516       0
## 5            I feel happy if I receive a book as a present. 46.93475       0
## 8               I read only to get information that I need. 50.39874       0
## 7                I enjoy going to a bookstore or a library. 51.21231       0
## 3             I like talking about books with other people. 54.99129       0
## 11                I like to exchange books with my friends. 55.54115       0
## 2                    Reading is one of my favorite hobbies. 56.64470       0
## 1                                 I read only if I have to. 58.72868       0
## 4                           I find it hard to finish books. 65.35125       0
## 9  I cannot sit still and read for more than a few minutes. 76.24524       0
## 6                       For me, reading is a waste of time. 82.88729       0
##        high     mean        sd
## 10 58.92484 2.604913 0.9009968
## 5  53.06525 2.466751 0.9446590
## 8  49.60126 2.484616 0.9089688
## 7  48.78769 2.428508 0.9164136
## 3  45.00871 2.328049 0.9090326
## 11 44.45885 2.343193 0.9609234
## 2  43.35530 2.344530 0.9277495
## 1  41.27132 2.291811 0.9369023
## 4  34.64875 2.178299 0.8991628
## 9  23.75476 1.974736 0.8793028
## 6  17.11271 1.810093 0.8611554
```

---
# `likert` Plots <img src="images/hex/likert.png" class="title-hex">

```r
plot(l24)
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---
# `likert` Plots <img src="images/hex/likert.png" class="title-hex">

```r
plot(l24, type='heat')
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---
# `likert` Plots <img src="images/hex/likert.png" class="title-hex">

```r
plot(l24, type='density')
```

<img src="02-Summarizing_Data2_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---
# Pie Charts

There is only one pie chart in *OpenIntro Statistics* (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts, which represent preferences for five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

<center><img src='images/Pie.png' width='500'></center>

---
# Pie Charts

There is only one pie chart in *OpenIntro Statistics* (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts, which represent preferences for five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

<center><img src='images/Pie.png' width='500'></center>

<center><img src='images/Bar.png' width='500'></center>

Source: [https://en.wikipedia.org/wiki/Pie_chart](https://en.wikipedia.org/wiki/Pie_chart).

---
class: middle

# Just say NO to pie charts!
.font150[
"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"]

.right[.font130[John Tukey]]

---
# Additional Resources

For data wrangling:

* `dplyr` website: https://dplyr.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/wrangle-intro.html
* Wrangling penguins tutorial: https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome
* Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

For data visualization:

* `ggplot2` website: https://ggplot2.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/data-visualisation.html
* R Graphics Cookbook: https://r-graphics.org
* Data visualization cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf

---
# One Minute Paper

Complete the one minute paper: https://forms.gle/ngYXfC6jwY3TV6FXA

1. What was the most important thing you learned during this class?
2. What important question remains unanswered for you?