An introduction to R, RStudio, and R Markdown with GIFs!

The release of the bookdown package by RStudio in the summer of 2016 has made it much easier for educators to create open-source materials for their students. Its reach extends beyond academic settings, though: it encourages the free and reproducible sharing of resources and knowledge.

As more and more students and faculty begin to use R in their courses and their research, I wanted to create a resource that helps the complete beginner in programming and statistics learn to work with R more easily. Specifically, the book includes GIF screen recordings that show the reader what specific panes in RStudio do, how an R Markdown document is formatted, and what the resulting HTML file looks like.

Folks who have used a programming language for a while often forget about all the troubles they had when they first got started with it. To help beginners through those early troubles, I’ll be updating the book (specifically Chapter 6) with examples of common R errors, what they mean, and how to remedy them.

The book is entitled “Getting Used to R, RStudio, and R Markdown” and can be found at http://ismayc.github.io/rbasics-book. All of the source code for the book is available at http://github.com/ismayc/rbasics. You can also request edits to the book by clicking on the Edit button near the top of the page. You’ll also find a PDF version of the book there with links to the GIFs (since PDFs can’t have embedded GIFs like webpages can).

Chapter 5 of the book walks through some of the basics of R in working with a data set corresponding to the elements of the periodic table. To expand on this book and on using R in an introductory statistics setting, I’ve also embarked on creating a textbook using bookdown focused on data visualization and resampling techniques to build inferential concepts. The book uses dplyr and ggplot2 packages and focuses on two main data sets in the nycflights13 and ggplot2movies packages. Chapters 8 and 9 are in development, but the plan is for an introduction to the broom package to also be given there. Lastly, there will be expanded Learning Checks throughout the book and Review Questions at the end of each chapter to help the reader better check their understanding of the material. This book is available at http://ismayc.github.io/moderndiver-book with source code available here.

Feel free to email me or tweet to me on Twitter @old_man_chester.

Posted in R, Reproducible research | Leave a comment

Updated R Markdown thesis template

In October of 2015, I released an R Markdown senior thesis template R package and discussed it in the blog post here. It was well received by the students and faculty who worked with it, and this past summer I worked on updating it to make it even nicer for students. The big additions are the ability for students to export their senior thesis to a webpage (example here) and to label and cross-reference figures and tables more easily. These additions and future revisions will be in the new thesisdown package, in the spirit of the bookdown package developed and released by RStudio in summer 2016.

I encourage you to look over my blog post from last year to get an idea of why R Markdown is such a friendly environment to work in. Markdown specifically allows for typesetting of the finished document to be transparent inside the actual document. Down the road, it is my hope that students will be able to write a single set of R Markdown files that can then be exported to many formats. These currently include the LaTeX format to produce a PDF following Reed's senior thesis guidelines and the HTML version in gitbook style. Eventually, this will include a Word document following Reed's guidelines and also an ePub (electronic book) version. These last two are available at the moment but are not fully functional.

By allowing senior theses in a variety of formats, seniors will be more easily able to display their work to potential employers, other students, faculty members, and potential graduate schools. This will allow them to get the word out about their studies and research while still encouraging reproducibility in their computations and in their analyses.

Install the template generating package

To check out the package yourself, make sure you have RStudio and LaTeX installed and then direct your browser to the GitHub page for the template: http://github.com/ismayc/thesisdown. The README.md file near the bottom of the page below the files gives directions on installing the template package and getting the template running. As you see there, you'll want to install the thesisdown package via the following commands in the RStudio console:

install.packages("devtools")
devtools::install_github("ismayc/thesisdown")

If you have any questions, feedback, or would like to report any issues, please email me.

(The generating R Markdown file for this HTML document—saved in the .Rmd extension—is available here.)

Posted in R, Reproducible research | 11 Comments

R Markdown senior thesis template

“Science is reportedly in the middle of a reproducibility crisis.” This is the claim of quite a few these days including an article from ROpenSci which directly references another article by The Conversation. But what is “reproducible research” and how can statistical tools be used to help facilitate it?

I agree with Roger Peng and the folks behind Coursera's massively popular Data Science Specialization and the front page of their Reproducible Research course on their definition of “reproducible research”:

Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows for people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually conducted the analysis are available.

One of the current goals at Reed is for students to be engaged in the research process earlier on in their academic careers in order to make their senior thesis experience more meaningful and rewarding. Collecting and analyzing data has been a challenging part of this process for students in the past. Additionally, updating statistical analyses, plots, and bibliographies as advisors request changes has sometimes been time-consuming and frustrating.

So how does this tie into the reproducible research concept? As more and more students use R Markdown while taking courses at Reed (Chem 101/102, Mathematics 141, Mathematics 243, and Bio 101/102 next year!), it became clear that an option to use R Markdown while generating the senior thesis document should be available. The simplicity of Markdown commands improves readability and documentation, making it a great thesis-writing environment. R Markdown provides a wonderful environment to publish data and software code along with text and commentary and, in my opinion, is the best software currently available for writing journal articles, homework assignments, AND senior theses: reproducible senior theses!

What I'll be discussing in this blog post is a template I've created using R Markdown for Reed College senior theses. This template inherits many of the features of the current LaTeX template (in fact it directly calls this template), so you can expect many of the great things from that, including automatic creation of figure and table numbers, a table of contents, easy addition of beautiful plots and graphics, and the use of bibliography style files. (Do you really want to have to memorize what APA or MLA looks like?…No!) In addition, you won't need to learn LaTeX. I've done the hard work of getting all that set up. You only need to learn Markdown. (Don't worry. That's really not hard!)

Markdown

Markdown allows you to write in an easy-to-read, easy-to-write plain text format, which then converts to a variety of different formats. R Markdown, which is really just Markdown with the ability to add in R code and output directly, allows you to produce HTML, PDF, and even Word documents based on only one Markdown file! The sky is the limit for Markdown!

The thesis template will automatically convert the Markdown code you type to LaTeX code which will then produce a PDF document. Again, I've done the hard work of getting this Markdown code to produce the correct output. If you'd like to check out the current version of this PDF, it is available on Google Drive here.

Now, back to Markdown basics. You can find a lot more information on Markdown here and an interactive tutorial here. You'll get used to it before you know it. Probably in less than an hour!

Install the template generating package

After you've stepped through the basics of Markdown, it's time to check out this template for yourself. Make sure you have RStudio and LaTeX installed and then direct your browser to the GitHub page for the template: http://github.com/ismayc/rticles. You'll see instructions there in the README.md file near the bottom of the page on installing the template package and getting the template running. As you see there, you'll want to install the rticles package via the following commands in the RStudio console:

install.packages("devtools")
devtools::install_github("ismayc/rticles")

This allows for Reed Senior Thesis to be an option when you create a new R Markdown file via the File -> New File -> R Markdown dialog. You can specify the name you would like to give the file and also the location where you'd like the template and its files to be stored. I've called it “MyReedThesis” in what follows. After hitting the OK button, you'll see an Rmd file load up onto your RStudio window. (It will be called the same thing as what you specified in the previous step.)

If you click on the folder you've just created in the Files tab in RStudio, you can see a lot of files that have been created to assist you. You will see Rmd files for the abstract, bibliography, different chapters, conclusion, and your main Rmd file (MyReedThesis.Rmd for me). You also see folders that hold your bibliography database files (bib), bibliography style files (csl), data files (data), and images (figure). You'll find that I've preloaded a bibliography database file (thesis.bib), an American Psychological Association (APA) style file (apa.csl) from the Zotero Style Repository, a dataset derived from departing flights from Seattle and Portland in 2014 (flights.csv), and a couple figures in the figure folder.

If you click on the Knit button near the top of the RStudio window on your main Rmd file, a PDF linking together all of the files will be created. I've tried to show you how to do a variety of things in this Markdown template including

In the Introduction

  • Creating different headers including a chapter using the # syntax, a section using ##, a subsection using ###, etc.
  • Bolding of text by surrounding the text in ** or __
  • Adding a comment using the <!-- and --> syntax

In Chapter 1

  • Easily creating numbered or non-numbered lists
  • Using whitespace to create a new paragraph in your output
  • Including R code in chunks to be added into your document
  • Specifying different chunk options to tweak the output
  • Using inline R code for calculations and directly referring to R results in the text of your document
  • Including plots created in R
  • Creating tables based on data stored in R
  • Creating hyperlinks to web resources

In Chapter 2

  • Including mathematical equations
  • Using LaTeX to create chemical reaction equations
  • Adding other discipline-specific content to your thesis

In Chapter 3

  • Creating tables using Pandoc
  • Labelling and referencing tables and figures (both created in R and stored in files) using custom R functions I created
  • Adding footnotes into your document
  • Developing bibliography database files using Zotero (highly recommended) or other programs like BibDesk

In the Conclusion

  • Making a few tweaks to how chapters are displayed
  • Creating appendices if desired

Lastly, in the bibliography.Rmd file, you'll find ways to currently fix the hanging indent issue with some citation styles. (You might need to delete a few lines if your style doesn't require hanging indents.) I also provide a way to cite sources in your bibliography at the end of the document that you don't directly cite throughout your thesis here.
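
Several of the features listed above (headers, bold text, comments, a chunk with an option, and inline R code) can be seen together in one tiny R Markdown file. The sketch below is not part of the thesis template: the file name demo.Rmd, the chunk label, and the use of the built-in cars data set are all assumptions chosen for illustration. It writes the file to a temporary folder and knits it only if rmarkdown and pandoc happen to be available.

```r
# Three backticks, built with strrep() so this listing stays easy to read
fence <- strrep("`", 3)

# Lines of a small, hypothetical R Markdown document
rmd_lines <- c(
  "---",
  "title: \"Demo\"",
  "output: html_document",
  "---",
  "",
  "# A chapter-level header",
  "",
  "## A section header",
  "",
  "Some **bold** text.",
  "",
  "<!-- This comment will not appear in the output -->",
  "",
  paste0(fence, "{r cars-summary, echo = FALSE}"),  # chunk option hides the code
  "summary(cars$speed)",
  fence,
  "",
  "The cars data set has `r nrow(cars)` rows."      # inline R code
)

# Write the document to a temporary location
rmd_file <- file.path(tempdir(), "demo.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit only when rmarkdown and pandoc are both available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

Opening the written file in RStudio and clicking Knit should give the same result as the guarded render call at the end.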

The main driver R Markdown file

If you go back to the main Rmd file that is open, you'll see a lot of text that is commented out. This provides more information about what is happening in this file. In summary, this file links all of the chapter files together and creates a way to input the preliminaries (the abstract, the preface, etc.) as either inline text at the top of the document or in a file stored like abstract.Rmd. You'll also change your name, your advisor's name, and other metadata here. One last thing: lot stands for “List of Tables” and lof stands for “List of Figures.” These are automatically created when you include a table or a figure in your thesis.

Currently, everything links throughout the document as well. When you make a reference to Figure 3.1 in Chapter 2, you can click on that link in the PDF and it will go directly to that figure. Chapters and sections are also linked in this same fashion.

There will be slight modifications (hopefully only slight if any at all!) occurring to this document as students and faculty test it throughout this year. I highly recommend running devtools::install_github("ismayc/rticles") and creating a new template from time to time to make sure you have the latest version of the template. You can easily copy your files into the new template as needed. (Some of the changes may occur to LaTeX files that you won't directly access.)

I'm hopeful that this template will be of great use to seniors throughout Reed College this year and in years to come. Science majors may get the most immediate benefit, but I believe that Markdown is an incredibly valuable language to learn. It's easy and extremely flexible in the type of output it can produce. I've written this template to be as user-friendly as possible and I hope that non-science majors will consider using it! If you have any questions, feedback, or would like to report any issues, please email me.

(The generating R Markdown file for this HTML document—saved in the .Rmd extension—is available here.)

Posted in R, Reproducible research | Leave a comment

Creating nice tables using R Markdown

One of the neat tools available via a variety of packages in R is the creation of beautiful tables using data frames stored in R. In what follows, I’ll discuss these different options using data on departing flights from Seattle and Portland in 2014. (More information and the source code for this R package are available at https://github.com/ismayc/pnwflights14.)

We begin by ensuring the needed packages are installed and then load them into our R session.

# List of packages required for this analysis
pkg <- c("dplyr", "knitr", "devtools", "DT", "xtable")

# Find which of these packages are not yet installed and store
# their names in new.pkg (the row names of installed.packages()
# are the names of the installed packages)
new.pkg <- pkg[!(pkg %in% rownames(installed.packages()))]

# If there are any packages in the list that aren't installed,
# install them
if (length(new.pkg)) {
  install.packages(new.pkg, repos = "http://cran.rstudio.com")
}

# Load the packages into R
library(dplyr)
library(knitr)
library(DT)
library(xtable)

# Install Chester's pnwflights14 package (if not already)
if (!require(pnwflights14)) {
  library(devtools)
  devtools::install_github("ismayc/pnwflights14")
}
library(pnwflights14)

# Load the flights dataset
data("flights", package = "pnwflights14")
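
The install-if-missing step can also be wrapped in a small function so that it is easy to reuse and to check. This is only a sketch in base R: the function name missing_packages is hypothetical, and the second package name in the example call is made up on purpose so that it will not be found.

```r
# Hypothetical helper: return the subset of `pkgs` that is not yet installed
missing_packages <- function(pkgs) {
  # installed.packages() returns a matrix with one row per package;
  # its row names are the package names
  installed <- rownames(installed.packages())
  pkgs[!(pkgs %in% installed)]
}

# "stats" ships with R; the second name is invented and should not exist
missing_packages(c("stats", "notARealPackage123"))
```

The result could then be passed to install.packages() exactly as new.pkg is above.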

The dataset provides for the development of a lot of interesting questions. Here I will delve further into some of the questions I addressed in two recent workshops I led in the Fall 2015 Data @ Reed Research Skills Workshop Series. (Slides available at http://rpubs.com/cismay.)

The questions I will analyze by creating tables are

  1. Which destinations had the worst arrival delays (on average) from the two PNW airports?
  2. How does the maximum departure delay vary by month for each of the two airports?
  3. How many flights departed for each airline from each of the airports?

The kable function in the knitr package

To address the first question, we will use the dplyr package written by Hadley Wickham as below. We’ll use the top_n function to isolate the 5 worst mean arrival delays.

worst_arr_delays <- flights %>% group_by(dest) %>%
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_arr_delay)) %>%
  top_n(n = 5, wt = mean_arr_delay)

This information is helpful but you may not necessarily know to which airport each of these FAA airport codes refers. One of the other data sets included in the pnwflights14 package is airports that lists the names. Here we will do a match to identify the names of these airports using the inner_join function in dplyr.

data("airports", package = "pnwflights14")
joined_worst <- inner_join(worst_arr_delays, airports, by = c("dest" = "faa")) %>%
  select(name, dest, mean_arr_delay) %>%
  rename("Airport Name" = name, "Airport Code" = dest, "Mean Arrival Delay" = mean_arr_delay)
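
To see what an inner join keeps and what it drops, here is a toy version of the same match. The two small data frames are assumptions for illustration (the code ZZZ is invented so that one row has no match); only inner_join and anti_join from dplyr are used.

```r
library(dplyr)

# Toy delay summary; "ZZZ" is a made-up code with no entry in the lookup table
delays <- data.frame(
  dest = c("CLE", "HOU", "ZZZ"),
  mean_arr_delay = c(26.2, 10.3, 5.0),
  stringsAsFactors = FALSE
)

# Toy lookup table from FAA code to airport name
lookup <- data.frame(
  faa = c("CLE", "HOU", "OAK"),
  name = c("Cleveland Hopkins Intl", "William P Hobby",
           "Metropolitan Oakland Intl"),
  stringsAsFactors = FALSE
)

# inner_join keeps only the rows of `delays` whose code appears in `lookup`
matched <- inner_join(delays, lookup, by = c("dest" = "faa"))

# anti_join shows the rows an inner join would silently drop
unmatched <- anti_join(delays, lookup, by = c("dest" = "faa"))

matched
unmatched
```

Checking the anti_join result is a quick way to confirm that no destination disappears from the table just because its code is missing from the airports data.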

Lastly we output this table cleanly using the kable function.

kable(joined_worst)
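
| Airport Name              | Airport Code | Mean Arrival Delay |
|:--------------------------|:-------------|-------------------:|
| Cleveland Hopkins Intl    | CLE          |          26.150000 |
| William P Hobby           | HOU          |          10.250000 |
| Metropolitan Oakland Intl | OAK          |          10.067460 |
| San Francisco Intl        | SFO          |           8.864937 |
| Bellingham Intl           | BLI          |           8.673913 |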

Oddly enough, flights to Cleveland (from PDX and SEA) had the worst arrival delays in 2014. Houston also had around a 10 minute delay on average. Surprisingly, the airport in Bellingham, WA (only around 100 miles north of SEA) had the fifth largest mean arrival delay.
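
The mean delays above are printed with many decimal places. kable accepts a digits argument for rounding and a col.names argument for display names, so the renaming and rounding can also happen at print time. The sketch below uses a two-row toy data frame (values copied from the table above) rather than the full joined_worst object.

```r
library(knitr)

# Two rows copied from the table above, with plain column names
toy_delays <- data.frame(
  name = c("Cleveland Hopkins Intl", "William P Hobby"),
  dest = c("CLE", "HOU"),
  mean_arr_delay = c(26.15, 10.25)
)

# Round numeric columns to two digits and set the header labels when printing
tidy_table <- kable(
  toy_delays,
  format = "markdown",
  digits = 2,
  col.names = c("Airport Name", "Airport Code", "Mean Arrival Delay")
)

print(tidy_table)
```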

The DT package

In order to answer the second question, we’ll again make use of the various functions in the dplyr package.

dep_delays_by_month <- flights %>% group_by(origin, month) %>%
  summarize(max_delay = max(dep_delay, na.rm = TRUE))

The DT package provides a nice interface for viewing data frames in R. I’ve specified a few extra options here to show all 12 months by default and to automatically set the width. Go ahead and play around with the filter boxes at the top of each column too. (An excellent tutorial on DT is available at https://rstudio.github.io/DT/.)

datatable(dep_delays_by_month,
          filter = 'top', options = list(
            pageLength = 12, autoWidth = TRUE
          ))

The created table in HTML is available here.

If you click on the max_delay column header, you should see that the maximum departure delay for PDX was in March and for Seattle was in May.
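
If you would like to share an interactive table as its own HTML page, one option is the saveWidget function in the htmlwidgets package. This is a sketch under assumptions and not necessarily how the page linked above was produced: the delay values are made up, and the output file name is arbitrary.

```r
library(DT)
library(htmlwidgets)

# Toy data with made-up maximum delays for three months at each airport
toy_delays <- data.frame(
  origin = rep(c("PDX", "SEA"), each = 3),
  month = rep(1:3, times = 2),
  max_delay = c(100, 200, 300, 150, 250, 350)
)

# Same options as in the datatable call above
widget <- datatable(toy_delays,
                    filter = "top",
                    options = list(pageLength = 12, autoWidth = TRUE))

# Save to a temporary folder; selfcontained = FALSE avoids needing pandoc
# and places the supporting JavaScript and CSS files next to the HTML file
out_file <- file.path(tempdir(), "dep_delays.html")
saveWidget(widget, file = out_file, selfcontained = FALSE)

file.exists(out_file)
```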

The xtable package to produce nice tables in a PDF

Again, we find ourselves using the extremely helpful dplyr package, this time to answer the third question and to create the underpinnings of our table to display. We merge the flights data with the airlines data to get the names of the airlines from the two-letter carrier code.

data("airlines", package = "pnwflights14")
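by_airline <- flights %>% group_by(origin, carrier) %>%
  summarize(count = n()) %>%
  inner_join(x = ., y = airlines, by = "carrier") %>%
  # Sort by airport first, then by count within each airport;
  # recent versions of dplyr ignore grouping in arrange() by default
  arrange(origin, desc(count))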

The xtable package and its xtable function (and also the kable function you saw earlier) provide the functionality to generate HTML code or LaTeX code to produce a table. We will focus on producing the LaTeX code in this example.

print(xtable(by_airline),
      comment = FALSE)

\begin{table}[ht]
\centering
\begin{tabular}{rllrl}
\hline
 & origin & carrier & count & name \\
\hline
1 & PDX & AS & 12844 & Alaska Airlines Inc. \\
2 & PDX & WN & 11193 & Southwest Airlines Co. \\
3 & PDX & OO & 9841 & SkyWest Airlines Inc. \\
4 & PDX & UA & 6061 & United Air Lines Inc. \\
5 & PDX & DL & 5168 & Delta Air Lines Inc. \\
6 & PDX & US & 2361 & US Airways Inc. \\
7 & PDX & AA & 2187 & American Airlines Inc. \\
8 & PDX & F9 & 1362 & Frontier Airlines Inc. \\
9 & PDX & B6 & 1287 & JetBlue Airways \\
10 & PDX & VX & 666 & Virgin America \\
11 & PDX & HA & 365 & Hawaiian Airlines Inc. \\
12 & SEA & AS & 49616 & Alaska Airlines Inc. \\
13 & SEA & WN & 12162 & Southwest Airlines Co. \\
14 & SEA & DL & 11548 & Delta Air Lines Inc. \\
15 & SEA & UA & 10610 & United Air Lines Inc. \\
16 & SEA & OO & 8869 & SkyWest Airlines Inc. \\
17 & SEA & AA & 5399 & American Airlines Inc. \\
18 & SEA & US & 3585 & US Airways Inc. \\
19 & SEA & VX & 2606 & Virgin America \\
20 & SEA & B6 & 2253 & JetBlue Airways \\
21 & SEA & F9 & 1336 & Frontier Airlines Inc. \\
22 & SEA & HA & 730 & Hawaiian Airlines Inc. \\
\hline
\end{tabular}
\end{table}
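
The same xtable object can be printed as HTML instead of LaTeX by setting the type argument of its print method. The sketch below uses a two-row toy data frame (counts copied from the output above) and captures the generated lines so they can be inspected or written to a file.

```r
library(xtable)

# Two rows copied from the table above
toy_counts <- data.frame(
  origin = c("PDX", "SEA"),
  carrier = c("AS", "AS"),
  count = c(12844L, 49616L)
)

# type = "html" switches the output from LaTeX to an HTML <table>
html_lines <- capture.output(
  print(xtable(toy_counts),
        type = "html",
        comment = FALSE,
        include.rownames = FALSE)
)

html_lines
```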

If you don’t know LaTeX, I’ve also duplicated a similar table using kable for you to compare:

kable(by_airline)

| origin | carrier | count | name                   |
|:-------|:--------|------:|:-----------------------|
| PDX    | AS      | 12844 | Alaska Airlines Inc.   |
| PDX    | WN      | 11193 | Southwest Airlines Co. |
| PDX    | OO      |  9841 | SkyWest Airlines Inc.  |
| PDX    | UA      |  6061 | United Air Lines Inc.  |
| PDX    | DL      |  5168 | Delta Air Lines Inc.   |
| PDX    | US      |  2361 | US Airways Inc.        |
| PDX    | AA      |  2187 | American Airlines Inc. |
| PDX    | F9      |  1362 | Frontier Airlines Inc. |
| PDX    | B6      |  1287 | JetBlue Airways        |
| PDX    | VX      |   666 | Virgin America         |
| PDX    | HA      |   365 | Hawaiian Airlines Inc. |
| SEA    | AS      | 49616 | Alaska Airlines Inc.   |
| SEA    | WN      | 12162 | Southwest Airlines Co. |
| SEA    | DL      | 11548 | Delta Air Lines Inc.   |
| SEA    | UA      | 10610 | United Air Lines Inc.  |
| SEA    | OO      |  8869 | SkyWest Airlines Inc.  |
| SEA    | AA      |  5399 | American Airlines Inc. |
| SEA    | US      |  3585 | US Airways Inc.        |
| SEA    | VX      |  2606 | Virgin America         |
| SEA    | B6      |  2253 | JetBlue Airways        |
| SEA    | F9      |  1336 | Frontier Airlines Inc. |
| SEA    | HA      |   730 | Hawaiian Airlines Inc. |

With the originating airport duplicating across all of the airlines, it would be nice if we could reduce this duplication and just bold PDX or SEA and have each appear once. Awesomely enough, the rle function in R will be of great help to us in this endeavor. It computes the lengths of runs of consecutive identical values in a vector, which tells us how many rows each airport spans. We will then make a call to the multirow command in LaTeX in a sneaky way, by pasting the appropriate text into the first row for each airport and passing force as the sanitize.text.function so that xtable leaves those LaTeX commands unescaped.
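
Before applying this to the flights table, it may help to see the two pieces on a toy vector. This sketch uses only base R; the vector origins is made up, and paste0 is used in place of paste simply to avoid extra spaces inside the braces.

```r
# Toy column of airport codes, already sorted so repeats are consecutive
origins <- c("PDX", "PDX", "PDX", "SEA", "SEA")

# rle() reports each run of identical consecutive values and its length
runs <- rle(origins)
runs$values   # "PDX" "SEA"
runs$lengths  # 3 2

# Mark the first row of each airport, blank out the rest, and put a
# LaTeX \multirow command (spanning the run length) in the first row
first <- !duplicated(origins)
labels <- rep("", length(origins))
labels[first] <- paste0("\\multirow{", runs$lengths,
                        "}{*}{\\textbf{", runs$values, "}}")

# cat() shows the text as LaTeX will see it, with single backslashes
cat(labels[first], sep = "\n")
```

Note that rle only counts consecutive repeats, so this relies on the rows being sorted by airport first.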

We add in a few options to make the output of the table a little nicer by specifying horizontal lines and removing the default rownames.

rle.lengths <- rle(by_airline$origin)$lengths
first <- !duplicated(by_airline$origin)
by_airline$origin[!first] <- ""
by_airline$origin[first] <- paste("\\multirow{", 
                                  rle.lengths,
                                  "}{*}{\\textbf{",
                                  by_airline$origin[first], "}}")

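# Draw rules above and below the header, after the last PDX row, and at
# the end of the table; rle.lengths[1] is the number of PDX rows.
# The multirow command requires \usepackage{multirow} in the LaTeX preamble.
print(xtable(by_airline),
      comment = FALSE,
      hline.after = c(-1, 0, rle.lengths[1], nrow(by_airline)),
      sanitize.text.function = force,
      include.rownames = FALSE)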

The resulting table produced by LaTeX can be found at Overleaf.com at https://www.overleaf.com/read/wvrpxpwrbvnk.

We see that Alaska Airlines had the most flights out of both airports with Southwest coming in second at both airports.

(The generating R Markdown file for this HTML document—saved in the .Rmd extension—is available here.)

Posted in R, Tables | Leave a comment