2 Introduction

2.1 R and RStudio

2.1.1 What is R?

R

  • is a free, open source program for statistical computing and data visualization.
  • is cross-platform (e.g., available on Windows, Mac OS, and Linux).
  • is maintained and regularly updated by the R Core Team, and distributed through the Comprehensive R Archive Network (CRAN).
  • is capable of running all types of statistical analyses.
  • has amazing visualization capabilities (high-quality, customizable figures).
  • enables reproducible research.
  • has many other capabilities, such as web programming.
  • supports user-created packages (currently, more than 10,000).

2.1.2 What is RStudio?

RStudio

  • is a free program available to control R.
  • provides a more user-friendly interface for R.
  • includes a set of tools to help you be more productive with R, such as:
    • A syntax-highlighting editor for your R code
    • Auto-completion to help you type R code
    • A variety of tools for creating and saving various plots (e.g., histograms, scatterplots)
    • A workspace management tool for importing or exporting data

2.1.3 Download and Install

To benefit from RStudio, both R and RStudio should be installed on your computer. Both are freely available from the following websites:

To download and install R:

  1. Go to https://cran.r-project.org/
  2. Click “Download R for Mac/Windows”
  3. Download the appropriate file:
    • Windows users click Base, and download the installer for the latest R version
    • Mac users select the file R-3.X.X.pkg that aligns with your OS version
  4. Follow the instructions of the installer.

To download and install RStudio:

  1. Go to https://www.rstudio.com/products/rstudio/download/
  2. Click “Download” under RStudio Desktop - Open Source License
  3. Select the install file for your OS
  4. Follow the instructions of the installer.
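
Once both programs are installed, a quick way to confirm that RStudio has found R is to ask R for its version in the console. This is only a minimal check; the version threshold below is an arbitrary value chosen for illustration.

```r
# Print the version of R that is currently running
R.version.string

# Compare the running version against a minimum version (threshold is illustrative)
getRversion() >= "3.0.0"
```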

2.1.4 Preview of RStudio

After you open RStudio, you should see the following screen:

Opening screen of RStudio


I personally prefer console on the top-left, source on the top-right, files on the bottom-left, and environment on the bottom-right. The pane layout can be updated using Global Options under Tools.

Step 1: Click Tools and then select Global Options


Step 2: Select console, source, environment, or files for each pane


We can also change the appearance (e.g., code highlighting, font type, font size):

Step 3: Change the Appearance Settings, as you wish


Note: To become more familiar with RStudio, I recommend checking out the RStudio cheatsheet and Oscar Torres-Reyna’s nice tutorial. You can click on these links to open and download the documents, or see https://github.com/okanbulut/rbook/tree/master/cheatsheets.

2.1.5 Creating a New Script

In R, we can type our commands in the console, but once we close R, everything we have typed will be gone. Therefore, we should create an empty script, write the code in the script, and save it for future use. We can replicate the exact same analysis and results by running the script again later. The R script file has the .R extension, but it is essentially a plain text file. Thus, any plain text editor (e.g., Notepad, TextPad) can be used to open a script file for editing outside of the R environment.

We can create a new script file in R as follows:

Creating a new script in R (using RStudio)


When we type some code in the script, we can select the lines we want to run and then hit the run button. Alternatively, we can place the cursor at the beginning of a line and hit the run button, which runs that line and moves the cursor to the next one.

Running R codes from a script file
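
A saved script can also be run in one go with the source() function, which is what makes an analysis easy to replicate later. The sketch below writes a tiny script to a temporary folder and then runs it. The file name my_first_script.R and the object names inside it are made up for this illustration.

```r
# Write a tiny script to a temporary file (hypothetical name), then run it
script_path <- file.path(tempdir(), "my_first_script.R")
writeLines(c(
  "# My first script: average of three scores",
  "scores <- c(80, 75, 50)",
  "average <- mean(scores)"
), script_path)

source(script_path)   # runs every line of the script, top to bottom
average               # the object created by the script is now available
```

Running source() on the same file again reproduces the same result, which is the main reason to keep code in a script rather than typing it into the console.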

2.1.6 Working Directory

An important concept in R is the “working directory,” which refers to the folder on your computer where you keep your R scripts, your data files, etc. Once we define a working directory in R, any data file or script within that directory can be imported into R by its file name alone, without specifying where the file is located. By default, R chooses a particular location on your computer (typically Desktop or Documents) as your working directory. To see the current working directory, we run the getwd() command in the R console:

getwd()

This will return a path like this:

## [1] "C:/Users/bulut/Desktop"

If we want to change the current working directory to a different location, we can do it in two ways:

Method 1: Using the “Session” options menu in RStudio

We can select Session > Set Working Directory > Choose Directory to find a folder or location that we want to set as our current working directory.


Method 1: Setting the working directory in R


Method 2: Using the setwd command in the console

Typing the following code in the console will set the “R workshop” folder on my desktop as the working directory. If the folder path is correct, R changes the working directory without giving any error messages in the console.

setwd("C:/Users/bulut/Desktop/R workshop")

To ensure that the working directory is properly set, we can use the getwd() command again:

getwd()
## [1] "C:/Users/bulut/Desktop/R workshop"
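
To see why the working directory matters, the following sketch creates a small data file in a temporary folder, sets that folder as the working directory, and reads the file back using only its name. The folder name wd_demo and the file name scores.csv are assumptions made up for this example.

```r
old_wd <- getwd()                              # remember the current working directory
demo_dir <- file.path(tempdir(), "wd_demo")    # hypothetical folder for this demo
dir.create(demo_dir, showWarnings = FALSE)

setwd(demo_dir)

# Save a small data set, then read it back by file name alone (no full path needed)
write.csv(data.frame(id = 1:3, score = c(80, 75, 50)), "scores.csv", row.names = FALSE)
scores <- read.csv("scores.csv")
scores

setwd(old_wd)                                  # restore the original working directory
```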

IMPORTANT: R does not accept single backslashes in a file path, because the backslash is the escape character in R strings. Instead of a backslash, we need to use a forward slash (or a doubled backslash, \\). This is particularly important for Windows computers since Windows file paths use backslashes (Mac OS X doesn’t have this problem).
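
As a small illustration of the path rule, the three strings below describe the same folder. The sketch only builds and compares the strings, so it runs even if the folder does not exist on your computer.

```r
# Forward slashes work on every operating system
p1 <- "C:/Users/bulut/Desktop/R workshop"

# A backslash must be doubled inside an R string
p2 <- "C:\\Users\\bulut\\Desktop\\R workshop"

# file.path() builds the path for us, using forward slashes
p3 <- file.path("C:", "Users", "bulut", "Desktop", "R workshop")

identical(p1, p3)
## [1] TRUE

# Converting the doubled-backslash version gives the same path
identical(gsub("\\", "/", p2, fixed = TRUE), p1)
## [1] TRUE
```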

2.1.7 Downloading and Installing R Packages

The base R program comes with many built-in functions to compute a variety of statistics and to create graphics (e.g., histograms, scatterplots). However, what makes R more powerful than other software programs is that R users can write their own functions, put them in a package, and share it with other R users via the CRAN website.

For example, ggplot2 (Wickham et al. 2020) is a well-known R package, created by Hadley Wickham and Winston Chang. This package allows R users to create elegant data visualizations. To download and install the ggplot2 package, we need to use the install.packages command. Note that your computer has to be connected to the internet to be able to connect to the CRAN website and download the package.

install.packages("ggplot2")

Once a package is downloaded and installed, it remains in your R folder. That is, there is no need to re-install it unless you remove the package or install a new version of R. However, installed packages are not directly accessible until we activate them in the current R session. Whenever we need to access a package in R, we use the library command to activate it. For example, to access the ggplot2 package, we would use:

library("ggplot2")
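
Because a package only needs to be installed once, it can be handy to check whether it is already available before installing it again. The sketch below uses requireNamespace() for that check. The helper name is_installed is made up for this example and is not part of R.

```r
# Hypothetical helper (not part of base R): TRUE if a package is installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # TRUE: the stats package ships with every R installation

# Activate ggplot2 if it is installed; otherwise, remind ourselves to install it
if (is_installed("ggplot2")) {
  library("ggplot2")
} else {
  message("ggplot2 is missing; run install.packages(\"ggplot2\") and try again.")
}
```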

To get help on installed packages (e.g., what’s inside this package):

# To get details regarding contents of a package
help(package = "ggplot2")

# To list vignettes available for a specific package
vignette(package = "ggplot2")

# To view specific vignette
vignette("ggplot2-specs")

2.1.8 Exercise 1

  1. Open RStudio and set the folder that contains the training materials (it’s called rtraining) as your working directory, using either the setwd command or the Session options menu in RStudio.

  2. Open the R script file called ttc-r-course in the rtraining folder. You can open the script by either double-clicking on the file (so RStudio opens it automatically) or using “File” and “Open file” in RStudio.

  3. Install and activate the lattice package using the install.packages and library commands. The lattice package (Sarkar 2020) is another well-known package for data visualization in R. You should type the following in your script file, select all the lines, and hit the run button.

install.packages("lattice")
library("lattice")

2.2 Basics of the R Language

2.2.1 Creating New Variables

To create a new variable in R, we use the assignment operator, <-. To create a variable x that equals 25, we need to type:

x <- 25

If we want to print x, we just type x in the console and hit enter. R returns the value assigned to x.

x
## [1] 25

We can also create a variable that holds multiple values in it, using the c command (c stands for combine).

weight <- c(60, 72, 80, 84, 56)
weight
## [1] 60 72 80 84 56
height <- c(1.7, 1.75, 1.8, 1.9, 1.6)
height
## [1] 1.70 1.75 1.80 1.90 1.60

Once we create a variable, we can do further calculations with it. Let’s say we want to transform the weight variable (in kg) to a new variable called weight2 (in lbs).

weight2 <- weight * 2.20462
weight2
## [1] 132.3 158.7 176.4 185.2 123.5
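
The values printed above are shown to one decimal place. With R's default settings, weight2 is printed with more digits, so your console may look slightly different. A small sketch, assuming you want the shorter display, uses the round() function:

```r
weight <- c(60, 72, 80, 84, 56)     # weights in kg, as defined earlier
weight2 <- weight * 2.20462         # weights in lbs

weight2                             # printed with R's default number of digits
round(weight2, 1)                   # rounded to one decimal place
## [1] 132.3 158.7 176.4 185.2 123.5
```

Note that round() only changes what is returned here; weight2 itself still holds the unrounded values unless we assign the rounded result back to it.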

Note that we named the variable as weight2. So, both weight and weight2 exist in the active R session now. If we used the following, this would overwrite the existing weight variable.

weight <- weight * 2.20462

We can also define a new variable based on existing variables.

reading <- c(80, 75, 50, 44, 65)
math <- c(90, 65, 60, 38, 70)
total <- reading + math
total
## [1] 170 140 110  82 135
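
The same element-by-element arithmetic works with other operators. As a further sketch, we can combine the weight and height values from earlier into a body mass index. This assumes that weight is in kg and height is in metres; the variable name bmi is chosen here only for illustration.

```r
weight <- c(60, 72, 80, 84, 56)          # kg (original values, before any conversion)
height <- c(1.7, 1.75, 1.8, 1.9, 1.6)    # assumed to be in metres

# Each person's weight is divided by that person's squared height
bmi <- weight / height^2
round(bmi, 1)
## [1] 20.8 23.5 24.7 23.3 21.9
```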

Sometimes we need a variable that holds character strings rather than numerical values. If a value is not numerical, we need to use double quotation marks. In the example below, we create a new variable called cities that has four city names in it. Each city name is written with double quotation marks.

cities <- c("Edmonton", "Calgary", "Red Deer", "Spruce Grove")
cities
## [1] "Edmonton"     "Calgary"      "Red Deer"     "Spruce Grove"

We can also treat numerical values as character strings. For example, assume that we have a gender variable where 1=Male and 2=Female. We want R to know that these values are not actual numbers; instead, they are just numerical labels for gender groups.

gender <- c("1", "2", "2", "1", "2")
gender
## [1] "1" "2" "2" "1" "2"
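
We can check how R stores a variable with the class() function. The sketch below also shows one optional way to attach the Male and Female labels to the codes using factor(), which goes slightly beyond this section. The name gender_f is made up for this example.

```r
gender <- c("1", "2", "2", "1", "2")

class(gender)
## [1] "character"

# Optional: attach readable labels to the codes (1 = Male, 2 = Female)
gender_f <- factor(gender, levels = c("1", "2"), labels = c("Male", "Female"))
gender_f
table(gender_f)    # counts per group: 2 Male, 3 Female
```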

2.2.2 Important Rules for the R Language

Here is a list of important rules for using the R language more effectively:

  1. Case-sensitivity: R is case-sensitive. A name written in lowercase does NOT refer to the same name written in uppercase.
cities <- c("Edmonton", "Calgary", "Red Deer", "Spruce Grove")
Cities
CITIES

Error: object 'Cities' not found
Error: object 'CITIES' not found
  2. Variable names: A variable name cannot begin with a number or include a space.
4cities <- c("Edmonton", "Calgary", "Red Deer", "Spruce Grove")
my cities <- c("Edmonton", "Calgary", "Red Deer", "Spruce Grove")

Error: unexpected symbol in "4cities"
Error: unexpected symbol in "my cities"
  3. Naming conventions: I recommend using consistent and clear naming conventions to keep the code clear and organized. I personally prefer all lowercase with underscores (e.g., my_variable). Common naming conventions include:
  • All lowercase: e.g. mycities
  • Period.separated: e.g. my.cities
  • Underscore_separated: e.g. my_cities
  • Numbers at the end: e.g. mycities2018
  • Combination of some of these rules: my.cities.2018
  4. Commenting: The hashtag symbol (#) is used for commenting in R. Any words, code, etc. coming after a hashtag on the same line are ignored. I strongly recommend using comments throughout your code. These annotations will remind you what you did in the code and why you did it that way. You can also easily comment out a line without having to remove it from your code.
# Here I define four cities in Alberta
cities <- c("Edmonton", "Calgary", "Red Deer", "Spruce Grove")
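
The rules above can be checked directly in the console. The sketch below uses exists() to illustrate case-sensitivity and make.names() to show how R itself would repair the two invalid names from rule 2. The object name fixed_names is made up for this example.

```r
cities <- c("Edmonton", "Calgary", "Red Deer", "Spruce Grove")

# Case-sensitivity: only the exact spelling refers to the object
exists("cities")   # TRUE
exists("Cities")   # FALSE, unless you have created an object called Cities

# make.names() converts invalid names into valid ones
fixed_names <- make.names(c("4cities", "my cities", "my_cities"))
fixed_names
## [1] "X4cities"  "my.cities" "my_cities"

# Anything after a # on the same line is ignored by R
length(cities)  # this trailing comment does not affect the result
```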

2.2.3 Self-Help

In the spirit of open-source, R is very much a self-guided tool. We can look for solutions to R-related problems in multiple ways:

  1. Use the ? to open help pages for functions or packages (e.g., try ?summary in the console to see how the summary function works)

  2. For tricky questions and funky error messages (there are many of these), use Google (add “in R” to the end of your query)

  3. We can also use RSeek (https://rseek.org/) - a search engine just for R

  4. StackOverflow (https://stackoverflow.com/) has become a great resource with many questions for many specific packages in R, and a rating system for answers
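
Besides ?, R has a few other built-in ways to look things up from the console. The lines below are a short sketch of the ones I would try first; the object name matches is made up for this example.

```r
# Open the help page for a function (same as typing ?summary)
help("summary")

# Show the arguments that a function accepts
args(cor)

# List all loaded functions and objects whose names start with "summary"
matches <- apropos("^summary")
matches

# To search all installed help pages for a keyword, type ??correlation
# in the console (a shortcut for help.search("correlation"))
```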

2.2.4 Exercise 2

  1. Create two new variables age and salary for five persons:
  • age: 21, 24, 32, 45, 52
  • salary: 4500, 3500, 4100, 4700, 6000
  2. Then, type the following code in your script and run it to find the correlation between age and salary:
cor(age, salary)