The data science community has grown steadily over the past decade, and the tools used within the industry have been a driving force behind this growth. Learning a programming language like R lets us contribute to this field. This article introduces the basic concepts of R, from variables and data types to vectors, arrays, matrices, factors, and lists. Going through it should give you enough information to start coding in R.
What is R?
R is a software environment for statistical computing and graphics. It is an implementation of the S language, which was developed at Bell Labs.
It provides many statistical techniques, such as linear and non-linear modeling, statistical tests, classification, and clustering.
One of the greatest strengths of the R language is the large community that uses it. As a result, many optimized libraries are readily available to make life simpler for beginners.
Before we begin, we need to ensure that the R interpreter is installed on our local machine. In this article, we’ll focus on programming concepts. You can use online editors like Repl.it for this tutorial.
For further information on installation, refer to this article.
The RStudio interface has four components that we use the most.
The four components are as follows:
- Editor: The editor is used to create R scripts. R scripts are executed using the run button or the console. The run button is at the top of the RStudio window. To execute a file using the console, follow these steps:
  - Set the current working directory to the directory containing the R script. Use the command `setwd(dir)` to set the working directory to the directory of your choice. Use `getwd()` to get the current working directory.
  - Use the `source` command to execute the script file of your choice. `source(file_name)` executes the entire file and gives the output in the console section. The `source(file_name)` command is typed and executed via the console, which we will cover in the next point.
- Console and Terminal: The console in R is similar to IDLE in Python. We can execute any command here. For this tutorial, you may use the console.
- Environment Variables: The environment variables window lists all the variables created during program execution and their corresponding values. It is used to understand the dimensionality and data types of the data being used by a program.
- Plots, Package Manager, and Help Section: The plots section displays any plots generated by the R script. The package manager gives us a list of installed packages and lets us install packages if required. Finally, the help section is a go-to for developers of all levels. The ability to browse the official docs from within RStudio saves programmers much time.
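The console steps described under the Editor component can be sketched as follows. This is only an illustration: the script name `hello.R` and the variable `greeting` are hypothetical, and the script is written to a temporary directory so the example is self-contained.

```r
# Hypothetical script written to a temporary directory, purely for illustration
script_dir  <- tempdir()
script_path <- file.path(script_dir, "hello.R")
writeLines('greeting <- "Hello from a sourced script"', script_path)

old_dir <- getwd()      # remember the current working directory
setwd(script_dir)       # move to the directory containing the script
source("hello.R")       # run the whole file; it defines `greeting`
print(greeting)
setwd(old_dir)          # restore the original working directory
```

Restoring the original working directory at the end is a habit worth keeping, since `setwd` affects every relative path used afterwards in the session.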
In this article, we will be covering the following concepts:
- Variables, Data Types, and Operations.
Variables, Data Types, and Operations
R includes the following atomic data types: logical, numeric, integer, complex, raw, and character. Atomic data types are the most fundamental building blocks in the language.
Abstract data types such as lists and stacks are built from these atomic data types and are used in more complex scenarios.
Variables are named memory locations that store values during the execution of a program. To define a variable in R, we use the assignment operator `<-` (a left-pointing arrow).
Let’s look at the code sample below to learn how to declare variables.
    variable_x <- 21                         # assign 21 to variable_x
    variable_y <- 54                         # assign 54 to variable_y
    variable_sum <- variable_x + variable_y  # sum variable_x and variable_y and store it in variable_sum
    variable_sum                             # Output: 75
    variable_complex <- 5 + 6i               # assign 5 + 6i to a complex variable, variable_complex
    variable_character <- "Character"        # assign "Character" to variable_character
    variable_logical <- TRUE                 # assign boolean value TRUE to variable_logical

    # To check the type of a variable, use the class function
    class(variable_sum)        # Output: numeric
    class(variable_character)  # Output: character
    class(variable_logical)    # Output: logical
    class(variable_complex)    # Output: complex
In the above example, we never mention the data type explicitly while creating the variables. R infers it automatically.
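The integer and raw types from the list of atomic types do not appear in the example above. A minimal sketch of how they can be created is shown here; the variable names are illustrative only.

```r
count_integer <- 42L          # the L suffix requests an integer
count_numeric <- 42           # without it, the value is numeric (double)
byte_value    <- as.raw(255)  # raw holds a single byte, printed in hexadecimal

class(count_integer)  # "integer"
class(count_numeric)  # "numeric"
class(byte_value)     # "raw"

# Conversions between types are explicit
seven_plus_one <- as.integer("7") + 1L  # 8
is.numeric(count_integer)               # TRUE: integers also count as numeric
```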
R supports all the necessary operations such as addition, subtraction, multiplication, division, modulus, exponentiation. More complex operations are also possible using these basic operations. Examples of complex operations are dot products, cross products, matrix determinants, matrix inverses, etc.
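As a quick illustration of these operators, the sketch below applies each of them to two numbers and then shows a dot product, a determinant, and a matrix inverse. The variable names and values are chosen only for this example.

```r
x <- 7
y <- 2
x + y    # addition: 9
x - y    # subtraction: 5
x * y    # multiplication: 14
x / y    # division: 3.5
x %% y   # modulus: 1
x ^ y    # exponentiation: 49

u <- c(1, 2, 3)
v <- c(4, 5, 6)
dot_product <- sum(u * v)  # 1*4 + 2*5 + 3*6 = 32

m <- matrix(c(2, 1, 1, 3), nrow = 2)
m_det <- det(m)    # 2*3 - 1*1 = 5, up to floating-point rounding
m_inv <- solve(m)  # matrix inverse
m_inv %*% m        # approximately the 2x2 identity matrix
```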
Vectors are data constructs that store many elements of a single data type. They are defined using the `c()` function, which combines its arguments into a vector.
Let’s look at a few examples to get started with vectors:
    vector_num <- c(2,4,6,8)            # vector_num created with numerical elements
    vector_char <- c('a','b','c','c')   # vector_char created with character elements
    vector_logic <- c(FALSE, TRUE)      # vector_logic created with boolean elements
All operations such as addition, subtraction, multiplication, and division can be performed on vectors. Do give it a try. For your reference, I have attached the code below. Try it out on the console and observe the output.
    a <- c(1,2,3,4)  # vector definition
    b <- c(2,4,6,8)  # vector definition
    a + b            # addition; output: 3 6 9 12
    a - b            # subtraction; output: -1 -2 -3 -4
    a * b            # multiplication; output: 2 8 18 32
    a / b            # division; output: 0.5 0.5 0.5 0.5
    a %% b           # modulus (returns the remainder after division); output: 1 2 3 4
Arrays are data structures built over vectors. Vectors are one-dimensional, whereas arrays extend to multiple dimensions. The input arguments to `array` are a vector holding the data (which may itself combine several vectors) and the dimensions of the array. Ideally, the product of the dimensions equals the total number of elements in the input.
Let’s consider the code example below, which declares a two-dimensional array. The input has 11 elements, and we specify the dimensions to be 3x3, which leaves room for only 9. Therefore, the last 2 elements are excluded from the final result.
    vector_1 <- c(1,2,3,4)          # vector definition
    vector_2 <- c(4,22,32,5,6,7,8)  # vector definition
    array_from_vectors_1_and_2 <- array(c(vector_1, vector_2), dim=c(3,3))  # array creation using array keyword
    array_from_vectors_1_and_2      # print array defined in the previous step
The output consists of one matrix of dimension 3x3. The `dim` argument of `array` is what enables multi-dimensional arrays, whereas vectors are limited to one-dimensional data. Observe the 3x3 matrix in the output given below:

    array_from_vectors_1_and_2
         [,1] [,2] [,3]
    [1,]    1    4   32
    [2,]    2    4    5
    [3,]    3   22    6
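The example above is two-dimensional. To show what `dim` makes possible beyond that, here is a small sketch of a three-dimensional array; the name `cube` and its contents are illustrative only.

```r
# 12 elements arranged as 2 rows x 3 columns x 2 layers
cube <- array(1:12, dim = c(2, 3, 2))

dim(cube)      # 2 3 2
cube[, , 1]    # first layer: a 2x3 matrix holding 1..6, filled column-wise
cube[1, 2, 2]  # row 1, column 2 of the second layer: 9
```

Elements are filled along the first dimension first, then the second, then the third, which is why the second layer starts at 7.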
R lets us identify each index with a custom name. For example, let’s define a vector `first_quarter` and initialize it with the numbers 1, 2, and 3. We define another vector, `first_quarter_names`, with the names of the first three months.
Using the `names` function, we can assign these names to the elements of `first_quarter`. Custom index names are useful when dealing with large amounts of data.
    first_quarter <- c(1,2,3)                       # first_quarter vector definition
    first_quarter_names <- c("Jan","Feb", "Mar")    # first_quarter_names holds the names of the elements
    names(first_quarter) <- first_quarter_names     # assigning names to elements using names function
The `names` function helps us label and tabulate the data for better understanding in the future, which makes it an essential tool for data analysts. We can then use the new names in place of the default numeric indices.
Let’s consider the example below, which illustrates this point.
    # declared in the previous example, where first_quarter is assigned c(1,2,3) and names vector c("Jan","Feb","Mar")
    first_quarter["Jan"]  # indexing elements using the assigned names
The above piece of code returns the element named Jan, whose value is 1. Hence, we can use a vector together with its names as key-value pairs. Key-value pairs are useful for building lookup tables and storing information.
Vectors can also be sliced. Slicing refers to obtaining a subset of consecutive elements from the vector. For example, to get the first two months stored in the vector `first_quarter`, we can use the following code.
    list_slice <- first_quarter[c(1:2)]  # list_slice contains Jan and Feb as elements
Notice that in the above code, the slicing operation begins at index 1. R is a one-indexed programming language, whereas many languages such as C, Java, and Python are zero-indexed. In R, the index assigned to the first element of a vector is 1.
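The different ways of indexing a named vector can be compared side by side. The sketch below is an illustration only: it builds `first_quarter` with names supplied directly inside `c()`, a shorthand for the `names` assignment used earlier.

```r
# Names given inline; equivalent to assigning them afterwards with names()
first_quarter <- c(Jan = 1, Feb = 2, Mar = 3)

first_quarter[1]                  # first element (Jan), since indexing starts at 1
first_quarter[2:3]                # slice: Feb and Mar
first_quarter[c("Jan", "Mar")]    # select by name
all_but_first <- first_quarter[-1]           # negative index drops an element
feb_value <- unname(first_quarter["Feb"])    # 2, with the name stripped
```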
Matrices are 2-dimensional (rectangular) structures used commonly in the field of computer science. One application of matrices is in the field of computer vision. Each image is treated as a stack of three matrices, each representing a different color channel. R offers libraries highly optimized for matrix operations.
Let’s look at defining matrices.
Defining a Matrix
In R, we create new matrices using the `matrix()` function. The `matrix()` function takes a vector containing the elements, the orientation, and the number of rows or columns. The orientation refers to the order in which the elements are stored. When the argument `byrow` is `TRUE`, the elements are stored row-wise. To store them column-wise, we set `byrow` to `FALSE`, which is the default; `matrix()` has no `bycol` argument.

    matrix(c(1:16), byrow=FALSE, ncol=4)  # filled column-wise; the matrix is 4x4, since num of columns = 4 and 16/4 = 4
    matrix(c(1:16), byrow=TRUE, nrow=4)   # filled row-wise; num of rows = 4, therefore num of cols = 16/4 = 4
If the vector’s length is not a multiple of the number of rows/columns mentioned in the function, then R displays a warning and fills up the remaining spaces by recycling the first few elements of the vector.
    matrix(c(1:16), byrow=TRUE, nrow=3)  # outputs a matrix of dimension 3x6. Repeats data to fill in empty values
Since the number of elements is 16 and the number of rows specified is 3, each row would need 5.33 elements. The value is rounded up, so each row contains 6 elements. Observe that the values 1 and 2 repeat in the last row.
The code above generates the following output.
    Warning message:
    data length [16] is not a sub-multiple or multiple of the number of rows [3]
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]    1    2    3    4    5    6
    [2,]    7    8    9   10   11   12
    [3,]   13   14   15   16    1    2
Just like vectors, matrices also have the option to name the rows and columns. The functions available are `rownames()` and `colnames()`, respectively. They take in a vector containing the names corresponding to the indices. Consider the following example:
While defining the matrix `new_matrix`, we can also directly specify the range of numbers to be stored in it. R treats the range as a vector and proceeds with the creation of the matrix. Therefore, `c(1:4)` is equivalent to `1:4` while defining matrices.
    new_matrix <- matrix(1:4, byrow=TRUE, ncol=2)      # defining a new matrix of dimension 2x2
    colnames(new_matrix) <- c("column 1", "column 2")  # assigning names to column indices (one per column)
    rownames(new_matrix) <- c("row 1", "row 2")        # assigning names to row indices (one per row)
Let’s say you want to compute the average of the entries in your matrix. We begin by adding them up and then dividing by the total number of entries. R offers functions to sum entire rows or columns: `rowSums` and `colSums`. These functions take the matrix as their input and output a vector containing the sums.
    x <- matrix(1:9, byrow = TRUE, nrow=3)  # 3x3 matrix defined
    x                                       # outputs the matrix x
    row_sum <- rowSums(x)                   # sums computed row-wise
    col_sum <- colSums(x)                   # sums computed column-wise
    row_sum  # printing row_sum: computed as 1 + 2 + 3, 4 + 5 + 6, 7 + 8 + 9
    col_sum  # printing col_sum: computed as 1 + 4 + 7, 2 + 5 + 8, 3 + 6 + 9
The output is given as follows:
    x
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    [3,]    7    8    9
    row_sum
    [1]  6 15 24
    col_sum
    [1] 12 15 18
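The averaging step mentioned at the start of this part can be finished from the row sums. The sketch below does it by hand and then with the built-in `mean`, `rowMeans`, and `colMeans` functions; the variable names `total` and `average` are illustrative.

```r
x <- matrix(1:9, byrow = TRUE, nrow = 3)

total   <- sum(rowSums(x))    # 6 + 15 + 24 = 45
average <- total / length(x)  # 45 entries' sum divided by 9 entries = 5
average

mean(x)      # built-in equivalent: 5
rowMeans(x)  # per-row averages: 2 5 8
colMeans(x)  # per-column averages: 4 5 6
```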
Inserting new rows and columns to existing matrices
Let’s consider the situation where we need to modify a matrix by adding extra rows or columns. R provides two functions, `cbind()` and `rbind()`, to help us do this.
The input to the function is the matrix to be modified followed by the vector to be inserted.
Consider the example given below:
    x <- matrix(1:9, byrow = TRUE, nrow=3)  # 3x3 matrix defined
    new_matrix_with_additional_column <- cbind(x, c(1,2,3))  # 3x4 matrix defined with additional column (1,2,3)
    new_matrix_with_additional_column  # print the 3x4 matrix on console
    # use the column-bound matrix as input for row binding
    new_matrix_with_additional_row <- rbind(new_matrix_with_additional_column, c(4,5,6,4))  # 4x4 matrix defined with additional row (4,5,6,4)
    new_matrix_with_additional_row  # print the new 4x4 matrix on console
The output for the code above is given as follows:
    new_matrix_with_additional_column
         [,1] [,2] [,3] [,4]
    [1,]    1    2    3    1
    [2,]    4    5    6    2
    [3,]    7    8    9    3
    new_matrix_with_additional_row
         [,1] [,2] [,3] [,4]
    [1,]    1    2    3    1
    [2,]    4    5    6    2
    [3,]    7    8    9    3
    [4,]    4    5    6    4
Observe the notation [,1] vs. [1,] in the output matrices. These notations originate from the slicing operations in matrices. [1,] denotes the selection of the first row. [,1] denotes the selection of the first column. Therefore, R outputs these to denote the row and column numbers on the console.
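The row and column notation can be tried directly as indexing expressions. A minimal sketch, reusing the 3x3 matrix from the earlier examples:

```r
x <- matrix(1:9, byrow = TRUE, nrow = 3)

first_row    <- x[1, ]  # first row: 1 2 3
first_column <- x[, 1]  # first column: 1 4 7
x[2, 3]                 # single element in row 2, column 3: 6

# By default a single row or column is returned as a plain vector;
# drop = FALSE keeps the result as a 1x3 matrix instead
row_as_matrix <- x[1, , drop = FALSE]
dim(row_as_matrix)      # 1 3
```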
The final concept to be covered under matrices is the selection of elements. Matrices in R work like 2-D arrays in other programming languages, so the usual slicing operations apply. Let’s assume we have a 4x4 matrix. Its top-left quarter is a 2x2 matrix. To get this quarter of the matrix, we’ll use the following code:
    x <- matrix(1:16, byrow=TRUE, nrow=4)  # matrix definition; 1:16 denotes the integers from 1 to 16
    x                                      # output matrix on console
    x[1:2,1:2]                             # selection operation: rows 1 to 2, columns 1 to 2
The output is given as follows:
    x
         [,1] [,2] [,3] [,4]
    [1,]    1    2    3    4
    [2,]    5    6    7    8
    [3,]    9   10   11   12
    [4,]   13   14   15   16
    x[1:2,1:2]
         [,1] [,2]
    [1,]    1    2
    [2,]    5    6
With the solid knowledge of vectors, matrices, and various data types, let’s shift gears and move towards factors and lists.
We use factors when we deal with categorical features. Categorical features are those that can take only a limited number of values.
Continuous variables, on the contrary, range over the real numbers and can take any value on the number line.
For example, classification algorithms have discrete output variables, whereas regression algorithms have continuous output variables. We can use vectors and matrices to store continuous output variables. But in the case of categorical values, we use factors. Factors also allow us to assign an ordering to the categories through the use of levels.
R provides us with the function `factor` to deal with categorical values. Examples are the best way to understand the implementation. The example below creates a vector and passes it to the `factor` function, which creates the 3 categories with default levels.
    factor_vector <- c("Big","Small","Medium")  # creating a new vector
    factored_vector <- factor(factor_vector)    # using the vector factor_vector to create a factor
    factored_vector                             # output the created factor
The code above outputs the following:
    [1] Big    Small  Medium
    Levels: Big Medium Small
You might be wondering about the levels shown in the output. The levels list the distinct categories; by default they are sorted alphabetically and carry no relative ordering. Let’s say we want to specify an order for the categories, with Small lowest, Medium in the middle, and Big highest.
`factor` lets us encode a vector of character values as integer codes and thereby categorize them.
We update the code as follows:
    factor_vector <- c("Big", "Small", "Medium")  # create a character vector comprising the categories
    # assign levels to the categories using the levels argument; ordered=TRUE makes the
    # level order meaningful. If ordered=FALSE, the factor is treated as unordered
    factored_vector <- factor(factor_vector, ordered=TRUE, levels=c("Small","Medium","Big"))
    factored_vector
We obtain the desired output.
    [1] Big    Small  Medium
    Levels: Small < Medium < Big
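Once a factor is ordered, its values can be compared with each other. The sketch below is an illustration with a hypothetical variable `sizes`; it also shows the integer codes R stores internally, which start at 1 and follow the order of the levels.

```r
sizes <- factor(c("Big", "Small", "Medium"),
                ordered = TRUE,
                levels = c("Small", "Medium", "Big"))

size_codes <- as.integer(sizes)  # internal codes: 3 1 2
sizes[1] > sizes[2]              # TRUE, because Big ranks above Small
sizes[sizes >= "Medium"]         # keeps Big and Medium
largest <- max(sizes)            # Big
```

These comparisons are only defined for ordered factors; on an unordered factor, `>` returns `NA` with a warning.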
The final function one must know under factors, which is used quite frequently, is `summary()`. The summary function summarizes a factor and gives the total number of occurrences of each category in the initial vector.
    factor_vector <- c("Big", "Small", "Medium")
    factored_vector <- factor(factor_vector, ordered=TRUE, levels=c("Small","Medium","Big"))
    factored_vector
    summary(factored_vector)

The call to `summary` in the code above outputs:

     Small Medium    Big
         1      1      1
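Because every category occurs exactly once in that example, the counts are not very telling. The sketch below uses a hypothetical vector `observed_sizes` with repeated categories to make the counting visible.

```r
# Hypothetical observations in which "Big" occurs three times
observed_sizes <- c("Big", "Small", "Big", "Medium", "Big")
size_factor <- factor(observed_sizes,
                      ordered = TRUE,
                      levels = c("Small", "Medium", "Big"))

size_counts <- summary(size_factor)  # counts per level: Small 1, Medium 1, Big 3
size_counts
size_counts["Big"]                   # the result is a named vector, so names work as indices
```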
An R list is an object consisting of an ordered collection of objects known as its components. The components can belong to different data types: numeric, complex, character, or logical values, as well as vectors, factors, or matrices. This flexibility is what sets lists apart from vectors and matrices. Let’s look at lists and the functions they offer.
Let’s begin by declaring a list.
    vector_to_be_inserted_in_list <- c(1:5)                                # create a vector
    text_to_be_inserted_in_list <- "Text Here!!! "                         # create a character string
    matrix_to_be_inserted_in_list <- matrix(1:16, nrow=4, byrow=TRUE)      # create a 4x4 matrix
    # define the new list using the list function with the objects defined earlier as input arguments
    # it has a vector, a character string, and a matrix as its input
    new_list <- list(vector_to_be_inserted_in_list, text_to_be_inserted_in_list, matrix_to_be_inserted_in_list)
    new_list
The list is displayed as given below. The entire output displayed below is the list `new_list` defined above.
The first part of the output is the numerical vector. The second part corresponds to the character string, and the final part is the 4x4 matrix.
    [[1]]
    [1] 1 2 3 4 5

    [[2]]
    [1] "Text Here!!! "

    [[3]]
         [,1] [,2] [,3] [,4]
    [1,]    1    2    3    4
    [2,]    5    6    7    8
    [3,]    9   10   11   12
    [4,]   13   14   15   16
We can get the elements of the list using indexes. To get the matrix, type `new_list[[3]]` in the console. Note that R is a one-indexed language, so all indices begin with 1.
Lists also allow us to name the elements. The `names` function is used for this purpose. It gives a name to each element in the list.
    vector_to_be_inserted_in_list <- c(1:5)                            # vector definition
    text_to_be_inserted_in_list <- "Text Here!!! "                     # string definition
    matrix_to_be_inserted_in_list <- matrix(1:16, nrow=4, byrow=TRUE)  # matrix definition
    new_list <- list(vector_to_be_inserted_in_list, text_to_be_inserted_in_list, matrix_to_be_inserted_in_list)
    names(new_list) <- c("vector", "text", "matrix")  # names vector assigned to each element of new_list
    new_list
new_list is updated with names shown below:
    $vector
    [1] 1 2 3 4 5

    $text
    [1] "Text Here!!! "

    $matrix
         [,1] [,2] [,3] [,4]
    [1,]    1    2    3    4
    [2,]    5    6    7    8
    [3,]    9   10   11   12
    [4,]   13   14   15   16
Using the names, we can select the elements. The general syntax is the name of the variable, followed by `$`, followed by the name assigned. To select the text using its name, we use the command `new_list$text`.
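The ways of reaching into a list can be compared in one place. The sketch below is an illustration that rebuilds `new_list` with the names supplied directly inside `list()`, a shorthand for the separate `names` assignment used earlier.

```r
new_list <- list(vector = 1:5,
                 text   = "Text Here!!! ",
                 matrix = matrix(1:16, nrow = 4, byrow = TRUE))

selected_text <- new_list$text   # select by name with $
new_list[["matrix"]]             # the same kind of access by name with [[ ]]
new_list[[3]][2, 3]              # third element, then row 2, column 3: 7

# Single brackets return a smaller list, not the element itself
class(new_list["text"])          # "list"
class(new_list$text)             # "character"
```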
In this article, we have learned the basics of R programming. It’s a potent tool used extensively in the data science community. Try out the code given to gain the maximum value from the tutorial. Also, try out Repl.it and use it often. It provides hassle-free access to faster experimentation and prototyping.
Peer Review Contributions by: Sophia Raji