Factors in R Programming
In R programming, factors are variables that take on a limited number of different values. Factors are used to represent categorical data.
Some examples of factors:
- A common example of a factor is gender, with categories such as Male and Female.
- A data field such as marital status may contain only values from single, married, separated, divorced, or widowed.
- Stocks can be categorized as Large-cap, Mid-cap, and Small-cap.
In R, the function factor() is used to encode a vector as a factor. In the following example, we first create a vector that categorizes stocks as large-cap, mid-cap, and small-cap, and then use the factor() function to encode this vector as a factor.
# The following vector classifies 5 stocks
stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")
# Convert the stock vector to a factor
stock_factor <- factor(stock_vector)
# Print the stock_factor
stock_factor
When you print this factor, the results will look as follows:
> stock_factor
[1] large-cap small-cap large-cap mid-cap   small-cap
Levels: large-cap mid-cap small-cap
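To see what factor() stores, it can help to look inside the object. The following is a minimal sketch that rebuilds the same stock_vector and uses only base R functions (levels(), nlevels(), as.integer()). The variable names stock_levels, n_levels, and stock_codes are illustrative, and the values in the comments assume the default alphabetical level order.

```r
# Rebuild the factor from the example above
stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")
stock_factor <- factor(stock_vector)

# The distinct categories, in their default sorted order
stock_levels <- levels(stock_factor)     # "large-cap" "mid-cap" "small-cap"

# The number of categories
n_levels <- nlevels(stock_factor)        # 3

# Each element is stored as an integer index into the levels
stock_codes <- as.integer(stock_factor)  # 1 3 1 2 3

print(stock_levels)
print(n_levels)
print(stock_codes)
```

The integer codes are positions in the levels vector, which is why the order of the levels matters in the sections that follow.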
Levels and Order
When you print the factor, you can see that it also prints the Levels. By default, the levels are sorted by their character value. To change the order in which the levels are displayed, pass the levels= argument a vector of all the possible values of the variable in the order you desire.
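As a small illustration of the levels= argument on its own, the sketch below builds the same data twice: once with the default sorted levels and once with a custom order. The names default_factor and custom_factor are made up for this example.

```r
stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")

# Default: levels are sorted by character value
default_factor <- factor(stock_vector)
print(levels(default_factor))   # "large-cap" "mid-cap" "small-cap"

# Custom order: supply every possible value in the order you want
custom_factor <- factor(stock_vector,
                        levels = c("small-cap", "mid-cap", "large-cap"))
print(levels(custom_factor))    # "small-cap" "mid-cap" "large-cap"

# The data values are unchanged; only the level order differs
same_values <- identical(as.character(default_factor),
                         as.character(custom_factor))
print(same_values)              # TRUE
```

Any value in the vector that is missing from levels= becomes NA, so the vector passed to levels= should cover every category that appears in the data.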
Factors can be unordered or ordered. For example, we can consider the gender factor (Male and Female) to be an unordered factor, as it is not important which level comes first. Other categories may have an order associated with them. For example, in our stock factor we may want the levels ordered by market capitalization (small-cap being the smallest and large-cap being the largest). If the ordering should also be used when performing comparisons, use the optional ordered=TRUE argument. In this case, the factor is known as an ordered factor.
We can now update our factor to have pre-defined levels and set ordered=TRUE.
# Convert the stock vector to an ordered factor
stock_factor <- factor(stock_vector, ordered=TRUE, levels=c("small-cap", "mid-cap", "large-cap"))
# Print the stock_factor
stock_factor
Results:
> # Print the stock_factor
> stock_factor
[1] large-cap small-cap large-cap mid-cap   small-cap
Levels: small-cap < mid-cap < large-cap
Changing Levels
Sometimes you may have a factor with values in it and want to rename its levels, either for clarity or to relate them to something else in your model. In R, you can do so using the levels() function. Let's say that our original factor contained the letters L, M and S to represent the three types of stocks. We can change the levels to large-cap, mid-cap and small-cap using the levels() function.
# The following vector classifies 5 stocks
stock_vector <- c("L", "S", "L", "M", "S")
# Convert the stock vector to a factor
stock_factor <- factor(stock_vector, ordered=TRUE, levels=c("S", "M", "L"))
# Rename the levels; the new names are matched by position to S, M, L
levels(stock_factor) <- c("small-cap", "mid-cap", "large-cap")
# Print the stock_factor
stock_factor
In the results, you will have new levels applied to the factor.
> # Print the stock_factor
> stock_factor
[1] large-cap small-cap large-cap mid-cap   small-cap
Levels: small-cap < mid-cap < large-cap
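Because assigning to levels() matches the new names to the existing levels by position, it is worth checking the current order before renaming. As a hedged alternative, the sketch below renames a single level by looking it up by its current name, which avoids relying on position. It uses only base R indexing; the choice to rename just "M" is illustrative.

```r
stock_factor <- factor(c("L", "S", "L", "M", "S"),
                       ordered = TRUE, levels = c("S", "M", "L"))

# Inspect the current level order before renaming anything
print(levels(stock_factor))   # "S" "M" "L"

# Rename one level by its current name rather than by its position
levels(stock_factor)[levels(stock_factor) == "M"] <- "mid-cap"

print(levels(stock_factor))   # "S" "mid-cap" "L"
print(stock_factor)
```

The ordering of the levels is preserved; only the label attached to the second level changes.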
Summarize a Factor
We can use the summary() function to summarize the contents of the factor variable. It prints a quick snapshot of how many stocks of each type you have in your portfolio.
> # Summarize stock_factor
> summary(stock_factor)
small-cap   mid-cap large-cap
        2         1         2
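The counts returned by summary() can also be used programmatically. The sketch below is one way to do that with base R: it pulls out a single count by level name and uses table() and prop.table() to get each category's share. The names counts, tab, and shares are illustrative.

```r
stock_factor <- factor(c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap"),
                       ordered = TRUE,
                       levels = c("small-cap", "mid-cap", "large-cap"))

# summary() on a factor returns a named vector of counts per level
counts <- summary(stock_factor)
print(counts[["large-cap"]])   # 2

# table() produces the same counts as a contingency table
tab <- table(stock_factor)
print(tab)

# Share of the portfolio in each category
shares <- prop.table(tab)
print(shares)
```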
Use of Ordered Factor
In R, the most visible effect of using an ordered rather than an unordered factor is in how the output is printed. Beyond this, ordering and levels can matter in linear modelling because the first level is used as the baseline level. We will cover these use cases when we learn linear modelling; here we take a simple example to understand the use of ordering.
Let's say you have five stock traders in your team, and you have their performance evaluated as "Poor", "Average", and "Good". The following R script shows how we can compare the performance of these traders.
> # The following vector rates the performance of 5 traders
> performance_vector <- c("Good","Average","Poor","Poor","Good")
>
> # Convert the performance vector to an ordered factor
> performance_factor <- factor(performance_vector, ordered=TRUE, levels=c("Poor","Average","Good"))
>
> # Print the performance_factor
> performance_factor
[1] Good    Average Poor    Poor    Good
Levels: Poor < Average < Good
>
> # Summarize performance_factor
> summary(performance_factor)
   Poor Average    Good
      2       1       2
>
> # Performance value of 2nd and 4th trader
> pv2 <- performance_factor[2]
> pv4 <- performance_factor[4]
>
> # Is trader 2 better than trader 4?
> pv2 > pv4
[1] TRUE
Note that if the factor were not ordered, this comparison would not work: R would return NA with a warning that the comparison operator '>' is not meaningful for factors. Once you set ordered=TRUE, R recognizes the comparison operator.
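To make the difference concrete, the sketch below builds the same performance data as both an unordered and an ordered factor and runs the same comparison on each. The names unordered_perf, ordered_perf, and level_order are illustrative, and the warning from the unordered comparison is suppressed only to keep the output tidy.

```r
performance_vector <- c("Good", "Average", "Poor", "Poor", "Good")
level_order <- c("Poor", "Average", "Good")

unordered_perf <- factor(performance_vector, levels = level_order)
ordered_perf   <- factor(performance_vector, levels = level_order, ordered = TRUE)

# Unordered: '>' yields NA, and R emits a warning (suppressed here)
unordered_result <- suppressWarnings(unordered_perf[2] > unordered_perf[4])
print(unordered_result)   # NA

# Ordered: the comparison follows the level order
ordered_result <- ordered_perf[2] > ordered_perf[4]
print(ordered_result)     # TRUE

# Ordered factors can also be compared against a level name
at_least_average <- ordered_perf >= "Average"
print(at_least_average)   # TRUE TRUE FALSE FALSE TRUE

# max() and min() are defined for ordered factors as well
best <- max(ordered_perf)
print(best)
```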
Statistical Modelling
One of the most important uses of factors is in statistical modeling. Since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly. (Note: Categorical variables differ from continuous variables in that a categorical variable can take on only a limited number of categories, while a continuous variable can take an infinite number of values.)
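As a rough illustration of how a factor enters a model, the sketch below uses base R's model.matrix() to show the indicator columns a modelling function would build from an unordered stock factor, with the first level acting as the baseline. It assumes R's default contrast settings. The names design, releveled, and design2 are made up for this example, and relevel() is used only to show how the baseline can be changed.

```r
stock_factor <- factor(c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap"))

# Design matrix for this factor: one indicator column per non-baseline level
design <- model.matrix(~ stock_factor)
print(colnames(design))
# "(Intercept)" "stock_factormid-cap" "stock_factorsmall-cap"
# The first level, "large-cap", is the baseline and gets no column of its own

# relevel() moves a chosen level to the front so that it becomes the baseline
releveled <- relevel(stock_factor, ref = "small-cap")
design2 <- model.matrix(~ releveled)
print(colnames(design2))
# "(Intercept)" "releveledlarge-cap" "releveledmid-cap"
```

Had the data been stored as a plain character or numeric vector, the modelling function would have had to guess how to treat it. With a factor, the set of categories and the baseline are explicit. Ordered factors are coded differently by default (with polynomial contrasts), so this sketch applies to unordered factors only.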