Factors in R Programming
In R programming, factors are variables that take on a limited number of different values. Factors are used to represent categorical data.
Some examples of factors:
- A common example of a factor is gender, which can take the category values Male and Female.
- A data field such as marital status may contain only values from single, married, separated, divorced, or widowed.
- Stocks can be categorized as Large-cap, Mid-cap, and Small-cap.
In R, the function factor() is used to encode a vector as a factor. In the following example, we first create a vector that categorizes stocks as large-cap, mid-cap, and small-cap, and then use the factor() function to encode this vector as a factor.
    # The following vector classifies 5 stocks
    stock_vector <- c("large-cap","small-cap","large-cap","mid-cap","small-cap")

    # Convert the stock vector to a factor
    stock_factor <- factor(stock_vector)

    # Print the stock_factor
    stock_factor
When you print this factor, the results will look as follows:
    > stock_factor
    [1] large-cap small-cap large-cap mid-cap   small-cap
    Levels: large-cap mid-cap small-cap
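To see what "encoding" means here, the following minimal sketch inspects the factor created above. It assumes the same stock_vector as in the example and uses only base R functions (levels(), nlevels(), as.integer()). Internally, each value is stored as an integer code that points into the vector of levels.

```r
# Same vector as in the example above
stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")
stock_factor <- factor(stock_vector)

# The distinct categories, sorted by character value by default
print(levels(stock_factor))

# How many distinct categories the factor has
print(nlevels(stock_factor))

# The integer codes behind each element (position in levels())
stock_codes <- as.integer(stock_factor)
print(stock_codes)
```

Each code is the position of the element's value within levels(stock_factor), so "large-cap" maps to 1 and "small-cap" maps to 3 under the default sorted order.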
Levels and Order
When you print the factor, you can see that it also prints the levels. By default, the levels are sorted by their character value. To change the order in which the levels are displayed, pass the levels= argument a vector of all the possible values of the variable in the order you desire.
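As a minimal sketch of the levels= argument, the snippet below builds the same stock factor twice: once with the default sorted levels and once with an explicit order. The variable names default_factor and custom_factor are illustrative only. Note that setting levels= changes the display order but does not by itself make the factor ordered.

```r
stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")

# Default: levels are sorted by character value
default_factor <- factor(stock_vector)
print(levels(default_factor))

# Explicit order: levels appear exactly as given
custom_factor <- factor(stock_vector,
                        levels = c("small-cap", "mid-cap", "large-cap"))
print(levels(custom_factor))

# The factor is still unordered; only the level order changed
print(is.ordered(custom_factor))
```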
Factors can be unordered or ordered. For example, we can consider the gender factor (Male and Female) to be an unordered factor, as it is not important which level comes first. However, some other categories may have an order associated with them. For example, in our stock factor we may want the levels ordered by market capitalization (small-cap being the smallest and large-cap being the largest). If the ordering should also be used when performing comparisons, use the optional ordered=TRUE argument. In this case, the factor is known as an ordered factor.
We can now update our factor to have predefined levels and set ordered to TRUE.
    # Convert the stock vector to an ordered factor
    stock_factor <- factor(stock_vector, ordered=TRUE, levels=c("small-cap", "mid-cap", "large-cap"))

    # Print the stock_factor
    stock_factor
    > #Print the stock_factor
    > stock_factor
    [1] large-cap small-cap large-cap mid-cap   small-cap
    Levels: small-cap < mid-cap < large-cap
Sometimes you may have a factor whose level names you want to change, either for clarity or to relate them to something else in your model. In R, you can do so using the levels() function. Let's say that our original factor contained the letters L, M and S to represent the three types of stocks. We can change the levels to large-cap, mid-cap and small-cap using the levels() function, supplying the new names in the same order as the existing levels:
    # The following vector classifies 5 stocks
    stock_vector <- c("L","S","L","M","S")

    # Convert the stock vector to an ordered factor
    stock_factor <- factor(stock_vector, ordered=TRUE, levels=c("S","M","L"))

    # Rename the levels (same order as the existing levels S, M, L)
    levels(stock_factor) <- c("small-cap", "mid-cap", "large-cap")

    # Print the stock_factor
    stock_factor
In the results, you will have new levels applied to the factor.
    > #Print the stock_factor
    > stock_factor
    [1] large-cap small-cap large-cap mid-cap   small-cap
    Levels: small-cap < mid-cap < large-cap
Summarize a Factor
We can use the summary() function to summarize the contents of the factor variable. As you can see, it prints a quick snapshot of how many stocks of each type you have in your portfolio.
    > #Summarize stock_factor
    > summary(stock_factor)
    small-cap   mid-cap large-cap
            2         1         2
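The same counts can also be obtained with base R's table() function. The sketch below rebuilds the renamed stock factor from the example above and tabulates it. The second call is an illustrative assumption rather than part of the original example: it shows that a level with no remaining observations is still reported, with a count of zero.

```r
stock_vector <- c("L", "S", "L", "M", "S")
stock_factor <- factor(stock_vector, ordered = TRUE, levels = c("S", "M", "L"))
levels(stock_factor) <- c("small-cap", "mid-cap", "large-cap")

# Count how many stocks fall into each level
counts <- table(stock_factor)
print(counts)

# Subsetting keeps all levels, so an empty level shows up with count 0
without_mid <- stock_factor[stock_factor != "mid-cap"]
counts_without_mid <- table(without_mid)
print(counts_without_mid)
```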
Use of Ordered Factor
In R, the most apparent effect of using an ordered rather than an unordered factor is in how the output is printed. Apart from this, ordering and levels can be important in linear modelling, where the first level of a factor is typically used as the baseline level. We will learn about these use cases when we learn linear modelling. Here, we will take a simple example to understand the use of ordering.
Let's say you have five stock traders in your team, and you have their performance evaluated as "Poor", "Average", or "Good". The following R script shows how we can compare the performances of these traders.
    > # The following vector records the performance of 5 traders
    > performance_vector <- c("Good","Average","Poor","Poor","Good")
    >
    > # Convert the performance vector to an ordered factor
    > performance_factor <- factor(performance_vector, ordered=TRUE, levels=c("Poor","Average","Good"))
    >
    > # Print the performance_factor
    > performance_factor
    [1] Good    Average Poor    Poor    Good
    Levels: Poor < Average < Good
    >
    > # Summarize performance_factor
    > summary(performance_factor)
       Poor Average    Good
          2       1       2
    >
    > # Performance value of 2nd and 4th trader
    > pv2 <- performance_factor[2]
    > pv4 <- performance_factor[4]
    >
    > # Is trader 2 better than trader 4?
    > pv2 > pv4
    [1] TRUE
Note that if the factor were not ordered, this comparison would not work: R returns NA and gives a warning that the comparison operator '>' is not meaningful for factors. Once you set ordered=TRUE, R recognizes the comparison operator.
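The difference can be sketched side by side. The snippet below reuses the performance ratings from the example and compares trader 2 with trader 4, first on an unordered factor and then on an ordered one. The names unordered_perf and ordered_perf are illustrative. suppressWarnings() is used only to keep the output quiet; without it, R would also print the warning.

```r
performance_vector <- c("Good", "Average", "Poor", "Poor", "Good")
perf_levels <- c("Poor", "Average", "Good")

# Unordered factor: the levels have a display order but no ranking
unordered_perf <- factor(performance_vector, levels = perf_levels)

# '>' is not meaningful here, so the result is NA (plus a warning)
unordered_result <- suppressWarnings(unordered_perf[2] > unordered_perf[4])
print(unordered_result)

# Ordered factor: levels are ranked Poor < Average < Good
ordered_perf <- factor(performance_vector, ordered = TRUE, levels = perf_levels)

# "Average" ranks above "Poor", so the comparison is TRUE
ordered_result <- ordered_perf[2] > ordered_perf[4]
print(ordered_result)
```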
One of the most important uses of factors is in statistical modeling. Since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly. (Note: Categorical variables differ from continuous variables in that a categorical variable can take on only a limited number of categories, while a continuous variable can take on an infinite number of values.)
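As a rough illustration of how a factor enters a model, the sketch below uses model.matrix() from the stats package to show the indicator columns R would build for the unordered stock factor. It assumes R's default contrast settings (treatment contrasts for unordered factors), under which the first level acts as the baseline and gets no column of its own. The relevel() call then picks a different baseline. The names design, releveled and design_releveled are illustrative.

```r
stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")
stock_factor <- factor(stock_vector)

# Indicator (dummy) columns a modeling function would use for this factor;
# the first level, "large-cap", is absorbed into the intercept as the baseline
design <- model.matrix(~ stock_factor)
print(colnames(design))

# Choose a different baseline level with relevel()
releveled <- relevel(stock_factor, ref = "small-cap")
design_releveled <- model.matrix(~ releveled)
print(colnames(design_releveled))
```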