Factors in R Programming
In R programming, factors are variables that take on a limited number of different values. Factors are used to represent categorical data.
Some examples of factors:
- A common example of a factor is gender, which can take the category values Male and Female.
- A data field such as marital status may contain only values from single, married, separated, divorced, or widowed.
- Stocks can be categorized as Large-cap, Mid-cap, and Small-cap.
In R, the function factor() is used to encode a vector as a factor. In the following example, we first create a vector that classifies five stocks as large-cap, mid-cap, or small-cap, and then use the factor() function to encode this vector as a factor.
#The following vector classifies 5 stocks
stock_vector <- c("large-cap","small-cap","large-cap","mid-cap","small-cap")
# Convert the stock vector to a factor
stock_factor <- factor(stock_vector)
#Print the stock_factor
stock_factor
When you print this factor, the result looks as follows:
> stock_factor
[1] large-cap small-cap large-cap mid-cap small-cap
Levels: large-cap mid-cap small-cap
>
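To make the encoding more concrete, the following minimal sketch inspects the same factor with the base R functions levels(), nlevels() and as.integer(). It assumes the stock_vector from the example above and only illustrates that a factor stores one integer code per element plus a table of level names.

```r
# Same vector as in the example above
stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")
stock_factor <- factor(stock_vector)

# The distinct categories, in their default sorted order
levels(stock_factor)

# How many categories the factor has
nlevels(stock_factor)

# The integer code of each element, i.e. its position in levels(stock_factor)
as.integer(stock_factor)
```

Each code is an index into the levels, so the first stock (large-cap) is stored as 1 and the second (small-cap) as 3.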
Levels and Order
When you print the factor, R also prints its levels. By default, the levels are sorted by their character value (alphabetically for strings). To change this default order, pass the levels= argument a vector of all the possible values of the variable in the order you want.
Factors can be unordered or ordered. For example, we can treat the gender factor (Male and Female) as unordered, because it does not matter which level comes first. Other categories have a natural order: in our stock factor, we may want the levels ordered by market capitalization (small-cap being the smallest and large-cap the largest). If the ordering should also be used when performing comparisons, use the optional ordered=TRUE argument. In this case, the factor is known as an ordered factor.
We can now update our factor to use pre-defined levels and set ordered=TRUE.
# Convert the stock vector to a factor
stock_factor <- factor(stock_vector, ordered=TRUE, levels=c("small-cap", "mid-cap", "large-cap"))
#Print the stock_factor
stock_factor
Results:
> #Print the stock_factor
> stock_factor
[1] large-cap small-cap large-cap mid-cap small-cap
Levels: small-cap < mid-cap < large-cap
>
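One consequence of supplying levels= is worth seeing in a small sketch: any value in the vector that is not listed among the levels is stored as NA. The value "micro-cap" below is a hypothetical category invented for this illustration and is not part of the original example.

```r
# "micro-cap" is a made-up value that is NOT among the declared levels
stock_vector <- c("large-cap", "small-cap", "micro-cap")

stock_factor <- factor(stock_vector,
                       ordered = TRUE,
                       levels = c("small-cap", "mid-cap", "large-cap"))

# The unlisted value is stored as a missing value
is.na(stock_factor)

# The factor is ordered, so comparisons follow the declared level order
is.ordered(stock_factor)
stock_factor[1] > stock_factor[2]   # large-cap compared with small-cap
```

Checking is.na() after building a factor with explicit levels is a simple way to catch misspelled or unexpected categories.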
Changing Levels
Sometimes you may want to rename the levels of an existing factor, either for clarity or to relate them to something else in your model. In R, you can do so using the levels() function. Let's say that our original factor contained the letters L, M and S to represent the three types of stocks. We can change the levels to large-cap, mid-cap and small-cap by assigning new names with levels(). The new names are matched to the existing levels by position, so they must be listed in the same order as the current levels (here S, M, L).
# The following vector classifies 5 stocks
stock_vector <- c("L","S","L","M","S")
# Convert the stock vector to a factor
stock_factor <- factor(stock_vector, ordered=TRUE, levels=c("S","M","L"))
levels(stock_factor) <- c("small-cap", "mid-cap", "large-cap")
#Print the stock_factor
stock_factor
In the results, you will have new levels applied to the factor.
> #Print the stock_factor
> stock_factor
[1] large-cap small-cap large-cap mid-cap small-cap
Levels: small-cap < mid-cap < large-cap
>
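As an alternative sketch, levels() also accepts a named list on the right-hand side, where each name is the new level and each value is the old level it replaces. This pairs every new name with its old code explicitly, so the mapping does not depend on remembering the position of each level. The example reuses the L/M/S coding from above.

```r
# The same coded vector as above
stock_vector <- c("L", "S", "L", "M", "S")
stock_factor <- factor(stock_vector, ordered = TRUE, levels = c("S", "M", "L"))

# Rename using a named list: new name = old level
levels(stock_factor) <- list("small-cap" = "S",
                             "mid-cap"   = "M",
                             "large-cap" = "L")

# Inspect the renamed levels and the relabelled values
levels(stock_factor)
as.character(stock_factor)
```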
Summarize a Factor
We can use the summary() function to summarize the contents of the factor variable. As you can see, it prints a quick snapshot of how many stocks of each type you have in your portfolio.
> #Summarize stock_factor
> summary(stock_factor)
small-cap mid-cap large-cap
2 1 2
>
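The same counts can also be obtained with the base R function table(), and prop.table() turns them into proportions. The sketch below assumes the five-stock factor used throughout this section.

```r
# Rebuild the ordered stock factor used in this section
stock_factor <- factor(c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap"),
                       ordered = TRUE,
                       levels = c("small-cap", "mid-cap", "large-cap"))

# Count how many stocks fall in each category
counts <- table(stock_factor)
counts

# Express the counts as a share of the whole portfolio
shares <- prop.table(counts)
shares
```

With two small-cap, one mid-cap and two large-cap stocks, the shares come out as 0.4, 0.2 and 0.4.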
Use of Ordered Factor
In R, the most visible effect of using an ordered rather than an unordered factor is in how the output is printed. Beyond that, the levels and their order can matter in linear modelling, where the first level of an unordered factor is used as the baseline level by default. We will learn about these use cases when we learn linear modelling. Here, we will take a simple example to understand the use of ordering.
Let's say you have five stock traders in your team, and you have their performance evaluated as "Poor", "Average", and "Good". The following R script shows how we can compare the performances of these traders.
> # The following vector rates the performance of 5 traders
> performance_vector <- c("Good","Average","Poor","Poor","Good")
>
> # Convert the performance vector to an ordered factor
> performance_factor <- factor(performance_vector, ordered=TRUE, levels=c("Poor","Average","Good"))
>
> #Print the performance_factor
> performance_factor
[1] Good Average Poor Poor Good
Levels: Poor < Average < Good
>
> #Summarize performance_factor
> summary(performance_factor)
Poor Average Good
2 1 2
>
> #Performance value of 2nd and 4th trader
> pv2 <- performance_factor[2]
> pv4 <- performance_factor[4]
>
> #Is trader 2 better than trader 4?
> pv2 > pv4
[1] TRUE
>
Note that if the factor were not ordered, this comparison would not work: R returns NA and issues a warning that the comparison operator '>' is not meaningful for factors. Once you set ordered=TRUE, R recognizes the comparison operator.
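The difference between the two kinds of factor can be sketched side by side. The example below reuses the trader performance ratings. The names unordered_factor, ordered_factor and unordered_result are chosen for this illustration, and suppressWarnings() is used only to keep the output quiet.

```r
performance_vector <- c("Good", "Average", "Poor", "Poor", "Good")
rating_levels <- c("Poor", "Average", "Good")

# Same data, once as an unordered and once as an ordered factor
unordered_factor <- factor(performance_vector, levels = rating_levels)
ordered_factor   <- factor(performance_vector, levels = rating_levels, ordered = TRUE)

# Unordered: '>' is not meaningful, so the result is NA (plus a warning)
unordered_result <- suppressWarnings(unordered_factor[2] > unordered_factor[4])
unordered_result

# Ordered: the comparison follows Poor < Average < Good
ordered_factor[2] > ordered_factor[4]

# Ordered factors also support max() and comparison against a level name
max(ordered_factor)
ordered_factor >= "Average"
```

Comparing against a level name, as in the last line, gives a logical vector that can be used to select the traders rated "Average" or better.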
Statistical Modelling
One of the most important uses of factors is in statistical modeling. Since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly. (Note: Categorical variables are different from continuous variables in that a categorical variable can take on a limited number of categories while a continuous variable can have an infinite number of values.)
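As a rough illustration of how a factor enters a model, the sketch below uses model.matrix() from the stats package (loaded by default in R) to show the indicator columns R builds for an unordered factor under its default treatment contrasts. The variable names cap, design and cap_releveled are made up for this example, and relevel() is shown as one way to choose a different baseline level.

```r
# An unordered factor with explicitly chosen level order
cap <- factor(c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap"),
              levels = c("small-cap", "mid-cap", "large-cap"))

# Design matrix: an intercept plus one 0/1 column per non-baseline level.
# The first level ("small-cap") is the baseline and gets no column.
design <- model.matrix(~ cap)
design

# Make "large-cap" the baseline instead
cap_releveled <- relevel(cap, ref = "large-cap")
levels(cap_releveled)
```

A continuous variable would contribute a single numeric column, whereas the three-level factor contributes two indicator columns measured relative to the baseline.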