Factors in R Programming

In R programming, factors are variables that take on a limited number of different values. Factors are used to represent categorical data.

Some examples of factors:

  • A common example of a factor is gender, which can take the category values Male and Female.
  • A data field such as marital status may contain only the values single, married, separated, divorced, or widowed.
  • Stocks can be categorized as Large-cap, Mid-cap, and Small-cap.

In R, the factor() function encodes a vector as a factor. In the following example, we first create a vector that categorizes five stocks as large-cap, mid-cap, or small-cap, and then use factor() to encode this vector as a factor.

    # The following vector classifies 5 stocks
    stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")
    # Convert the stock vector to a factor
    stock_factor <- factor(stock_vector)
    # Print the stock_factor
    stock_factor

When you print this factor, the result looks as follows:

    > stock_factor
    [1] large-cap small-cap large-cap mid-cap   small-cap
    Levels: large-cap mid-cap small-cap
    >
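
To see what the encoding actually stores, the following sketch inspects the factor created above using only base R functions. It assumes the same five-stock vector; nothing else is added.

```r
# Same five stocks as in the example above
stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")
stock_factor <- factor(stock_vector)

# The distinct categories, in their default sorted order
levels(stock_factor)

# The number of categories
nlevels(stock_factor)

# Each element is stored as an integer index into levels():
# large-cap is level 1, mid-cap is level 2, small-cap is level 3
as.integer(stock_factor)   # 1 3 1 2 3
```

The printed labels are therefore a lookup: the integer codes say which level each stock belongs to, and levels() holds the names.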

Levels and Order

When you print the factor, R also prints its levels. By default, the levels are sorted by their character value (alphabetically, in this example). To change the order in which the levels are displayed, pass the levels= argument a vector of all the possible values of the variable in the order you desire.
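
As a minimal sketch of the levels= argument on its own, the example below reorders the levels of the stock factor without making it ordered. The variable name `partial` is hypothetical and only used for illustration.

```r
stock_vector <- c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap")

# Custom display order for the levels; the factor itself is still unordered
stock_factor <- factor(stock_vector,
                       levels = c("small-cap", "mid-cap", "large-cap"))
levels(stock_factor)
is.ordered(stock_factor)   # FALSE

# Any value that is not listed in levels= becomes NA
partial <- factor(stock_vector, levels = c("small-cap", "large-cap"))
is.na(partial)             # only the mid-cap entry (4th) is TRUE
```

This is why levels= should list every value that can occur in the data, not just the ones whose position you care about.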

Factors can be unordered or ordered. For example, we can consider the gender factor (Male and Female) to be an unordered factor, as it is not important which value comes first. Other categories have an order associated with them. In our stock factor, for example, we may want the levels ordered by market capitalization (small-cap being the smallest and large-cap being the largest). If the ordering should also be used when performing comparisons, use the optional ordered=TRUE argument. In this case, the factor is known as an ordered factor.

We can now update our factor to have pre-defined levels and set ordered=TRUE.

    # Convert the stock vector to an ordered factor
    stock_factor <- factor(stock_vector, ordered=TRUE, levels=c("small-cap", "mid-cap", "large-cap"))
    # Print the stock_factor
    stock_factor

Results:

    > #Print the stock_factor
    > stock_factor
    [1] large-cap small-cap large-cap mid-cap   small-cap
    Levels: small-cap < mid-cap < large-cap
    >

Changing Levels

Sometimes you have a factor whose levels you want to rename, either for clarity or to relate them to something else in your model. In R, you can do so by assigning new names with the levels() function. Let's say that our original factor contained the letters L, M and S to represent the three types of stocks. We can change the levels to large-cap, mid-cap and small-cap using the levels() function. The new names are matched to the existing levels by position, so they must be given in the same order as the current levels.

    # The following vector classifies 5 stocks
    stock_vector <- c("L", "S", "L", "M", "S")
    # Convert the stock vector to an ordered factor
    stock_factor <- factor(stock_vector, ordered=TRUE, levels=c("S", "M", "L"))
    # Rename the levels (same order as the existing levels S, M, L)
    levels(stock_factor) <- c("small-cap", "mid-cap", "large-cap")
    # Print the stock_factor
    stock_factor

In the results, you will have new levels applied to the factor.

    > #Print the stock_factor
    > stock_factor
    [1] large-cap small-cap large-cap mid-cap   small-cap
    Levels: small-cap < mid-cap < large-cap
    >
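
As an alternative sketch, the same renaming can be done in one step when the factor is created, using the labels= argument of factor(). This is not the approach used in the example above, just another base R option; the labels are paired with levels= by position.

```r
stock_vector <- c("L", "S", "L", "M", "S")

# levels= names the values found in the data, labels= names how they are shown
stock_factor <- factor(stock_vector,
                       levels  = c("S", "M", "L"),
                       labels  = c("small-cap", "mid-cap", "large-cap"),
                       ordered = TRUE)

as.character(stock_factor)
# "large-cap" "small-cap" "large-cap" "mid-cap" "small-cap"
```

Either way, the underlying integer codes are unchanged; only the names attached to them differ.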

Summarize a Factor

We can use the summary() function to summarize the contents of the factor variable. As you can see, it prints a quick snapshot of how many stocks of each type you have in your portfolio.

    > #Summarize stock_factor
    > summary(stock_factor)
    small-cap   mid-cap large-cap 
            2         1         2 
    >
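
A related sketch, assuming the same ordered stock factor: base R's table() returns the same counts per level, and prop.table() turns them into shares. The variable name `counts` is hypothetical.

```r
stock_factor <- factor(c("large-cap", "small-cap", "large-cap", "mid-cap", "small-cap"),
                       levels = c("small-cap", "mid-cap", "large-cap"),
                       ordered = TRUE)

# Count of stocks in each category, reported in level order
counts <- table(stock_factor)
counts

# Share of the portfolio in each category: 0.4, 0.2, 0.4
prop.table(counts)
```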

Use of Ordered Factor

In R, the most visible effect of using an ordered rather than an unordered factor is in how the output is printed. Apart from this, ordering and levels can be important in linear modelling, where the first level of a factor is typically used as the baseline level. We will learn about these use cases when we learn linear modelling. Here, we will take a simple example to understand the use of ordering.

Let's say you have five stock traders in your team, and you have their performance evaluated as "Poor", "Average", and "Good". The following R script shows how we can compare the performances of these traders.

    > # The following vector rates the performance of 5 traders
    > performance_vector <- c("Good","Average","Poor","Poor","Good")
    > 
    > # Convert the performance vector to an ordered factor
    > performance_factor <- factor(performance_vector, ordered=TRUE, levels=c("Poor","Average","Good"))
    > 
    > # Print the performance_factor
    > performance_factor
    [1] Good    Average Poor    Poor    Good   
    Levels: Poor < Average < Good
    > 
    > # Summarize performance_factor
    > summary(performance_factor)
       Poor Average    Good 
          2       1       2 
    > 
    > # Performance value of 2nd and 4th trader
    > pv2 <- performance_factor[2]
    > pv4 <- performance_factor[4]
    > 
    > # Is trader 2 better than trader 4?
    > pv2 > pv4
    [1] TRUE
    >

Note that if the factor were not ordered, this comparison would not work: R returns NA and gives you a warning message that the comparison operator '>' is not meaningful for factors. Once you set ordered=TRUE, R recognizes the comparison operator.
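
The difference can be checked directly. The sketch below builds the same performance data twice, once unordered and once ordered; the names `unordered_pf`, `ordered_pf` and `result_unordered` are hypothetical. The warning is suppressed only so the script runs quietly.

```r
performance_vector <- c("Good", "Average", "Poor", "Poor", "Good")
perf_levels <- c("Poor", "Average", "Good")

unordered_pf <- factor(performance_vector, levels = perf_levels)
ordered_pf   <- factor(performance_vector, levels = perf_levels, ordered = TRUE)

# Unordered: '>' is not meaningful, so the result is NA (plus a warning)
result_unordered <- suppressWarnings(unordered_pf[2] > unordered_pf[4])
is.na(result_unordered)          # TRUE

# Ordered: trader 2 (Average) is rated above trader 4 (Poor)
ordered_pf[2] > ordered_pf[4]    # TRUE

# An ordered factor can also be compared against a level name
ordered_pf >= "Average"          # TRUE TRUE FALSE FALSE TRUE
```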

Statistical Modelling

One of the most important uses of factors is in statistical modelling. Since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modelling functions will treat such data correctly. (Note: Categorical variables differ from continuous variables in that a categorical variable can take on only a limited number of categories, while a continuous variable can take an infinite number of values.)
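
As a rough illustration of how a factor enters a model, the sketch below uses base R's model.matrix() to show the indicator columns that a modelling function would build from an unordered factor. The data and the names `cap`, `design` and `cap_relevel` are invented for this example. It assumes R's default treatment contrasts, under which the first level acts as the baseline and gets no column of its own.

```r
# Hypothetical data, made up purely for illustration
cap <- factor(c("small-cap", "mid-cap", "large-cap", "small-cap", "large-cap"),
              levels = c("small-cap", "mid-cap", "large-cap"))

# One indicator column per non-baseline level; small-cap is the baseline
design <- model.matrix(~ cap)
colnames(design)   # "(Intercept)" "capmid-cap" "caplarge-cap"

# relevel() picks a different baseline by moving that level to the front
cap_relevel <- relevel(cap, ref = "large-cap")
levels(cap_relevel)   # "large-cap" "small-cap" "mid-cap"
```

Note that relevel() applies to unordered factors, and that ordered factors use a different default coding (polynomial contrasts), so the baseline interpretation above applies to the unordered case.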
