Data Cleaning Using tidyr Package in R

Data analysis or Data preparation is the major task (or) plays an important role for decision making. It is said that about 70% of data analysis is spent on cleaning and structure/formatting the data.

It can be repeated many times over the analysis until we get meaningful insights from the data. To get a handle on the problems, the below representation focuses mainly on cleaning of the data.

R Dependencies

The tidyr package was released on May 2017 and it will work with R (>= 3.1.0 version).

Installation and Importing the Packages into R

install.packages("tidyr")  ## Package installation step run only once
library(tidyr)  ## importing  package into R

tidyr Functions

  • gather(): Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.
  • spread(): Spread a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes "long" data wider.
  • separate():Turns a single character column into multiple columns.
  • unite(): Convenience function to paste multiple columns into one.

gather() Function

gather() takes multiple columns as input which are selected and transforms them into key-value pairs. It makes "wide" data longer. Here’s an example of how you might use gather() on a sample dataset. In this experiment we've given sample data for people who applied for loan in a bank.

personalDetails <- data.frame(
  name = c("Shankar", "Shashi"),
  age = c(30,45),
  job_status = c("employed","unemployed"),
  education = c("masters","tertiary")
)

A dataframe with the name personalDetails is created in R. We can view the data by using the command - View(personalDetails).

> personalDetails
     name age job_status education
1 Shankar  30   employed   masters
2  Shashi  45 unemployed  tertiary
> 

To know the structure of the data frame, use command "str(personalDetails)".

In the above table (we call it a dataframe in R) if we want to convert row data to columns data as key-value pair , the gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.

Applying gather() Function


personalDetails1 = gather(personalDetails,"job_status","education",c(3,4))
colnames (personalDetails1) <- c("name","age","key","value")

gather() Function Definition

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
  • data: data in the function indicates a dataframe to be passed
  • key, value: As mentioned in the diagram , select any two columns which can be created as Names of key and value colmns to create in output.
  • na.rm: If Value = TRUE , it removes all the columns which contains NA.
  • convert: If TRUE will automatically run type.converton the key column. This is useful if the column names are actually numeric, integer, or logical.
  • factor_key: If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.

Gather() Function – Stock Data Example

Let us consider a sample dataframe of stocks. In the dataframe, we have 3 stocks A, B, and C with their date-wise stock closing prices.


stocks <- data.frame(
  Date = as.Date('2009-01-01') + 0:9,
  A = runif(10, 100, 145),
  B = runif(10, 350, 560),
  C = runif(10, 34, 89)
)

> stocks
         Date        A        B        C
1  2009-01-01 110.7399 534.6632 48.14461
2  2009-01-02 110.5815 362.2626 47.68798
3  2009-01-03 123.9181 379.7840 79.47164
4  2009-01-04 100.2430 520.9548 42.94518
5  2009-01-05 103.4479 505.4492 64.06210
6  2009-01-06 125.2794 528.5937 58.91392
7  2009-01-07 137.5178 404.4673 39.73828
8  2009-01-08 136.7050 512.0356 62.33032
9  2009-01-09 101.5091 407.6105 54.02697
10 2009-01-10 143.8108 500.7482 75.65964
> 

This dataset represents a good use case for the gather() function. All the three stocks, A, B, and C currently have their own columns. However, for meaningful analysis, it would be ideal to have one column called Stock and another column called Prices and then each row containing the observations.

This can be done as follows:


stockg <- gather(stocks,"stock","price",-Date)

The resulting table will look as follows:


> stockg
         Date stock     price
1  2009-01-01     A 110.73995
2  2009-01-02     A 110.58154
3  2009-01-03     A 123.91811
4  2009-01-04     A 100.24299
5  2009-01-05     A 103.44793
6  2009-01-06     A 125.27943
7  2009-01-07     A 137.51782
8  2009-01-08     A 136.70504
9  2009-01-09     A 101.50905
10 2009-01-10     A 143.81076
11 2009-01-01     B 534.66320
12 2009-01-02     B 362.26264
13 2009-01-03     B 379.78404
14 2009-01-04     B 520.95483
...
...
30 2009-01-10     C  75.65964

spread()

The spread function spreads a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes "long" data wider.

Applying spread() Function

We can use our stock prices data to demonstrate the spread() function. Consider the long-form data that we got after applying the gather function. We can bring it back to its original form using the spread function.


> stocksm <- spread(stockg,stock, price) > stocksm
         Date        A        B        C
1  2009-01-01 110.7399 534.6632 48.14461
2  2009-01-02 110.5815 362.2626 47.68798
3  2009-01-03 123.9181 379.7840 79.47164
4  2009-01-04 100.2430 520.9548 42.94518
5  2009-01-05 103.4479 505.4492 64.06210
6  2009-01-06 125.2794 528.5937 58.91392
7  2009-01-07 137.5178 404.4673 39.73828
8  2009-01-08 136.7050 512.0356 62.33032
9  2009-01-09 101.5091 407.6105 54.02697
10 2009-01-10 143.8108 500.7482 75.65964
> 

Alternatively, we could also spread it with another column such as Date which will create a column for each date.


stocksm <- spread(stockg,Date, price)

Spread() Function Definition

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
  • data: A data frame.
  • key:: The bare (unquoted) name of the column whose values will be used as column headings.
  • value: The bare (unquoted) name of the column whose values will populate the cells.
  • fill: If set, missing values will be replaced with this value. Note that there are two types of missingness in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren't present. Both types of missing value will be replaced by fill.
  • convert: If TRUE, type.convert with asis = TRUE will be run on each of the new columns. This is useful if the value column was a mix of variables that was coerced to a string. If the class of the value column was factor or date, note that will not be true of the new columns that are produced, which are coerced to character before type conversion.
  • drop: If FALSE, will keep factor levels that don't appear in the data, filling in missing combinations with fill.
  • sep: If NULL, the column names will be taken from the values of key variable. If non-NULL, the column names will be given by .

separate()

separate() turns a single character column into multiple columns. Given either regular expression or a vector of character positions. separate() turns a single character column into multiple columns.

In some cases a single column contains multiple variables in the data separated by a special character, if we want to split them into two columns we use separate() function.


df <- data.frame(x = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
df = separate(df,Names, c("firstName", "lastName"))

separate() Usage

>separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, 
extra = "warn", fill = "warn", ...)
  • data: A data frame.
  • col: Bare column name.
  • into: Names of new variables to create as character vector.
  • sep: Separator between columns.

If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.

If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string. The length of sep should be one less than into.

  • remove: If TRUE, remove input column from output data frame.
  • convert: If TRUE, will run type.convert with as.is = TRUE on new columns. This is useful if the component columns are integer, numeric or logical.
  • extra: If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:
    • "warn" (the default): emit a warning and drop extra values.
    • "drop": drop any extra values without a warning.
    • "merge": only splits at most length(into) times.
  • fill: If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options:
    • "warn" (the default): emit a warning and fill from the right
    • "right": fill with missing values on the right
    • "left": fill with missing values on the left
  • ...: Defunct, will be removed in the next version of the package.

unite()

unite() combines multiple columns into single column. It is convenience function to paste together multiple columns into one.

This function is opposite of separate(), which merges multiple columns into single column .

Consider our earlier example:


df <- data.frame(Names = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
df = separate(df,Names, c("firstName", "lastName")) ##Separate function
df = unite(df,"Name",firstName,lastName,sep="_") ## unite function

unite() does the opposite of separate(). It combines multiple columns into a single column.

unite() Usage

unite(data, col, ..., sep = "_", remove = TRUE)
  • data: A data frame.
  • col: (Bare) name of column to add
  • ...: Specification of columns to unite. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y. For more options, see the select documentation.
  • sep: Separator to use between values.
  • remove: If TRUE, remove input columns from output data frame.

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book includes PDFs, explanations, instructions, data files, and R code for all examples.

Get the Bundle for $39 (Regular $57)
JOIN 30,000 DATA PROFESSIONALS

Free Guides - Getting Started with R and Python

Enter your name and email address below and we will email you the guides for R programming and Python.

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book comes with PDFs, detailed explanations, step-by-step instructions, data files, and complete downloadable R code for all examples.