Data Cleaning Using tidyr Package in R

Data analysis or Data preparation is the major task (or) plays an important role for decision making. It is said that about 70% of data analysis is spent on cleaning and structure/formatting the data.

It can be repeated many times over the analysis until we get meaningful insights from the data. To get a handle on the problems, the below representation focuses mainly on cleaning of the data.

R Dependencies

The tidyr package was released on May 2017 and it will work with R (>= 3.1.0 version).

Installation and Importing the Packages into R

1install.packages("tidyr")  ## Package installation step run only once
2library(tidyr)  ## importing  package into R
3

tidyr Functions

  • gather(): Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.
  • spread(): Spread a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes "long" data wider.
  • separate():Turns a single character column into multiple columns.
  • unite(): Convenience function to paste multiple columns into one.

gather() Function

gather() takes multiple columns as input which are selected and transforms them into key-value pairs. It makes "wide" data longer. Here’s an example of how you might use gather() on a sample dataset. In this experiment we've given sample data for people who applied for loan in a bank.

1personalDetails <- data.frame(
2  name = c("Shankar", "Shashi"),
3  age = c(30,45),
4  job_status = c("employed","unemployed"),
5  education = c("masters","tertiary")
6)
7

A dataframe with the name personalDetails is created in R. We can view the data by using the command - View(personalDetails).

1> personalDetails
2     name age job_status education
31 Shankar  30   employed   masters
42  Shashi  45 unemployed  tertiary
5> 
6

To know the structure of the data frame, use command "str(personalDetails)".

In the above table (we call it a dataframe in R) if we want to convert row data to columns data as key-value pair , the gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.

Applying gather() Function

1
2personalDetails1 = gather(personalDetails,"job_status","education",c(3,4))
3colnames (personalDetails1) <- c("name","age","key","value")
4

gather() Function Definition

1gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
2
  • data: data in the function indicates a dataframe to be passed
  • key, value: As mentioned in the diagram , select any two columns which can be created as Names of key and value colmns to create in output.
  • na.rm: If Value = TRUE , it removes all the columns which contains NA.
  • convert: If TRUE will automatically run type.converton the key column. This is useful if the column names are actually numeric, integer, or logical.
  • factor_key: If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.

Gather() Function – Stock Data Example

Let us consider a sample dataframe of stocks. In the dataframe, we have 3 stocks A, B, and C with their date-wise stock closing prices.

1
2stocks <- data.frame(
3  Date = as.Date('2009-01-01') + 0:9,
4  A = runif(10, 100, 145),
5  B = runif(10, 350, 560),
6  C = runif(10, 34, 89)
7)
8
1
2> stocks
3         Date        A        B        C
41  2009-01-01 110.7399 534.6632 48.14461
52  2009-01-02 110.5815 362.2626 47.68798
63  2009-01-03 123.9181 379.7840 79.47164
74  2009-01-04 100.2430 520.9548 42.94518
85  2009-01-05 103.4479 505.4492 64.06210
96  2009-01-06 125.2794 528.5937 58.91392
107  2009-01-07 137.5178 404.4673 39.73828
118  2009-01-08 136.7050 512.0356 62.33032
129  2009-01-09 101.5091 407.6105 54.02697
1310 2009-01-10 143.8108 500.7482 75.65964
14> 
15

This dataset represents a good use case for the gather() function. All the three stocks, A, B, and C currently have their own columns. However, for meaningful analysis, it would be ideal to have one column called Stock and another column called Prices and then each row containing the observations.

This can be done as follows:

stockg <- gather(stocks,"stock","price",-Date)

The resulting table will look as follows:

1
2> stockg
3         Date stock     price
41  2009-01-01     A 110.73995
52  2009-01-02     A 110.58154
63  2009-01-03     A 123.91811
74  2009-01-04     A 100.24299
85  2009-01-05     A 103.44793
96  2009-01-06     A 125.27943
107  2009-01-07     A 137.51782
118  2009-01-08     A 136.70504
129  2009-01-09     A 101.50905
1310 2009-01-10     A 143.81076
1411 2009-01-01     B 534.66320
1512 2009-01-02     B 362.26264
1613 2009-01-03     B 379.78404
1714 2009-01-04     B 520.95483
18...
19...
2030 2009-01-10     C  75.65964
21

spread()

The spread function spreads a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes "long" data wider.

Applying spread() Function

We can use our stock prices data to demonstrate the spread() function. Consider the long-form data that we got after applying the gather function. We can bring it back to its original form using the spread function.

1
2> stocksm <- spread(stockg,stock, price) > stocksm
3         Date        A        B        C
41  2009-01-01 110.7399 534.6632 48.14461
52  2009-01-02 110.5815 362.2626 47.68798
63  2009-01-03 123.9181 379.7840 79.47164
74  2009-01-04 100.2430 520.9548 42.94518
85  2009-01-05 103.4479 505.4492 64.06210
96  2009-01-06 125.2794 528.5937 58.91392
107  2009-01-07 137.5178 404.4673 39.73828
118  2009-01-08 136.7050 512.0356 62.33032
129  2009-01-09 101.5091 407.6105 54.02697
1310 2009-01-10 143.8108 500.7482 75.65964
14> 
15

Alternatively, we could also spread it with another column such as Date which will create a column for each date.

stocksm <- spread(stockg,Date, price)

Spread() Function Definition

1spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
2
  • data: A data frame.
  • key:: The bare (unquoted) name of the column whose values will be used as column headings.
  • value: The bare (unquoted) name of the column whose values will populate the cells.
  • fill: If set, missing values will be replaced with this value. Note that there are two types of missingness in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren't present. Both types of missing value will be replaced by fill.
  • convert: If TRUE, type.convert with asis = TRUE will be run on each of the new columns. This is useful if the value column was a mix of variables that was coerced to a string. If the class of the value column was factor or date, note that will not be true of the new columns that are produced, which are coerced to character before type conversion.
  • drop: If FALSE, will keep factor levels that don't appear in the data, filling in missing combinations with fill.
  • sep: If NULL, the column names will be taken from the values of key variable. If non-NULL, the column names will be given by .

separate()

separate() turns a single character column into multiple columns. Given either regular expression or a vector of character positions. separate() turns a single character column into multiple columns.

In some cases a single column contains multiple variables in the data separated by a special character, if we want to split them into two columns we use separate() function.

1
2df <- data.frame(x = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
3df = separate(df,Names, c("firstName", "lastName"))
4

separate() Usage

1>separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, 
2extra = "warn", fill = "warn", ...)
3
  • data: A data frame.
  • col: Bare column name.
  • into: Names of new variables to create as character vector.
  • sep: Separator between columns.

If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.

If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string. The length of sep should be one less than into.

  • remove: If TRUE, remove input column from output data frame.
  • convert: If TRUE, will run type.convert with as.is = TRUE on new columns. This is useful if the component columns are integer, numeric or logical.
  • extra: If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:
    • "warn" (the default): emit a warning and drop extra values.
    • "drop": drop any extra values without a warning.
    • "merge": only splits at most length(into) times.
  • fill: If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options:
    • "warn" (the default): emit a warning and fill from the right
    • "right": fill with missing values on the right
    • "left": fill with missing values on the left
  • ...: Defunct, will be removed in the next version of the package.

unite()

unite() combines multiple columns into single column. It is convenience function to paste together multiple columns into one.

This function is opposite of separate(), which merges multiple columns into single column .

Consider our earlier example:

1
2df <- data.frame(Names = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
3df = separate(df,Names, c("firstName", "lastName")) ##Separate function
4df = unite(df,"Name",firstName,lastName,sep="_") ## unite function
5

unite() does the opposite of separate(). It combines multiple columns into a single column.

unite() Usage

unite(data, col, ..., sep = "_", remove = TRUE)
  • data: A data frame.
  • col: (Bare) name of column to add
  • ...: Specification of columns to unite. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y. For more options, see the select documentation.
  • sep: Separator to use between values.
  • remove: If TRUE, remove input columns from output data frame.