Data Cleaning Using tidyr Package in R
Data analysis or Data preparation is the major task (or) plays an important role for decision making. It is said that about 70% of data analysis is spent on cleaning and structure/formatting the data.
It can be repeated many times over the analysis until we get meaningful insights from the data. To get a handle on the problems, the below representation focuses mainly on cleaning of the data.
R Dependencies
The tidyr
package was released on May 2017 and it will work with R (>= 3.1.0 version).
Installation and Importing the Packages into R
1install.packages("tidyr") ## Package installation step run only once
2library(tidyr) ## importing package into R
3
tidyr
Functions
- gather(): Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.
- spread(): Spread a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes "long" data wider.
- separate():Turns a single character column into multiple columns.
- unite(): Convenience function to paste multiple columns into one.
gather()
Function
gather()
takes multiple columns as input which are selected and transforms them into key-value pairs. It makes "wide" data longer. Here’s an example of how you might use gather()
on a sample dataset. In this experiment we've given sample data for people who applied for loan in a bank.
1personalDetails <- data.frame(
2 name = c("Shankar", "Shashi"),
3 age = c(30,45),
4 job_status = c("employed","unemployed"),
5 education = c("masters","tertiary")
6)
7
A dataframe with the name personalDetails
is created in R. We can view the data by using the command - View(personalDetails)
.
1> personalDetails
2 name age job_status education
31 Shankar 30 employed masters
42 Shashi 45 unemployed tertiary
5>
6
To know the structure of the data frame, use command "str(personalDetails)".
In the above table (we call it a dataframe in R) if we want to convert row data to columns data as key-value pair , the gather()
function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.
Applying gather() Function
1
2personalDetails1 = gather(personalDetails,"job_status","education",c(3,4))
3colnames (personalDetails1) <- c("name","age","key","value")
4
gather() Function Definition
1gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
2
- data: data in the function indicates a dataframe to be passed
- key, value: As mentioned in the diagram , select any two columns which can be created as Names of key and value colmns to create in output.
- na.rm: If Value = TRUE , it removes all the columns which contains NA.
- convert: If TRUE will automatically run type.converton the key column. This is useful if the column names are actually numeric, integer, or logical.
- factor_key: If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.
Gather() Function – Stock Data Example
Let us consider a sample dataframe of stocks. In the dataframe, we have 3 stocks A, B, and C with their date-wise stock closing prices.
1
2stocks <- data.frame(
3 Date = as.Date('2009-01-01') + 0:9,
4 A = runif(10, 100, 145),
5 B = runif(10, 350, 560),
6 C = runif(10, 34, 89)
7)
8
1
2> stocks
3 Date A B C
41 2009-01-01 110.7399 534.6632 48.14461
52 2009-01-02 110.5815 362.2626 47.68798
63 2009-01-03 123.9181 379.7840 79.47164
74 2009-01-04 100.2430 520.9548 42.94518
85 2009-01-05 103.4479 505.4492 64.06210
96 2009-01-06 125.2794 528.5937 58.91392
107 2009-01-07 137.5178 404.4673 39.73828
118 2009-01-08 136.7050 512.0356 62.33032
129 2009-01-09 101.5091 407.6105 54.02697
1310 2009-01-10 143.8108 500.7482 75.65964
14>
15
This dataset represents a good use case for the gather() function. All the three stocks, A, B, and C currently have their own columns. However, for meaningful analysis, it would be ideal to have one column called Stock and another column called Prices and then each row containing the observations.
This can be done as follows:
stockg <- gather(stocks,"stock","price",-Date)
The resulting table will look as follows:
1
2> stockg
3 Date stock price
41 2009-01-01 A 110.73995
52 2009-01-02 A 110.58154
63 2009-01-03 A 123.91811
74 2009-01-04 A 100.24299
85 2009-01-05 A 103.44793
96 2009-01-06 A 125.27943
107 2009-01-07 A 137.51782
118 2009-01-08 A 136.70504
129 2009-01-09 A 101.50905
1310 2009-01-10 A 143.81076
1411 2009-01-01 B 534.66320
1512 2009-01-02 B 362.26264
1613 2009-01-03 B 379.78404
1714 2009-01-04 B 520.95483
18...
19...
2030 2009-01-10 C 75.65964
21
spread()
The spread function spreads a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes "long" data wider.
Applying spread() Function
We can use our stock prices data to demonstrate the spread() function. Consider the long-form data that we got after applying the gather function. We can bring it back to its original form using the spread function.
1
2> stocksm <- spread(stockg,stock, price) > stocksm
3 Date A B C
41 2009-01-01 110.7399 534.6632 48.14461
52 2009-01-02 110.5815 362.2626 47.68798
63 2009-01-03 123.9181 379.7840 79.47164
74 2009-01-04 100.2430 520.9548 42.94518
85 2009-01-05 103.4479 505.4492 64.06210
96 2009-01-06 125.2794 528.5937 58.91392
107 2009-01-07 137.5178 404.4673 39.73828
118 2009-01-08 136.7050 512.0356 62.33032
129 2009-01-09 101.5091 407.6105 54.02697
1310 2009-01-10 143.8108 500.7482 75.65964
14>
15
Alternatively, we could also spread it with another column such as Date which will create a column for each date.
stocksm <- spread(stockg,Date, price)
Spread() Function Definition
1spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
2
- data: A data frame.
- key:: The bare (unquoted) name of the column whose values will be used as column headings.
- value: The bare (unquoted) name of the column whose values will populate the cells.
- fill: If set, missing values will be replaced with this value. Note that there are two types of missingness in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren't present. Both types of missing value will be replaced by fill.
- convert: If TRUE, type.convert with asis = TRUE will be run on each of the new columns. This is useful if the value column was a mix of variables that was coerced to a string. If the class of the value column was factor or date, note that will not be true of the new columns that are produced, which are coerced to character before type conversion.
- drop: If FALSE, will keep factor levels that don't appear in the data, filling in missing combinations with fill.
- sep: If NULL, the column names will be taken from the values of key variable. If non-NULL, the column names will be given by .
separate()
separate()
turns a single character column into multiple columns. Given either regular expression or a vector of character positions. separate()
turns a single character column into multiple columns.
In some cases a single column contains multiple variables in the data separated by a special character, if we want to split them into two columns we use separate()
function.
1
2df <- data.frame(x = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
3df = separate(df,Names, c("firstName", "lastName"))
4
separate()
Usage
1>separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE,
2extra = "warn", fill = "warn", ...)
3
- data: A data frame.
- col: Bare column name.
- into: Names of new variables to create as character vector.
- sep: Separator between columns.
If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string. The length of sep should be one less than into.
- remove: If TRUE, remove input column from output data frame.
- convert: If TRUE, will run type.convert with as.is = TRUE on new columns. This is useful if the component columns are integer, numeric or logical.
- extra: If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:
- "warn" (the default): emit a warning and drop extra values.
- "drop": drop any extra values without a warning.
- "merge": only splits at most length(into) times.
- fill: If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options:
- "warn" (the default): emit a warning and fill from the right
- "right": fill with missing values on the right
- "left": fill with missing values on the left
- ...: Defunct, will be removed in the next version of the package.
unite()
unite()
combines multiple columns into single column. It is convenience function to paste together multiple columns into one.
This function is opposite of separate()
, which merges multiple columns into single column .
Consider our earlier example:
1
2df <- data.frame(Names = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
3df = separate(df,Names, c("firstName", "lastName")) ##Separate function
4df = unite(df,"Name",firstName,lastName,sep="_") ## unite function
5
unite()
does the opposite of separate()
. It combines multiple columns into a single column.
unite()
Usage
unite(data, col, ..., sep = "_", remove = TRUE)
- data: A data frame.
- col: (Bare) name of column to add
- ...: Specification of columns to unite. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y. For more options, see the select documentation.
- sep: Separator to use between values.
- remove: If TRUE, remove input columns from output data frame.