Data Cleaning Using tidyr Package in R
Data analysis or Data preparation is the major task (or) plays an important role for decision making. It is said that about 70% of data analysis is spent on cleaning and structure/formatting the data.
It can be repeated many times over the analysis until we get meaningful insights from the data. To get a handle on the problems, the below representation focuses mainly on cleaning of the data.
tidyr package was released on May 2017 and it will work with R (>= 3.1.0 version).
Installation and Importing the Packages into R
install.packages("tidyr") ## Package installation step run only once library(tidyr) ## importing package into R
- gather(): Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.
- spread(): Spread a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes "long" data wider.
- separate():Turns a single character column into multiple columns.
- unite(): Convenience function to paste multiple columns into one.
gather() takes multiple columns as input which are selected and transforms them into key-value pairs. It makes "wide" data longer. Here’s an example of how you might use
gather() on a sample dataset. In this experiment we've given sample data for people who applied for loan in a bank.
personalDetails <- data.frame( name = c("Shankar", "Shashi"), age = c(30,45), job_status = c("employed","unemployed"), education = c("masters","tertiary") )
A dataframe with the name
personalDetails is created in R. We can view the data by using the command -
> personalDetails name age job_status education 1 Shankar 30 employed masters 2 Shashi 45 unemployed tertiary >
To know the structure of the data frame, use command "str(personalDetails)".
In the above table (we call it a dataframe in R) if we want to convert row data to columns data as key-value pair , the
gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.
Applying gather() Function
personalDetails1 = gather(personalDetails,"job_status","education",c(3,4)) colnames (personalDetails1) <- c("name","age","key","value")
gather() Function Definition
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
- data: data in the function indicates a dataframe to be passed
- key, value: As mentioned in the diagram , select any two columns which can be created as Names of key and value colmns to create in output.
- na.rm: If Value = TRUE , it removes all the columns which contains NA.
- convert: If TRUE will automatically run type.converton the key column. This is useful if the column names are actually numeric, integer, or logical.
- factor_key: If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.
Gather() Function – Stock Data Example
Let us consider a sample dataframe of stocks. In the dataframe, we have 3 stocks A, B, and C with their date-wise stock closing prices.
stocks <- data.frame( Date = as.Date('2009-01-01') + 0:9, A = runif(10, 100, 145), B = runif(10, 350, 560), C = runif(10, 34, 89) )
> stocks Date A B C 1 2009-01-01 110.7399 534.6632 48.14461 2 2009-01-02 110.5815 362.2626 47.68798 3 2009-01-03 123.9181 379.7840 79.47164 4 2009-01-04 100.2430 520.9548 42.94518 5 2009-01-05 103.4479 505.4492 64.06210 6 2009-01-06 125.2794 528.5937 58.91392 7 2009-01-07 137.5178 404.4673 39.73828 8 2009-01-08 136.7050 512.0356 62.33032 9 2009-01-09 101.5091 407.6105 54.02697 10 2009-01-10 143.8108 500.7482 75.65964 >
This dataset represents a good use case for the gather() function. All the three stocks, A, B, and C currently have their own columns. However, for meaningful analysis, it would be ideal to have one column called Stock and another column called Prices and then each row containing the observations.
This can be done as follows:
stockg <- gather(stocks,"stock","price",-Date)
The resulting table will look as follows:
> stockg Date stock price 1 2009-01-01 A 110.73995 2 2009-01-02 A 110.58154 3 2009-01-03 A 123.91811 4 2009-01-04 A 100.24299 5 2009-01-05 A 103.44793 6 2009-01-06 A 125.27943 7 2009-01-07 A 137.51782 8 2009-01-08 A 136.70504 9 2009-01-09 A 101.50905 10 2009-01-10 A 143.81076 11 2009-01-01 B 534.66320 12 2009-01-02 B 362.26264 13 2009-01-03 B 379.78404 14 2009-01-04 B 520.95483 ... ... 30 2009-01-10 C 75.65964
The spread function spreads a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes "long" data wider.
Applying spread() Function
We can use our stock prices data to demonstrate the spread() function. Consider the long-form data that we got after applying the gather function. We can bring it back to its original form using the spread function.
> stocksm <- spread(stockg,stock, price) > stocksm Date A B C 1 2009-01-01 110.7399 534.6632 48.14461 2 2009-01-02 110.5815 362.2626 47.68798 3 2009-01-03 123.9181 379.7840 79.47164 4 2009-01-04 100.2430 520.9548 42.94518 5 2009-01-05 103.4479 505.4492 64.06210 6 2009-01-06 125.2794 528.5937 58.91392 7 2009-01-07 137.5178 404.4673 39.73828 8 2009-01-08 136.7050 512.0356 62.33032 9 2009-01-09 101.5091 407.6105 54.02697 10 2009-01-10 143.8108 500.7482 75.65964 >
Alternatively, we could also spread it with another column such as Date which will create a column for each date.
stocksm <- spread(stockg,Date, price)
Spread() Function Definition
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
- data: A data frame.
- key:: The bare (unquoted) name of the column whose values will be used as column headings.
- value: The bare (unquoted) name of the column whose values will populate the cells.
- fill: If set, missing values will be replaced with this value. Note that there are two types of missingness in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren't present. Both types of missing value will be replaced by fill.
- convert: If TRUE, type.convert with asis = TRUE will be run on each of the new columns. This is useful if the value column was a mix of variables that was coerced to a string. If the class of the value column was factor or date, note that will not be true of the new columns that are produced, which are coerced to character before type conversion.
- drop: If FALSE, will keep factor levels that don't appear in the data, filling in missing combinations with fill.
- sep: If NULL, the column names will be taken from the values of key variable. If non-NULL, the column names will be given by .
separate() turns a single character column into multiple columns. Given either regular expression or a vector of character positions.
separate() turns a single character column into multiple columns.
In some cases a single column contains multiple variables in the data separated by a special character, if we want to split them into two columns we use
df <- data.frame(x = c("koti.minnun", "AB.Diwillars", "ST.Joesph")) df = separate(df,Names, c("firstName", "lastName"))
>separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
- data: A data frame.
- col: Bare column name.
- into: Names of new variables to create as character vector.
- sep: Separator between columns.
If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string. The length of sep should be one less than into.
- remove: If TRUE, remove input column from output data frame.
- convert: If TRUE, will run type.convert with as.is = TRUE on new columns. This is useful if the component columns are integer, numeric or logical.
- extra: If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:
- "warn" (the default): emit a warning and drop extra values.
- "drop": drop any extra values without a warning.
- "merge": only splits at most length(into) times.
- fill: If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options:
- "warn" (the default): emit a warning and fill from the right
- "right": fill with missing values on the right
- "left": fill with missing values on the left
- ...: Defunct, will be removed in the next version of the package.
unite() combines multiple columns into single column. It is convenience function to paste together multiple columns into one.
This function is opposite of
separate(), which merges multiple columns into single column .
Consider our earlier example:
df <- data.frame(Names = c("koti.minnun", "AB.Diwillars", "ST.Joesph")) df = separate(df,Names, c("firstName", "lastName")) ##Separate function df = unite(df,"Name",firstName,lastName,sep="_") ## unite function
unite() does the opposite of
separate(). It combines multiple columns into a single column.
unite(data, col, ..., sep = "_", remove = TRUE)
- data: A data frame.
- col: (Bare) name of column to add
- ...: Specification of columns to unite. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y. For more options, see the select documentation.
- sep: Separator to use between values.
- remove: If TRUE, remove input columns from output data frame.