Get full access to all Data Science, Machine Learning, and AI courses built for finance professionals.
One-time payment - Lifetime access
Or create a free account to start
A step-by-step guide covering Python, SQL, analytics, and finance applications.
Or create a free account to access more
Get full access to all Data Science, Machine Learning, and AI courses built for finance professionals.
One-time payment - Lifetime access
Or create a free account to start
A step-by-step guide covering Python, SQL, analytics, and finance applications.
Or create a free account to access more
Data analysis or Data preparation is the major task (or) plays an important role for decision making. It is said that about 70% of data analysis is spent on cleaning and structure/formatting the data.
It can be repeated many times over the analysis until we get meaningful insights from the data. To get a handle on the problems, the below representation focuses mainly on cleaning of the data.
The tidyr package was released on May 2017 and it will work with R (>= 3.1.0 version).
1install.packages("tidyr") ## Package installation step run only once
2library(tidyr) ## importing package into R
3tidyr Functionsgather() Functiongather() takes multiple columns as input which are selected and transforms them into key-value pairs. It makes "wide" data longer. Here’s an example of how you might use gather() on a sample dataset. In this experiment we've given sample data for people who applied for loan in a bank.
1personalDetails <- data.frame(
2 name = c("Shankar", "Shashi"),
3 age = c(30,45A dataframe with the name personalDetails is created in R. We can view the data by using the command - View(personalDetails).
1> personalDetails
2 name age job_status education
31 Shankar 30 employed masters
42 Shashi 45 unemployed tertiary
5>
6To know the structure of the data frame, use command "str(personalDetails)".
In the above table (we call it a dataframe in R) if we want to convert row data to columns data as key-value pair , the gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.
1
2personalDetails1 = gather(personalDetails,"job_status","education",c(3,4))
3colnames (personalDetails1) <
1gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
2Let us consider a sample dataframe of stocks. In the dataframe, we have 3 stocks A, B, and C with their date-wise stock closing prices.
1
2stocks <- data.frame(
3 Date = as.Date('2009-01-01') + 0:9,
1
2> stocks
3 Date A B C
41 2009-01-01 110.7399 534.6632 48.14461
5This dataset represents a good use case for the gather() function. All the three stocks, A, B, and C currently have their own columns. However, for meaningful analysis, it would be ideal to have one column called Stock and another column called Prices and then each row containing the observations.
This can be done as follows:
stockg <- gather(stocks,"stock","price",-Date)
The resulting table will look as follows:
1
2> stockg
3 Date stock price
41 2009-01-01 A 110.73995
52 2009-The spread function spreads a key-value pair across multiple columns. It takes two columns (key & value) and spreads into multiple columns. It makes "long" data wider.
We can use our stock prices data to demonstrate the spread() function. Consider the long-form data that we got after applying the gather function. We can bring it back to its original form using the spread function.
1
2> stocksm <- spread(stockg,stock, price) > stocksm
3 Date A B C
41 2009-01Alternatively, we could also spread it with another column such as Date which will create a column for each date.
stocksm <- spread(stockg,Date, price)

1spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
2separate() turns a single character column into multiple columns. Given either regular expression or a vector of character positions. separate() turns a single character column into multiple columns.
In some cases a single column contains multiple variables in the data separated by a special character, if we want to split them into two columns we use separate() function.
1
2df <- data.frame(x = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
3df = separate(
separate() Usage1>separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE,
2extra = "warn", fill If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string. The length of sep should be one less than into.
unite() combines multiple columns into single column. It is convenience function to paste together multiple columns into one.
This function is opposite of separate(), which merges multiple columns into single column .
Consider our earlier example:
1
2df <- data.frame(Names = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
3df = separate(
unite() does the opposite of separate(). It combines multiple columns into a single column.
unite() Usageunite(data, col, ..., sep = "_", remove = TRUE)