Data Cleaning Using tidyr Package in R | Finance Train

Data Cleaning Using tidyr Package in R

July 3, 2017

Data preparation plays an important role in data analysis and decision making. It is often said that about 70% of the time in a data analysis is spent cleaning, structuring and formatting the data.

Cleaning is often repeated many times over the course of an analysis, until the data yields meaningful insights. This article focuses on that cleaning step and on the tidyr functions that support it.

R Dependencies

The tidyr release discussed here came out in May 2017 and works with R version 3.1.0 or later.

Installation and Importing the Packages into R

install.packages("tidyr")  ## package installation step; run only once
library(tidyr)  ## load the package into the R session
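Since the installation step only needs to run once, a script can guard it. The following is a minimal sketch, assuming a CRAN mirror is already configured; the variable names `tidyr_version` and `r_version` are illustrative and not part of the original article.

```r
# Install tidyr only if it is not already available, then attach it.
if (!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}
library(tidyr)

# Report the installed tidyr version and the running R version,
# so the R (>= 3.1.0) requirement can be checked by eye.
tidyr_version <- packageVersion("tidyr")
r_version <- getRversion()
print(tidyr_version)
print(r_version)
```

Guarding the call this way keeps the script safe to re-run without reinstalling the package each time.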

> stocks
         Date        A        B        C
1  2009-01-01 110.7399 534.6632 48.14461
2  2009-01-02 110.5815 362.2626 47.68798
3  2009-01-03 123.9181 379.7840 79.47164
4  2009-01-04 100.2430 520.9548 42.94518
5  2009-01-05 103.4479 505.4492 64.06210
6  2009-01-06 125.2794 528.5937 58.91392
7  2009-01-07 137.5178 404.4673 39.73828
8  2009-01-08 136.7050 512.0356 62.33032
9  2009-01-09 101.5091 407.6105 54.02697
10 2009-01-10 143.8108 500.7482 75.65964
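The article prints the `stocks` table but does not show how it was created. To follow along, one option is to rebuild it from the printed values. This is a sketch under two assumptions: the `Date` column is stored as class `Date` (the original class is not shown), and the prices are the rounded values as printed, not the full-precision originals.

```r
# Rebuild the wide stocks table from the values printed in the article.
# Assumption: Date is a Date vector; the original may have used another class.
stocks <- data.frame(
  Date = as.Date("2009-01-01") + 0:9,
  A = c(110.7399, 110.5815, 123.9181, 100.2430, 103.4479,
        125.2794, 137.5178, 136.7050, 101.5091, 143.8108),
  B = c(534.6632, 362.2626, 379.7840, 520.9548, 505.4492,
        528.5937, 404.4673, 512.0356, 407.6105, 500.7482),
  C = c(48.14461, 47.68798, 79.47164, 42.94518, 64.06210,
        58.91392, 39.73828, 62.33032, 54.02697, 75.65964)
)

# Inspect the structure: 10 observations of 4 variables.
str(stocks)
```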

stockg <- gather(stocks, "stock", "price", -Date)  ## stack columns A, B and C into stock/price pairs

> stockg
         Date stock     price
1  2009-01-01     A 110.73995
2  2009-01-02     A 110.58154
3  2009-01-03     A 123.91811
4  2009-01-04     A 100.24299
5  2009-01-05     A 103.44793
6  2009-01-06     A 125.27943
7  2009-01-07     A 137.51782
8  2009-01-08     A 136.70504
9  2009-01-09     A 101.50905
10 2009-01-10     A 143.81076
11 2009-01-01     B 534.66320
12 2009-01-02     B 362.26264
13 2009-01-03     B 379.78404
14 2009-01-04     B 520.95483
...
...
30 2009-01-10     C  75.65964
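The reshaping above can be reproduced on a smaller scale. The sketch below uses a hypothetical three-day, two-stock sample (the data frame `wide` and its result `long` are illustrative names, with values borrowed from the first rows of the printed table) to show how `gather()` stacks every column except `Date` into key/value pairs.

```r
library(tidyr)

# Hypothetical three-day sample in the same wide layout as the stocks table.
wide <- data.frame(
  Date = as.Date("2009-01-01") + 0:2,
  A = c(110.7399, 110.5815, 123.9181),
  B = c(534.6632, 362.2626, 379.7840)
)

# Stack every column except Date: the old column names go into `stock`,
# the cell values go into `price`.
long <- gather(wide, "stock", "price", -Date)

# Each of the 2 stock columns contributes 3 rows to the long table.
print(dim(long))
print(long)
```

The `-Date` argument excludes `Date` from the gathering, so it is repeated once per stock as an identifier column.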

> stocksm <- spread(stockg, stock, price)
> stocksm
         Date        A        B        C
1  2009-01-01 110.7399 534.6632 48.14461
2  2009-01-02 110.5815 362.2626 47.68798
3  2009-01-03 123.9181 379.7840 79.47164
4  2009-01-04 100.2430 520.9548 42.94518
5  2009-01-05 103.4479 505.4492 64.06210
6  2009-01-06 125.2794 528.5937 58.91392
7  2009-01-07 137.5178 404.4673 39.73828
8  2009-01-08 136.7050 512.0356 62.33032
9  2009-01-09 101.5091 407.6105 54.02697
10 2009-01-10 143.8108 500.7482 75.65964
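The spread output matches the original wide table, which suggests that `spread()` undoes `gather()` when given the same key and value columns. A small round-trip check can make this concrete. This is a sketch on a hypothetical three-day sample; `wide`, `long` and `wide_again` are illustrative names.

```r
library(tidyr)

# Hypothetical three-day sample in wide form.
wide <- data.frame(
  Date = as.Date("2009-01-01") + 0:2,
  A = c(110.7399, 110.5815, 123.9181),
  B = c(534.6632, 362.2626, 379.7840)
)

# Wide -> long -> wide again.
long <- gather(wide, "stock", "price", -Date)
wide_again <- spread(long, stock, price)

# Compare column names and values of the round-tripped table.
print(identical(names(wide), names(wide_again)))
print(all.equal(wide$A, wide_again$A))
print(all.equal(wide$B, wide_again$B))
```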

stocksm <- spread(stockg, Date, price)  ## one row per stock, one column per date
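Spreading on `Date` instead of `stock` transposes the layout: each stock becomes a row and each date becomes a column. The sketch below illustrates this on a hypothetical sample; dates are stored as character strings here (an assumption made so the resulting column names are plain text), and `by_stock` is an illustrative name.

```r
library(tidyr)

# Hypothetical three-day sample; Date kept as character for readable column names.
wide <- data.frame(
  Date = c("2009-01-01", "2009-01-02", "2009-01-03"),
  A = c(110.7399, 110.5815, 123.9181),
  B = c(534.6632, 362.2626, 379.7840),
  stringsAsFactors = FALSE
)

long <- gather(wide, "stock", "price", -Date)

# Use Date as the key: one row per stock, one column per date.
by_stock <- spread(long, Date, price)

print(dim(by_stock))
print(names(by_stock))
```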

df <- data.frame(Names = c("koti.minnun", "AB.Diwillars", "ST.Joesph"))
df <- separate(df, Names, c("firstName", "lastName"))  ## separate function
df <- unite(df, "Name", firstName, lastName, sep = "_")  ## unite function
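The `separate()` call above relies on the default separator, which splits on any run of non-alphanumeric characters. Stating the separator explicitly can make the intent clearer. The following sketch uses the article's names; the intermediate names `parts` and `joined` are illustrative.

```r
library(tidyr)

df <- data.frame(
  Names = c("koti.minnun", "AB.Diwillars", "ST.Joesph"),
  stringsAsFactors = FALSE
)

# Split on a literal dot; sep is a regular expression, so the dot is escaped.
parts <- separate(df, Names, c("firstName", "lastName"), sep = "\\.")
print(parts)

# Join the two columns back into one, using an underscore between them.
joined <- unite(parts, "Name", firstName, lastName, sep = "_")
print(joined)
```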

unite(data, col, ..., sep = "_", remove = TRUE)
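In the signature above, `remove = TRUE` means the input columns are dropped once they have been combined. A short sketch comparing both settings, using a hypothetical `people` data frame (the names `people`, `dropped` and `kept` are illustrative):

```r
library(tidyr)

# Hypothetical input with the two columns to be combined.
people <- data.frame(
  firstName = c("koti", "AB"),
  lastName = c("minnun", "Diwillars"),
  stringsAsFactors = FALSE
)

# Default (remove = TRUE): only the combined column remains.
dropped <- unite(people, "Name", firstName, lastName, sep = "_")
print(names(dropped))

# remove = FALSE: the combined column is added and the inputs are kept.
kept <- unite(people, "Name", firstName, lastName, sep = "_", remove = FALSE)
print(names(kept))
```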

Data Cleaning Using tidyr Package in R

R Dependencies

Installation and Importing the Packages into R

tidyr Functions

gather() Function

Applying gather() Function

gather() Function Definition

gather() Function – Stock Data Example

spread()

Applying spread() Function

spread() Function Definition

separate()

separate() Usage

unite()

unite() Usage
