Most companies use relational databases to store their data and it is important for a data scientist to know how they can import data from these data bases into R. Some examples of popular relational databases include MySQL, MS SQL Server, and Oracle. Luckily for us, the R community has created R packages for almost every database that we can use to connect to and retrieve information from them.
In simplest terms, a relational database is one that presents information in tables with rows and columns. A table is referred to as a relation in the sense that it is a collection of objects of the same type (rows). Data in a table can be related according to common keys or concepts, and the ability to retrieve related data from a table is the basis for the term relational database.
In this lesson, we will take an example of a MySQL database. We will connect to the database, fetch the list of tables in the database and then learn how to import data from these tables.
Our Sample Database
Our database is named enwrap86_ftstockdata and contains two tables:
1. Stocks: This table contains stock information that does not change frequently, such as the company name and its earnings per share (EPS).
2. Prices: This table contains stock data, accumulated over a period of time. In this table, we record the following information for each stock: its daily opening and closing price per share, its daily high and low price per share, and the volume of shares traded.
Install The Database Connector Package
Depending on the database you're connecting to, you will need to install the appropriate package. Some examples of R packages are:
MySQL - RMySQL
PostgresSQL - RPostgresSQL
Oracle Database - ROracle
Since we're using MySQL database, we will install RMySQL package. How you interact with the database, that is which R functions you use to access and manipulate the data is specified in another R package called DBI. In more technical terms, DBI is the interface and RMySQL is the implementation. Installing a database package such as RMySQL will also automatically install the DBI package. The RMySQL package will be installed using the install.packages("RMySQL") command.
> install.packages("RMySQL")
Establish a Connection
Now that you have the right packages, you can load the package in the R session and establish a connection with the database. Loading just the DBI package is sufficient.
We use dbConnect() to create a connection between the R session and a SQL database. The function takes many arguments in order to do so. The first argument is the DBIdriver object for the specific database. For MySQL databases, we will use RMySQL::MySQL(). Each driver will define what other arguments are required, e.g., "dbname" for the database name, "username", and "password".
For connecting to a remote MySQL database you'll specify the following arguments in dbConnect(): dbname, host, port, user and password. We have all these details for our database:
dbname = "enwrap86_ftstockdata"
host = "109.199.108.179"
port = 3306
user = "dbusername"
password = "*********"
Using all these details, we can now create a dbConnect() call to connect to the database. We will store the connection into a variable called con.
1# Load the DBI package2library(DBI)3# Create a database connection4con <- dbConnect(RMySQL::MySQL(),5 dbname ="enwrap86_ftstockdata",6 host ="109.199.108.179",7 port =3306,8 user ="enwrap86_dataadm",9 password ="X+8NHCNrFJGR")10
Once this command is executed successfully, we will have a successful database connection.
List DB Tables
You can fetch a list of all tables in the database using the dbListTables() function. As an argument, you will pass the con connection object.
1> dbListTables(con)2[1]"Prices""Stocks"3
As you can see, it returns the two table names that we had introduced earlier.
Read Data from a Table
We can now use the dbReadTable() function to read the data from these tables. The function takes two arguments 1) the connection object and 2) the table name.
In the following example, we import the data from the 'Stocks' table.
1> dbReadTable(con,'Stocks')2 Symbol Name MarketCap PE EPS NetIncome Beta Dividend DivYield
31 TECK Teck Resources Ltd 13102.040.24-11.490.080.3142 CC Chemours Company 322.560.97-903.450.120.5553 AKS AK Steel Holding Corp 321.330.51-5092.660.000.0064 KEM Kemet Corp 29632.050.20-543.400.000.0075 HIIQ Health Insurance Inn 25319.770.8710.670.000.0086 NVDA Nvidia Corporation 5450.622.016141.310.560.5597 GGB Gerdau S.A.691.250.04-12.330.010.33108 GORO Gold Resource Corp 30654.100.1030.320.020.37119 EVI Envirostarm Inc 14252.780.2720.920.000.001210 SKY Skyline Corp 12334.930.4222.440.000.0013
The resulting data is a data frame. We could store this in a variable and use it for further data manipulation.
In the following example, we import the "Prices" table from the finance_stockdata database and store the resulting data frame as prices. We can then chec the structure of the prices object or do anything we want to with this data.
We can also import all tables from the databse into a list using the lapply() function.
1# Store table names in 'stock_data'2stock_data <- dbListTables(con)3# Import all tables into 'data_tables'4data_tables <- lapply(stock_data, dbReadTable, conn = con)5# Print out tables6data_tables
7
The first argument in lapply() is the table names. The second argument is the function you want to execute for each of these table names, in this case dbReadTable. The third argument is the connection object. You can read more about the lapply() function from R help using help(lapply).
The DBI package also specifies functions to create new tables, add new data into these tables, and to remove tables. However, since our focus in on how to import data, we will not get into these other functions.
Disconnect the Database
Once you're done with your work of importing data and other database related activities, it is important to disconnect the database using the dbDisconnect() function. The function will execute and return TRUE if disconnection was successful.
1> dbDisconnect(con)2[1] TRUE
3
If you try to print the 'con' object now, you will see that it is no longer available.
1> con
2Error in.local(dbObj,...):3 internal error in RS_DBI_getConnection: corrupt connection handle
4
Finance Train members can download the sql file for the database used in the lesson below:
Continue Reading
Premium Content
This tutorial is a part of the course R Programming for Data Science. This is a premium course. The purchase options for the course are provided below. With this course, you get access to complete course content, source code, practical exercises, and all resources that are a part of the course.
Lifetime Premium Membership
$250
$179
Get unlimited access to all courses and premium content