Importing Data from a Database in R

Most companies use relational databases to store their data and it is important for a data scientist to know how they can import data from these data bases into R. Some examples of popular relational databases include MySQL, MS SQL Server, and Oracle. Luckily for us, the R community has created R packages for almost every database that we can use to connect to and retrieve information from them.

In simplest terms, a relational database is one that presents information in tables with rows and columns. A table is referred to as a relation in the sense that it is a collection of objects of the same type (rows). Data in a table can be related according to common keys or concepts, and the ability to retrieve related data from a table is the basis for the term relational database.

In this lesson, we will take an example of a MySQL database. We will connect to the database, fetch the list of tables in the database and then learn how to import data from these tables.

Our Sample Database

Our database is named enwrap86_ftstockdata and contains two tables:

1. Stocks: This table contains stock information that does not change frequently, such as the company name and its earnings per share (EPS).

2. Prices: This table contains stock data, accumulated over a period of time. In this table, we record the following information for each stock: its daily opening and closing price per share, its daily high and low price per share, and the volume of shares traded.

Install The Database Connector Package

Depending on the database you're connecting to, you will need to install the appropriate package. Some examples of R packages are:

  • MySQL - RMySQL
  • PostgresSQL - RPostgresSQL
  • Oracle Database - ROracle

Since we're using MySQL database, we will install RMySQL package. How you interact with the database, that is which R functions you use to access and manipulate the data is specified in another R package called DBI. In more technical terms, DBI is the interface and RMySQL is the implementation. Installing a database package such as RMySQL will also automatically install the DBI package. The RMySQL package will be installed using the install.packages("RMySQL") command.

> install.packages("RMySQL")

Establish a Connection

Now that you have the right packages, you can load the package in the R session and establish a connection with the database. Loading just the DBI package is sufficient.

We use dbConnect() to create a connection between the R session and a SQL database. The function takes many arguments in order to do so. The first argument is the DBIdriver object for the specific database. For MySQL databases, we will use RMySQL::MySQL(). Each driver will define what other arguments are required, e.g., "dbname" for the database name, "username", and "password".

For connecting to a remote MySQL database you'll specify the following arguments in dbConnect(): dbname, host, port, user and password. We have all these details for our database:

dbname = "enwrap86_ftstockdata" host = "109.199.108.179" port = 3306 user = "dbusername" password = "*********"

Using all these details, we can now create a dbConnect() call to connect to the database. We will store the connection into a variable called con.

# Load the DBI package
library(DBI)
# Create a database connection
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "enwrap86_ftstockdata", 
                 host = "109.199.108.179", 
                 port = 3306,
                 user = "enwrap86_dataadm",
                 password = "X+8NHCNrFJGR")

Once this command is executed successfully, we will have a successful database connection.

List DB Tables

You can fetch a list of all tables in the database using the dbListTables() function. As an argument, you will pass the con connection object.

> dbListTables(con)
[1] "Prices" "Stocks"

As you can see, it returns the two table names that we had introduced earlier.

Read Data from a Table

We can now use the dbReadTable() function to read the data from these tables. The function takes two arguments 1) the connection object and 2) the table name.

In the following example, we import the data from the 'Stocks' table.

> dbReadTable(con,'Stocks')
   Symbol                  Name MarketCap     PE  EPS NetIncome Beta Dividend DivYield
1    TECK    Teck Resources Ltd        13 102.04 0.24        -1 1.49     0.08     0.31
2      CC      Chemours Company         3  22.56 0.97       -90 3.45     0.12     0.55
3     AKS AK Steel Holding Corp         3  21.33 0.51      -509 2.66     0.00     0.00
4     KEM            Kemet Corp       296  32.05 0.20       -54 3.40     0.00     0.00
5    HIIQ  Health Insurance Inn       253  19.77 0.87         1 0.67     0.00     0.00
6    NVDA    Nvidia Corporation        54  50.62 2.01       614 1.31     0.56     0.55
7     GGB           Gerdau S.A.         6  91.25 0.04        -1 2.33     0.01     0.33
8    GORO    Gold Resource Corp       306  54.10 0.10         3 0.32     0.02     0.37
9     EVI       Envirostarm Inc       142  52.78 0.27         2 0.92     0.00     0.00
10    SKY          Skyline Corp       123  34.93 0.42         2 2.44     0.00     0.00

The resulting data is a data frame. We could store this in a variable and use it for further data manipulation.

In the following example, we import the "Prices" table from the finance_stockdata database and store the resulting data frame as prices. We can then chec the structure of the prices object or do anything we want to with this data.

> prices <-dbReadTable(con,'Prices')
> str(prices)

Import All Tables

We can also import all tables from the databse into a list using the lapply() function.

# Store table names in 'stock_data'
stock_data <- dbListTables(con)
# Import all tables into 'data_tables'
data_tables <- lapply(stock_data, dbReadTable, conn = con)
# Print out tables
data_tables

The first argument in lapply() is the table names. The second argument is the function you want to execute for each of these table names, in this case dbReadTable. The third argument is the connection object. You can read more about the lapply() function from R help using help(lapply).

The DBI package also specifies functions to create new tables, add new data into these tables, and to remove tables. However, since our focus in on how to import data, we will not get into these other functions.

Disconnect the Database

Once you're done with your work of importing data and other database related activities, it is important to disconnect the database using the dbDisconnect() function. The function will execute and return TRUE if disconnection was successful.

> dbDisconnect(con)
[1] TRUE

If you try to print the 'con' object now, you will see that it is no longer available.

> con
Error in .local(dbObj, ...) : 
  internal error in RS_DBI_getConnection: corrupt connection handle

Finance Train members can download the sql file for the database used in the lesson below:

Lesson Resources