Importing Data from a Database in R
Most companies use relational databases to store their data and it is important for a data scientist to know how they can import data from these data bases into R. Some examples of popular relational databases include MySQL, MS SQL Server, and Oracle. Luckily for us, the R community has created R packages for almost every database that we can use to connect to and retrieve information from them.
In simplest terms, a relational database is one that presents information in tables with rows and columns. A table is referred to as a relation in the sense that it is a collection of objects of the same type (rows). Data in a table can be related according to common keys or concepts, and the ability to retrieve related data from a table is the basis for the term relational database.
In this lesson, we will take an example of a MySQL database. We will connect to the database, fetch the list of tables in the database and then learn how to import data from these tables.
Our Sample Database
Our database is named enwrap86_ftstockdata
and contains two tables:
1. Stocks: This table contains stock information that does not change frequently, such as the company name and its earnings per share (EPS).
2. Prices: This table contains stock data, accumulated over a period of time. In this table, we record the following information for each stock: its daily opening and closing price per share, its daily high and low price per share, and the volume of shares traded.
Install The Database Connector Package
Depending on the database you're connecting to, you will need to install the appropriate package. Some examples of R packages are:
- MySQL - RMySQL
- PostgresSQL - RPostgresSQL
- Oracle Database - ROracle
Since we're using MySQL database, we will install RMySQL package. How you interact with the database, that is which R functions you use to access and manipulate the data is specified in another R package called DBI. In more technical terms, DBI is the interface and RMySQL is the implementation. Installing a database package such as RMySQL will also automatically install the DBI package. The RMySQL package will be installed using the install.packages("RMySQL")
command.
> install.packages("RMySQL")
Establish a Connection
Now that you have the right packages, you can load the package in the R session and establish a connection with the database. Loading just the DBI
package is sufficient.
We use dbConnect()
to create a connection between the R session and a SQL database. The function takes many arguments in order to do so. The first argument is the DBIdriver object for the specific database. For MySQL databases, we will use RMySQL::MySQL(). Each driver will define what other arguments are required, e.g., "dbname" for the database name, "username", and "password".
For connecting to a remote MySQL database you'll specify the following arguments in dbConnect(): dbname, host, port, user and password. We have all these details for our database:
dbname = "enwrap86_ftstockdata" host = "109.199.108.179" port = 3306 user = "dbusername" password = "*********"
Using all these details, we can now create a dbConnect()
call to connect to the database. We will store the connection into a variable called con
.
1# Load the DBI package
2library(DBI)
3# Create a database connection
4con <- dbConnect(RMySQL::MySQL(),
5 dbname = "enwrap86_ftstockdata",
6 host = "109.199.108.179",
7 port = 3306,
8 user = "enwrap86_dataadm",
9 password = "X+8NHCNrFJGR")
10
Once this command is executed successfully, we will have a successful database connection.
List DB Tables
You can fetch a list of all tables in the database using the dbListTables()
function. As an argument, you will pass the con
connection object.
1> dbListTables(con)
2[1] "Prices" "Stocks"
3
As you can see, it returns the two table names that we had introduced earlier.
Read Data from a Table
We can now use the dbReadTable()
function to read the data from these tables. The function takes two arguments 1) the connection object and 2) the table name.
In the following example, we import the data from the 'Stocks' table.
1> dbReadTable(con,'Stocks')
2 Symbol Name MarketCap PE EPS NetIncome Beta Dividend DivYield
31 TECK Teck Resources Ltd 13 102.04 0.24 -1 1.49 0.08 0.31
42 CC Chemours Company 3 22.56 0.97 -90 3.45 0.12 0.55
53 AKS AK Steel Holding Corp 3 21.33 0.51 -509 2.66 0.00 0.00
64 KEM Kemet Corp 296 32.05 0.20 -54 3.40 0.00 0.00
75 HIIQ Health Insurance Inn 253 19.77 0.87 1 0.67 0.00 0.00
86 NVDA Nvidia Corporation 54 50.62 2.01 614 1.31 0.56 0.55
97 GGB Gerdau S.A. 6 91.25 0.04 -1 2.33 0.01 0.33
108 GORO Gold Resource Corp 306 54.10 0.10 3 0.32 0.02 0.37
119 EVI Envirostarm Inc 142 52.78 0.27 2 0.92 0.00 0.00
1210 SKY Skyline Corp 123 34.93 0.42 2 2.44 0.00 0.00
13
The resulting data is a data frame. We could store this in a variable and use it for further data manipulation.
In the following example, we import the "Prices" table from the finance_stockdata
database and store the resulting data frame as prices
. We can then chec the structure of the prices
object or do anything we want to with this data.
1> prices <-dbReadTable(con,'Prices')
2> str(prices)
3
Import All Tables
We can also import all tables from the databse into a list using the lapply()
function.
1# Store table names in 'stock_data'
2stock_data <- dbListTables(con)
3# Import all tables into 'data_tables'
4data_tables <- lapply(stock_data, dbReadTable, conn = con)
5# Print out tables
6data_tables
7
The first argument in lapply()
is the table names. The second argument is the function you want to execute for each of these table names, in this case dbReadTable
. The third argument is the connection object. You can read more about the lapply()
function from R help using help(lapply)
.
The DBI package also specifies functions to create new tables, add new data into these tables, and to remove tables. However, since our focus in on how to import data, we will not get into these other functions.
Disconnect the Database
Once you're done with your work of importing data and other database related activities, it is important to disconnect the database using the dbDisconnect()
function. The function will execute and return TRUE if disconnection was successful.
1> dbDisconnect(con)
2[1] TRUE
3
If you try to print the 'con' object now, you will see that it is no longer available.
1> con
2Error in .local(dbObj, ...) :
3 internal error in RS_DBI_getConnection: corrupt connection handle
4
Finance Train members can download the sql file for the database used in the lesson below:
Unlock Premium Content
Upgrade your account to access the full article, downloads, and exercises.
You'll get access to:
- Access complete tutorials and examples
- Download source code and resources
- Follow along with practical exercises
- Get in-depth explanations