SQL Queries from R

In the previous lesson we learned that we can use the dbReadTable() function to import a table into R. However, this can be complicated if our table contains millions of records and we are actually interested only in a sub-set of records that meet a certain condition. In such a scenario, we have two choices:

  1. Import the entire table as a data frame into R and then use R functions such as subset() to select the data that you're interested in. This approach can work but is a problem if the original data table is large.
  2. We can use the dbGetQuery() function which accepts SQL statements specifying what data we want to fetch. For example, to perform a simple aggregate query, there is no need to pull a database table into R and apply an R function to the data frame. Instead, we issue a select statement and retrieve the results table as a data frame.

Example 1

Let's consider our enwrap86_ftstockdata database which has two tables Stocks and Prices. In the following command we fetch all records from the table Prices just for the stock with symbol TECK.

# Load the DBI package
library(DBI)
# Create a database connection
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "enwrap86_ftstockdata", 
                 host = "109.199.108.179", 
                 port = 3306,
                 user = "enwrap86_dataadm",
                 password = "X+8NHCNrFJGR")
# Select all price data for the stock TECK
TECK_Prices <- dbGetQuery(con, "SELECT * FROM Prices WHERE Symbol = 'TECK'")
# Print TECK Prices
TECK_Prices

Example 2

We can perform all kinds of queries including things such as JOIN where we fetch data combined from multiple tables. For example, let's say we want to retrieve the name of the stock (Available in Stocks table) along with it's Maximum Last (closing) prices (From prices table). The common key between the two tables in the 'Symbol field so, we can create a join between the two tables and retrieve the desired data as shown below:

# get the name of each stock along with it's maximum last price
Max_Prices <- dbGetQuery(con, "SELECT s.Name,p.Last as MaxLast FROM Stocks AS s
    LEFT JOIN Prices AS p ON (p.Symbol = s.Symbol)
    WHERE p.Last = (
        SELECT MAX(Last) FROM Prices AS p2
            WHERE p2.Symbol = s.Symbol
    )")
# Print Max Prices
Max_Prices

The result will look as follows:

> # Print Max Prices
> Max_Prices
                    Name MaxLast
1     Teck Resources Ltd   26.25
2       Chemours Company   25.59
3  AK Steel Holding Corp   11.11
4             Kemet Corp    7.12
5   Health Insurance Inn   20.70
6     Nvidia Corporation  111.77
7            Gerdau S.A.    4.05
8     Gold Resource Corp    5.45
9        Envirostarm Inc   15.25
10          Skyline Corp   15.43
>

You may find this query a bit intimidating, but I promise you we will go indepth into SQL in a course dedicated to mastering SQL queries.

Send - Fetch - Clear

When the result table is huge, we may not want to bring it into R in its entirety, but instead fetch the tuples in batches, possibly reducing the batches to simple summaries before requesting the next batch. Instead of dbGetQuery, we use dbSendQuery to fetch results in batches. The DBI package provides functions to keep track of whether the statement produces output, how many rows were affected by the operation, how many rows have been fetched (if statement is a query), and whether there are more rows to fetch.

In the example below, rather than using dbReadTable to pull over the entire Prices table, the dbSendQuery function is used to send the query to the database without retrieving the results. Then, the dbFetch function pulls over tuples in blocks. In this example, the first 50 tuples are retrieved, then the next 200, after which we determine that there are more results to be fetched (dbHasCompleted) and clear the results object (dbClearResult) without bringing over any more tuples from the MySQL database.

ps = dbSendQuery(con, "SELECT * FROM Prices;")
firstBatch = dbFetch(ps, n = 50)
secondBatch = dbFetch(ps, n = 30)
dbHasCompleted(ps)
dbClearResult(ps)

Let's take a relook at what's happening here:

  • The actual prices table contains 190 records. This is a small database so we don't have to fetch it in batches, but for this example, we will use the batch fetching principles which will work really well for large data sets.
  • The dbSendQuery function only submits and synchronously executes the SQL query to the database engine. It does not extract any records.
  • We then use the dbFetch function to actually fetch the records. We execute this function twice, first to fetch the first 50 records and then to fetch the next 30 records.
  • We then use the dbHasCompleted function to check if there still are more records to be fetched. It returns in TRUE/FALSE so you can use it in a loop or condition.
  • Finally, we must call dbClearResult function when you finish fetching the records you need.
  • In addition, the n = −1 assignment for the parameter specifies that all remaining tuples are to fetched. The fetch function converts each attribute in the result set to the corresponding type in R.
  • In addition, dbListResults(con) gives a list of all currently active result set objects for the connection con, and dbGetRowCount(rs) provides a status of the number of rows that have been fetched in the query.

When finished, we free up resources by disconnecting the connection:

dbDisconnect(con)

Please login to view this lesson.

With our free registration, you can access to all the lessons on finance, risk, data analytics and data science for finance professionals.

Sign in free