SQL Queries from R
In the previous lesson we learned that we can use the dbReadTable()
function to import a table into R. However, this can be complicated if our table contains millions of records and we are actually interested only in a sub-set of records that meet a certain condition. In such a scenario, we have two choices:
- Import the entire table as a data frame into R and then use R functions such as
subset()
to select the data that you're interested in. This approach can work but is a problem if the original data table is large. - We can use the
dbGetQuery()
function which accepts SQL statements specifying what data we want to fetch. For example, to perform a simple aggregate query, there is no need to pull a database table into R and apply an R function to the data frame. Instead, we issue a select statement and retrieve the results table as a data frame.
Example 1
Let's consider our enwrap86_ftstockdata
database which has two tables Stocks
and Prices
. In the following command we fetch all records from the table Prices
just for the stock with symbol TECK
.
1# Load the DBI package
2library(DBI)
3# Create a database connection
4con <- dbConnect(RMySQL::MySQL(),
5 dbname = "enwrap86_ftstockdata",
6 host = "109.199.108.179",
7 port = 3306,
8 user = "enwrap86_dataadm",
9 password = "X+8NHCNrFJGR")
10# Select all price data for the stock TECK
11TECK_Prices <- dbGetQuery(con, "SELECT * FROM Prices WHERE Symbol = 'TECK'")
12# Print TECK Prices
13TECK_Prices
14
Example 2
We can perform all kinds of queries including things such as JOIN where we fetch data combined from multiple tables. For example, let's say we want to retrieve the name of the stock (Available in Stocks table) along with it's Maximum Last (closing) prices (From prices table). The common key between the two tables in the 'Symbol field so, we can create a join between the two tables and retrieve the desired data as shown below:
1# get the name of each stock along with it's maximum last price
2Max_Prices <- dbGetQuery(con, "SELECT s.Name,p.Last as MaxLast FROM Stocks AS s
3 LEFT JOIN Prices AS p ON (p.Symbol = s.Symbol)
4 WHERE p.Last = (
5 SELECT MAX(Last) FROM Prices AS p2
6 WHERE p2.Symbol = s.Symbol
7 )")
8# Print Max Prices
9Max_Prices
10
The result will look as follows:
1> # Print Max Prices
2> Max_Prices
3 Name MaxLast
41 Teck Resources Ltd 26.25
52 Chemours Company 25.59
63 AK Steel Holding Corp 11.11
74 Kemet Corp 7.12
85 Health Insurance Inn 20.70
96 Nvidia Corporation 111.77
107 Gerdau S.A. 4.05
118 Gold Resource Corp 5.45
129 Envirostarm Inc 15.25
1310 Skyline Corp 15.43
14>
15
You may find this query a bit intimidating, but I promise you we will go indepth into SQL in a course dedicated to mastering SQL queries.
Send - Fetch - Clear
When the result table is huge, we may not want to bring it into R in its entirety, but instead fetch the tuples in batches, possibly reducing the batches to simple summaries before requesting the next batch. Instead of dbGetQuery, we use dbSendQuery to fetch results in batches. The DBI package provides functions to keep track of whether the statement produces output, how many rows were affected by the operation, how many rows have been fetched (if statement is a query), and whether there are more rows to fetch.
In the example below, rather than using dbReadTable
to pull over the entire Prices
table, the dbSendQuery
function is used to send the query to the database without retrieving the results. Then, the dbFetch
function pulls over tuples in blocks. In this example, the first 50 tuples are retrieved, then the next 200, after which we determine that there are more results to be fetched (dbHasCompleted) and clear the results object (dbClearResult) without bringing over any more tuples from the MySQL database.
1ps = dbSendQuery(con, "SELECT * FROM Prices;")
2firstBatch = dbFetch(ps, n = 50)
3secondBatch = dbFetch(ps, n = 30)
4dbHasCompleted(ps)
5dbClearResult(ps)
6
Let's take a relook at what's happening here:
- The actual prices table contains 190 records. This is a small database so we don't have to fetch it in batches, but for this example, we will use the batch fetching principles which will work really well for large data sets.
- The
dbSendQuery
function only submits and synchronously executes the SQL query to the database engine. It does not extract any records. - We then use the
dbFetch
function to actually fetch the records. We execute this function twice, first to fetch the first 50 records and then to fetch the next 30 records. - We then use the
dbHasCompleted
function to check if there still are more records to be fetched. It returns in TRUE/FALSE so you can use it in a loop or condition. - Finally, we must call
dbClearResult
function when you finish fetching the records you need. - In addition, the n = −1 assignment for the parameter specifies that all remaining tuples are to fetched. The fetch function converts each attribute in the result set to the corresponding type in R.
- In addition, dbListResults(con) gives a list of all currently active result set objects for the connection con, and dbGetRowCount(rs) provides a status of the number of rows that have been fetched in the query.
When finished, we free up resources by disconnecting the connection:
dbDisconnect(con)
Unlock Premium Content
Upgrade your account to access the full article, downloads, and exercises.
You'll get access to:
- Access complete tutorials and examples
- Download source code and resources
- Follow along with practical exercises
- Get in-depth explanations