In the previous lesson we learned that we can use the
dbReadTable() function to import a table into R. However, this can be complicated if our table contains millions of records and we are actually interested only in a sub-set of records that meet a certain condition. In such a scenario, we have two choices:
- Import the entire table as a data frame into R and then use R functions such as
subset()to select the data that you’re interested in. This approach can work but is a problem if the original data table is large.
- We can use the
dbGetQuery()function which accepts SQL statements specifying what data we want to fetch. For example, to perform a simple aggregate query, there is no need to pull a database table into R and apply an R function to the data frame. Instead, we issue a select statement and retrieve the results table as a data frame.
Let’s consider our
enwrap86_ftstockdata database which has two tables
Prices. In the following command we fetch all records from the table
Prices just for the stock with symbol
# Load the DBI package library(DBI) # Create a database connection con <- dbConnect(RMySQL::MySQL(), dbname = "enwrap86_ftstockdata", host = "184.108.40.206", port = 3306, user = "enwrap86_dataadm", password = "X+8NHCNrFJGR") # Select all price data for the stock TECK TECK_Prices <- dbGetQuery(con, "SELECT * FROM Prices WHERE Symbol = 'TECK'") # Print TECK Prices TECK_Prices
We can perform all kinds of queries including things such as JOIN where we fetch data combined from multiple tables. For example, let’s say we want to retrieve the name of the stock (Available in Stocks table) along with it’s Maximum Last (closing) prices (From prices table). The common key between the two tables in the ‘Symbol field so, we can create a join between the two tables and retrieve the desired data as shown below:
# get the name of each stock along with it's maximum last price Max_Prices <- dbGetQuery(con, "SELECT s.Name,p.Last as MaxLast FROM Stocks AS s LEFT JOIN Prices AS p ON (p.Symbol = s.Symbol) WHERE p.Last = ( SELECT MAX(Last) FROM Prices AS p2 WHERE p2.Symbol = s.Symbol )") # Print Max Prices Max_Prices
The result will look as follows:
> # Print Max Prices > Max_Prices Name MaxLast 1 Teck Resources Ltd 26.25 2 Chemours Company 25.59 3 AK Steel Holding Corp 11.11 4 Kemet Corp 7.12 5 Health Insurance Inn 20.70 6 Nvidia Corporation 111.77 7 Gerdau S.A. 4.05 8 Gold Resource Corp 5.45 9 Envirostarm Inc 15.25 10 Skyline Corp 15.43 >
You may find this query a bit intimidating, but I promise you we will go indepth into SQL in a course dedicated to mastering SQL queries.
Send – Fetch – Clear
When the result table is huge, we may not want to bring it into R in its entirety, but instead fetch the tuples in batches, possibly reducing the batches to simple summaries before requesting the next batch. Instead of dbGetQuery, we use dbSendQuery to fetch results in batches. The DBI package provides functions to keep track of whether the statement produces output, how many rows were affected by the operation, how many rows have been fetched (if statement is a query), and whether there are more rows to fetch.
In the example below, rather than using
dbReadTable to pull over the entire
Prices table, the
dbSendQuery function is used to send the query to the database without retrieving the results. Then, the
dbFetch function pulls over tuples in blocks. In this example, the first 50 tuples are retrieved, then the next 200, after which we determine that there are more results to be fetched (dbHasCompleted) and clear the results object (dbClearResult) without bringing over any more tuples from the MySQL database.
ps = dbSendQuery(con, "SELECT * FROM Prices;") firstBatch = dbFetch(ps, n = 50) secondBatch = dbFetch(ps, n = 30) dbHasCompleted(ps) dbClearResult(ps)
Let’s take a relook at what’s happening here:
- The actual prices table contains 190 records. This is a small database so we don’t have to fetch it in batches, but for this example, we will use the batch fetching principles which will work really well for large data sets.
dbSendQueryfunction only submits and synchronously executes the SQL query to the database engine. It does not extract any records.
- We then use the
dbFetchfunction to actually fetch the records. We execute this function twice, first to fetch the first 50 records and then to fetch the next 30 records.
- We then use the
dbHasCompletedfunction to check if there still are more records to be fetched. It returns in TRUE/FALSE so you can use it in a loop or condition.
- Finally, we must call
dbClearResultfunction when you finish fetching the records you need.
- In addition, the n = −1 assignment for the parameter specifies that all remaining tuples are to fetched. The fetch function converts each attribute in the result set to the corresponding type in R.
- In addition, dbListResults(con) gives a list of all currently active result set objects for the connection con, and dbGetRowCount(rs) provides a status of the number of rows that have been fetched in the query.
When finished, we free up resources by disconnecting the connection: