When you start analyzing a new dataset, the first thing you need to understand is which variables it contains and how they relate to each other. A scatter plot is a good place to start. It is the quickest way to view the relationship between any two variables x and y.
You can create a scatter plot using the generic plot() function in R.
The function itself doesn’t print anything to the console; instead, it draws the plot in the plot window.
The two variables, x and y, can be supplied as two separate vectors or as a dataframe with two columns.
The following example shows the scatter plot created using the cars dataset. The data gives the speed of cars and the distances taken to stop. It has two columns, speed and dist. The first column, speed, becomes the x-axis and the second column, dist, becomes the y-axis.
As you can see, the graph draws a point for each pair of speed and dist. The general observation from the scatter plot is that the higher the speed, the longer the stopping distance.
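The call that produced this figure is not reproduced in the text. A minimal sketch, assuming it was simply plot() applied to the built-in cars data frame, would look like the following. The cor() call is an addition for illustration and is not part of the original example.

```r
# cars is a built-in data frame with two columns: speed and dist
head(cars)

# Data-frame form: the first column goes on the x-axis, the second on the y-axis
plot(cars)

# Two-vector form: same points, but the axis labels become the
# expressions cars$speed and cars$dist
plot(cars$speed, cars$dist)

# A positive correlation is consistent with the upward trend in the plot
cor(cars$speed, cars$dist)
```

Both calls draw the same points; the data-frame form is shorter, while the two-vector form generalizes to columns picked from a larger dataset.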
If the dataset contains more than two columns, the plot() function will draw a matrix of scatter plots, each showing the relationship between one pair of variables.
Let’s take another dataset called whiteside from the MASS package. The dataset contains home insulation data with three variables: Insul, Temp and Gas.
- Insul: A factor, before or after insulation.
- Temp: Purportedly the average outside temperature in degrees Celsius.
- Gas: The weekly gas consumption in 1000s of cubic feet.
If we call the plot() function on this dataset, it will draw multiple scatter plots showing the relationships between all three variables.
> # load the data
> data(whiteside, package="MASS")
> head(whiteside)
   Insul Temp Gas
1 Before -0.8 7.2
2 Before -0.7 6.9
3 Before  0.4 6.4
4 Before  2.5 6.0
5 Before  2.9 5.8
6 Before  3.2 5.8
> # plot the data
> plot(whiteside)
Each scatter plot draws points for two variables. For example, the graph in the lower left corner has Insul on the x-axis and Gas on the y-axis. Similarly, the graph in the bottom row, middle column has Temp on the x-axis and Gas on the y-axis.
The plot() function we used above is a generic function, that is, it changes its behavior depending on the types of the arguments provided and produces different results. We just saw the use of the plot() function in its most basic form. In the following lessons, we will see how we can customize the plots and enhance them using various arguments.
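To make the "generic function" point concrete, here is a small sketch using the whiteside data from above. It assumes the MASS package is installed (it ships with standard R distributions); the particular calls are chosen for illustration and do not come from the original lesson.

```r
# whiteside lives in the MASS package
data(whiteside, package = "MASS")

# A data frame with more than two columns: scatter plot matrix
plot(whiteside)

# Two numeric vectors: a single scatter plot
plot(whiteside$Temp, whiteside$Gas)

# A factor on its own: bar chart of the level counts
plot(whiteside$Insul)

# A formula with a factor on the right-hand side: box plots of Gas per level
plot(Gas ~ Insul, data = whiteside)

# List the plot() methods available in the current session
methods("plot")
```

The same function name produces four different kinds of chart here because R dispatches to a different method for each argument type.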
Load the Cars93 dataset from the MASS package and use the plot() function to draw scatter plots of its variables.
Enhancing a Plot
Now that we know how to create a basic plot using the plot() function, let’s learn how we can enhance the chart in various ways. We will start with a new dataset that contains daily stock returns for five stocks, namely Goldman Sachs, Citi, Apple, Facebook and JC Penney, over a period of one year. The data is provided in CSV format, so we will first load it into R.
Load the Data
I’ve placed the data file in my working directory and then used the read.csv() function to load the data into an R dataframe called ‘stock_returns’.
> getwd()
[1] "C:/Users/Manish/Documents"
> setwd("C:/r-programming/data")
> getwd()
[1] "C:/r-programming/data"
> stock_returns <- read.csv("stock_returns.csv")
> head(stock_returns)
        Date    gs    c  aapl    fb   jcp
1 16/04/2015 -0.44 1.52 -0.48 -0.47 -2.69
2 15/04/2015  1.71 0.91  0.38 -0.98 -2.40
3 14/04/2015  1.09 0.13 -0.43  0.61 -2.66
4 13/04/2015 -0.03 0.44 -0.20  1.18  1.95
5 10/04/2015  0.38 0.58  0.43 -0.16  0.22
6 09/04/2015  1.21 0.46  0.76 -0.13  1.32
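If you do not have the CSV file at hand, the loading step can be tried on just the six rows printed above. The sketch below assumes the file is comma-separated with a header row matching those column names, and that the dates are in day/month/year order; csv_text is a hypothetical name used only for this illustration. Converting the Date column with as.Date() is an optional extra step, not something the original example does.

```r
# The six rows shown by head(stock_returns), written out as CSV text
# (assumed layout: comma-separated, with a header row)
csv_text <- "Date,gs,c,aapl,fb,jcp
16/04/2015,-0.44,1.52,-0.48,-0.47,-2.69
15/04/2015,1.71,0.91,0.38,-0.98,-2.40
14/04/2015,1.09,0.13,-0.43,0.61,-2.66
13/04/2015,-0.03,0.44,-0.20,1.18,1.95
10/04/2015,0.38,0.58,0.43,-0.16,0.22
09/04/2015,1.21,0.46,0.76,-0.13,1.32"

# read.csv() accepts literal text through its text= argument
stock_returns <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Optional: turn the day/month/year strings into proper Date values
stock_returns$Date <- as.Date(stock_returns$Date, format = "%d/%m/%Y")

str(stock_returns)
```

With a real file you would pass the file name to read.csv() instead of text=, exactly as in the transcript above.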
Create a Scatter Plot
If we just call the plot() function on the ‘stock_returns’ dataset, it will draw multiple scatter plots, one for each pair of columns. However, for our use, we will just create a scatter plot of the Goldman Sachs and Citi stock returns. We can do so by supplying the x and y values separately as shown below:
> plot(stock_returns$gs,stock_returns$c)
The resulting plot is shown below:
Add Title and Axis Labels
The scatter plot we created is quite plain. We can make it more readable by adding a title and labels for the x- and y-axes. We will use the following plot() function arguments to do so:
- main argument for the title
- xlab argument for the x-axis
- ylab argument for the y-axis
After adding the title and labels, we can also add a grid to the graph by calling the grid() function after calling the plot() function:
> plot(stock_returns$gs, stock_returns$c, main="Scatter Plot: 1-year Daily Returns", xlab="GoldmanSachs Returns", ylab="Citigroup Returns")
> grid()
In the above example, we first plotted the graph and then added the grid to it, so the grid lines are drawn on top of the points. The alternative (and preferred) method is to call the plot() function with the argument type="n", which sets up the axes, title and labels without drawing the data. Then we call the grid() function to add the grid, and finally call a low-level graphics function such as points() or lines() to overlay the data on the grid.
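The grid-first approach described above can be sketched as follows. To keep the example self-contained, it rebuilds only the gs and c columns from the six rows printed earlier instead of reading the full file, and it uses points() for the overlay because the data is a scatter plot; both choices are assumptions made for illustration.

```r
# First six rows of the gs and c columns, copied from the head() output
stock_returns <- data.frame(
  gs = c(-0.44, 1.71, 1.09, -0.03, 0.38, 1.21),
  c  = c(1.52, 0.91, 0.13, 0.44, 0.58, 0.46)
)

# Step 1: set up axes, title and labels, but draw no data (type = "n")
plot(stock_returns$gs, stock_returns$c, type = "n",
     main = "Scatter Plot: 1-year Daily Returns",
     xlab = "GoldmanSachs Returns",
     ylab = "Citigroup Returns")

# Step 2: draw the grid on the empty plot region
grid()

# Step 3: overlay the points so they sit on top of the grid lines
points(stock_returns$gs, stock_returns$c)
```

The visible difference from calling grid() last is the drawing order: here the points are drawn over the grid lines, not under them.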
Plot a Regression Line
We can add a regression line to this scatter plot of returns for GoldmanSachs and Citigroup as shown below:
1. Perform a linear regression using lm() on the two variables. lm stands for “linear model”
m <- lm(stock_returns$c ~ stock_returns$gs)
2. Draw the scatter plot
plot(stock_returns$c ~ stock_returns$gs, main="Scatter Plot: 1-year Daily Returns", xlab="GoldmanSachs Returns", ylab="Citigroup Returns")
3. Add the regression line using
The graph will now look as follows:
Notice that we wrote the plot call as plot(stock_returns$c ~ stock_returns$gs). The formula form y ~ x keeps the variables in the same order as in the lm() call. The alternative is plot(stock_returns$gs,stock_returns$c), which we used earlier.
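The three steps can be put together as in the sketch below. The text does not name the function used in step 3; this sketch assumes abline(), the base R function that can draw the line of a fitted lm object, so treat that step as one common option and not necessarily the original author's call. The data frame again holds only the six rows printed earlier, and the red line color is an arbitrary choice.

```r
# First six rows of the gs and c columns, copied from the head() output
stock_returns <- data.frame(
  gs = c(-0.44, 1.71, 1.09, -0.03, 0.38, 1.21),
  c  = c(1.52, 0.91, 0.13, 0.44, 0.58, 0.46)
)

# 1. Fit the linear model: Citigroup returns as a function of Goldman Sachs returns
m <- lm(stock_returns$c ~ stock_returns$gs)

# 2. Draw the scatter plot with the same formula, so x and y match the model
plot(stock_returns$c ~ stock_returns$gs,
     main = "Scatter Plot: 1-year Daily Returns",
     xlab = "GoldmanSachs Returns",
     ylab = "Citigroup Returns")

# 3. Add the fitted line (assumed step: abline() reads the intercept
#    and slope from the lm object)
abline(m, col = "red")

# Intercept and slope of the fitted line
coef(m)
```

With only six observations the fitted line is not meaningful as a finding; run the same calls on the full one-year dataset for an actual analysis.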
Load the ‘stock_returns’ dataset into R and create a scatter plot with Apple’s returns on the x-axis and Facebook’s returns on the y-axis. Then add a title, axis labels and a regression line to the plot.