How to Create a Scatter Plot in R
When you start analyzing a new dataset, one of the first things you need to understand is which variables it contains and how they relate to each other. A scatter plot is a good place to start: it is one of the quickest ways to see the relationship between two variables, x and y.
You can create a scatter plot using the generic plot() function in R.
plot(x,y)
The function doesn't print anything to the console; instead, it draws the plot in the plot window.
The two variables, x and y, can be passed as two separate vectors, or you can pass a single data frame with two columns.
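As a quick sketch of the two calling styles, the snippet below uses the built-in cars dataset and draws the same scatter plot twice, once from two vectors and once from a two-column data frame. The variable names x and y are only illustrative.

```r
# Two equivalent ways to call plot(), using the built-in cars dataset
x <- cars$speed   # numeric vector for the x-axis
y <- cars$dist    # numeric vector for the y-axis

plot(x, y)                             # two separate vectors
plot(data.frame(speed = x, dist = y))  # one data frame with two columns
```

Both calls should produce the same set of points; only the default axis labels differ.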
The following example shows a scatter plot created from the built-in cars dataset, which records the speed of cars and the distance taken to stop. It has two columns, speed and dist. The first column, speed, becomes the x-axis and the second column, dist, becomes the y-axis.
plot(cars)
As you can see, the graph draws a point for each pair of speed and dist values. The general observation from the scatter plot is that the higher the speed, the longer the stopping distance.
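One way to put a number on this visual impression is the correlation coefficient. The sketch below uses cor() from base R, which this lesson does not otherwise cover; a value close to 1 would indicate a strong positive linear relationship.

```r
# Pearson correlation between speed and stopping distance in cars
r <- cor(cars$speed, cars$dist)
print(round(r, 2))
```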
If the dataset contains more than two columns, the plot() function draws a matrix of scatter plots, one for each pair of variables.
Let's take another dataset called whiteside from the MASS package. It contains home insulation data with three variables, namely Insul, Temp and Gas.
- Insul: A factor, before or after insulation.
- Temp: Purportedly the average outside temperature in degrees Celsius.
- Gas: The weekly gas consumption in 1000s of cubic feet.
If we call the plot() function on this dataset, it draws a matrix of scatter plots showing the relationship between each pair of the three variables.
> #load the data
> data(whiteside,package="MASS")
> head(whiteside)
Insul Temp Gas
1 Before -0.8 7.2
2 Before -0.7 6.9
3 Before 0.4 6.4
4 Before 2.5 6.0
5 Before 2.9 5.8
6 Before 3.2 5.8
> #plot the data
> plot(whiteside)
Each scatter plot draws points for two variables. For example, the graph in the lower left corner has Insul on the x-axis and Gas on the y-axis. Similarly, the graph in the bottom row, middle column has Temp on the x-axis and Gas on the y-axis.
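If only one panel of the matrix is of interest, it can be drawn on its own. The sketch below assumes the MASS package is installed and uses the formula interface of plot(), with the y variable on the left of ~. Coloring the points by Insul is an illustrative addition, not part of the lesson.

```r
# Load whiteside from MASS without attaching the whole package
data(whiteside, package = "MASS")

# Draw just the Temp-versus-Gas panel: Gas on the y-axis, Temp on the x-axis
plot(Gas ~ Temp, data = whiteside)

# Same panel, with points colored by the Insul factor (Before/After)
plot(Gas ~ Temp, data = whiteside, col = as.integer(whiteside$Insul))
```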
The plot() function we used above is a generic function, that is, it changes its behavior depending on the class of the object passed to it and produces different results. We just saw the plot() function in its most basic form. In the following lessons, we will see how to customize and enhance plots using various arguments.
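To illustrate this generic behavior, the sketch below passes objects of different classes to plot(). It again assumes the MASS package is available; the comments describe what base R draws for each kind of input.

```r
data(whiteside, package = "MASS")

plot(whiteside$Temp)                  # numeric vector: values plotted against their index
plot(whiteside$Insul)                 # factor: bar chart of counts per level
plot(whiteside$Insul, whiteside$Gas)  # factor and numeric: box plot of Gas per level
plot(whiteside)                       # data frame with 3 columns: scatter plot matrix
```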
Exercise
Load the Cars93 dataset from the MASS package and use the plot() function to draw scatter plots of its variables.
Enhancing a Plot
Now that we know how to create a basic plot using the plot() function, let's learn how we can enhance the chart in various ways. We will start with a new dataset that contains one year of daily stock returns for five stocks, namely Goldman Sachs, Citi, Apple, Facebook and JC Penney. The data is provided in csv format, so we will first load it into R.
Load the Data
I've placed the data file in my working directory and used the read.csv() function to load the data into an R data frame called 'stock_returns'.
> getwd()
[1] "C:/Users/Manish/Documents"
> setwd("C:/r-programming/data")
> getwd()
[1] "C:/r-programming/data"
> stock_returns <- read.csv("stock_returns.csv")
> head(stock_returns)
Date gs c aapl fb jcp
1 16/04/2015 -0.44 1.52 -0.48 -0.47 -2.69
2 15/04/2015 1.71 0.91 0.38 -0.98 -2.40
3 14/04/2015 1.09 0.13 -0.43 0.61 -2.66
4 13/04/2015 -0.03 0.44 -0.20 1.18 1.95
5 10/04/2015 0.38 0.58 0.43 -0.16 0.22
6 09/04/2015 1.21 0.46 0.76 -0.13 1.32
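The Date column in this file is stored as text in day/month/year form. If dates are needed later, they can be converted with as.Date(). The sketch below is self-contained: instead of reading the csv file from disk, it passes the six rows printed above to read.csv() through its text argument, assuming the file is comma-separated with the header shown. The name sample_csv is purely illustrative.

```r
# The six rows shown above, supplied inline so the example runs without the file
sample_csv <- "Date,gs,c,aapl,fb,jcp
16/04/2015,-0.44,1.52,-0.48,-0.47,-2.69
15/04/2015,1.71,0.91,0.38,-0.98,-2.40
14/04/2015,1.09,0.13,-0.43,0.61,-2.66
13/04/2015,-0.03,0.44,-0.20,1.18,1.95
10/04/2015,0.38,0.58,0.43,-0.16,0.22
09/04/2015,1.21,0.46,0.76,-0.13,1.32"

stock_returns <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# Convert the text dates (dd/mm/yyyy) into Date objects
stock_returns$Date <- as.Date(stock_returns$Date, format = "%d/%m/%Y")

str(stock_returns)
```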
Create a Scatter Plot
If we just call the plot() function on the 'stock_returns' dataset, it will draw a matrix of scatter plots, one for each pair of columns. For our purposes, however, we only want a scatter plot of the Goldman Sachs and Citi stock returns. We can create it by supplying the x and y values separately, as shown below:
> plot(stock_returns$gs,stock_returns$c)
The resulting plot is shown below:
Add Title and Axis Labels
The scatter plot we created is quite plain. We can make it more readable by adding a title and labels for the x and y axes. We will use the following plot() function arguments to do so:
- The main argument for the title
- The xlab argument for the x-axis
- The ylab argument for the y-axis
After adding the title and labels, we can also add a grid to the graph by calling the grid() function after calling the plot() function.
> plot(stock_returns$gs,stock_returns$c, main="Scatter Plot: 1-year Daily Returns", xlab="GoldmanSachs Returns", ylab="Citigroup Returns")
> grid()
In the above example, we first plotted the graph and then added the grid to it, so the grid lines are drawn on top of the points. The alternative (and preferred) method is to call the plot() function with the argument type="n", which sets up the axes and labels but does not draw the data. Then we call the grid() function to add the grid, and finally call a low-level graphics function such as points() or lines() to overlay the data on the grid.
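The grid-first approach can be sketched as follows. To keep the example runnable without the csv file, it uses the built-in cars dataset as a stand-in for the stock returns; the title and axis labels are illustrative.

```r
# 1. Set up the axes, title and labels without drawing any points
plot(cars$speed, cars$dist, type = "n",
     main = "Stopping Distance vs. Speed",
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# 2. Draw the grid first so it sits behind the data
grid()

# 3. Overlay the points on top of the grid
points(cars$speed, cars$dist)
```

With the lesson's data, the same three steps would apply with stock_returns$gs and stock_returns$c in place of the cars columns.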
Plot a Regression Line
We can add a regression line to this scatter plot of returns for Goldman Sachs and Citigroup as follows:
1. Perform a linear regression using lm() on the two variables. lm stands for "linear model".
m <- lm(stock_returns$c ~ stock_returns$gs)
2. Draw the scatter plot
plot(stock_returns$c ~ stock_returns$gs, main="Scatter Plot: 1-year Daily Returns", xlab="GoldmanSachs Returns", ylab="Citigroup Returns")
3. Add the regression line using the abline() function.
abline(m)
The graph will now look as follows:
Notice that we defined the plot as plot(stock_returns$c ~ stock_returns$gs). This keeps the order of variables the same as in the lm() function, where the y variable is on the left of ~ and the x variable is on the right. The alternative is plot(stock_returns$gs,stock_returns$c), which we used earlier.
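A slightly tidier variant passes the data frame through the data argument, so the column names do not need a prefix, and reads the fitted intercept and slope with coef(). The sketch below uses the built-in cars dataset as a stand-in so that it runs without the csv file; with the lesson's data the formula would be c ~ gs and the data argument would be stock_returns. The red line color is an illustrative choice.

```r
# Fit the linear model: dist is the response (y), speed is the predictor (x)
m <- lm(dist ~ speed, data = cars)

# Scatter plot using the same formula, then the fitted line on top
plot(dist ~ speed, data = cars, main = "Stopping Distance vs. Speed")
abline(m, col = "red")

# Intercept and slope of the fitted line
print(coef(m))
```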
Exercise
Load the 'stock_returns' dataset into R and create a scatter plot with Apple's returns on the x-axis and Facebook's returns on the y-axis. Then add a title, axis labels and a regression line to the plot.