Visualising Correlations

I find it much easier to understand data visually, so I created this simple article to show what various correlation values look like when plotted.

The article content below has been written in R Markdown. It contains my writing, the R code and the output of the R code. If you are not familiar with R then you can ignore the code and just focus on my writing and the output.

Understanding Correlation Values

I understand a visual representation of data better than statistical numbers about that data, and I find this particularly true for correlation and standard deviation.

I have created this document to visually represent different correlation values (there is also an equivalent for standard deviation) using generated data that is typical of the world I work in (process control). Actually that is not quite true: I have included a perfectly correlated data set which obviously does not exist in the real world!


Set up the environment

This section creates a dataset to use and loads up some extra libraries for plotting.

Create datasets

For this exercise we are going to create a dataset that has known relationships. We can then apply the correlation function to it.

The dataset will consist of the following fields:

* X - 1,000 normally distributed random values
* Y.perfect - A clean, exact derivation based on the equivalent X values
* Y.tight - As per Y.perfect but with some noise
* Y.loose - As per Y.perfect but with quite a lot of noise
* Y.veryloose - As per Y.perfect but with even more noise
* Y.outliers.some - As per Y.tight but with some outliers
* Y.outliers.lots - As per Y.tight but with lots of outliers
* Y.random - 1,000 normally distributed random values that are not related to X in any way

        set.seed(20160710)
        X <- rnorm(1000, 225, 50)
        
        Y.perfect <- -1.25 * X  + 800

        Y.tight <- Y.perfect + rnorm(1000, 0, 20)

        Y.loose <- Y.perfect + rnorm(1000, 0, 100)
        
        Y.veryloose <- Y.perfect + rnorm(1000, 0, 175)
        
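        # Overwrite 10 randomly chosen points (1%) with uniform random values
        Y.outliers.some <- Y.tight
        Y.outliers.some[sample(length(Y.outliers.some), 10)] <- runif(10, min = 300, max = 700)
        
        # Overwrite 100 randomly chosen points (10%) with uniform random values
        Y.outliers.lots <- Y.tight
        Y.outliers.lots[sample(length(Y.outliers.lots), 100)] <- runif(100, min = 300, max = 700)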
        
        Y.random <- rnorm(1000, 500, 65)
        
        Analyses.Data <- data.frame(X, Y.perfect, Y.tight, Y.outliers.some, Y.outliers.lots, Y.loose, Y.veryloose, Y.random)
        rm(X, Y.perfect, Y.tight, Y.outliers.some, Y.outliers.lots, Y.loose, Y.veryloose, Y.random)
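Because the relationships are known, a quick sanity check is possible: fitting a straight line should recover the slope (-1.25) and intercept (800) used above. The sketch below is not part of the original analysis. It regenerates X with the same seed so that it runs on its own, and the names `fit.perfect` and `fit.tight` are illustrative only.

```r
# Sanity check (illustrative): regenerate the first two series and fit a line
set.seed(20160710)
X <- rnorm(1000, 225, 50)
Y.perfect <- -1.25 * X + 800
Y.tight <- Y.perfect + rnorm(1000, 0, 20)

# coef() returns the fitted intercept and slope
fit.perfect <- coef(lm(Y.perfect ~ X))
fit.tight <- coef(lm(Y.tight ~ X))

round(fit.perfect, 3)
round(fit.tight, 3)
```

The perfect case should return the generating values exactly, while the tight case should land close to them, with the gap reflecting the added noise.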

Libraries

Load up the ggplot packages. These add “grammar of graphics” functionality to R, which makes for much nicer plotting.

Multiplot.R is a freely available function, available at [http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)], that allows you to place several plots in a single plot view.

        require(ggplot2)
## Loading required package: ggplot2
        require(ggthemes)
## Loading required package: ggthemes
        source("multiplot.R")

Statistical number data analysis

Let’s see what the correlation function shows us.

        round(cor(Analyses.Data)[-1,1], 3)
##       Y.perfect         Y.tight Y.outliers.some Y.outliers.lots 
##          -1.000          -0.956          -0.935          -0.786 
##         Y.loose     Y.veryloose        Y.random 
##          -0.551          -0.341          -0.011

As you might expect, the magnitudes reflect how close the data is to the perfect case (negative here because of the inverse relationship).
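The noisy cases can also be checked against theory. For Y = b * X + noise, with the noise independent of X, the population correlation is b * sd(X) / sqrt(b^2 * sd(X)^2 + sd(noise)^2). A minimal sketch, using the slope and standard deviations from the data-generation code; the function name `expected.cor` is my own, and the formula does not cover the outlier or random cases:

```r
# Theoretical correlation for Y = slope * X + independent noise
expected.cor <- function(slope, sd.x, sd.noise) {
        slope * sd.x / sqrt(slope^2 * sd.x^2 + sd.noise^2)
}

# Noise standard deviations taken from the rnorm() calls used to build the data
noise.sd <- c(Y.perfect = 0, Y.tight = 20, Y.loose = 100, Y.veryloose = 175)
expected <- round(expected.cor(-1.25, 50, noise.sd), 3)
expected
```

The noisy cases come out at roughly -0.95, -0.53 and -0.34, close to the observed -0.956, -0.551 and -0.341. The small differences are sampling variation in this particular set of 1,000 points.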

If you only saw these numbers, would you be happy modelling on, say, the correlation of -0.786 for the lots-of-outliers case?

It is difficult to tell, because the correlation function has summarised the data and removed the information you need to make this judgement, at least for me.
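A well-known illustration of this loss of information is Anscombe's quartet, which ships with R as the `anscombe` data frame: four x/y pairs with nearly identical correlations but very different shapes when plotted. This is an aside rather than part of the original analysis, and the variable name `quartet.cor` is my own.

```r
# anscombe has columns x1..x4 and y1..y4; correlate each matching pair
quartet.cor <- sapply(1:4, function(i) {
        cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(quartet.cor) <- paste0("set", 1:4)
round(quartet.cor, 2)
```

All four sets come out at about 0.82, yet only a plot reveals that one is a curve and two are dominated by a single unusual point.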

Visualise data analysis

Let’s start again and visualise properly with scatter plots.

        plots <- list()
        
        plots$p1 <- ggplot(Analyses.Data, aes(x = X, y = Y.perfect)) +
                geom_point(colour = "Red") +
                ggtitle("Perfect Relationship") +
                theme_bw()

        plots$p2 <- ggplot(Analyses.Data, aes(x = X, y = Y.tight)) +
                geom_point(colour = "Red") +
                ggtitle("Tight Relationship") +
                theme_bw()
        
        plots$p3 <- ggplot(Analyses.Data, aes(x = X, y = Y.outliers.some)) +
                geom_point(colour = "Red") +
                ggtitle("Tight but Some Outliers") +
                theme_bw()
        
        plots$p4 <- ggplot(Analyses.Data, aes(x = X, y = Y.outliers.lots)) +
                geom_point(colour = "Red") +
                ggtitle("Tight but Lots of Outliers") +
                theme_bw()
        
        plots$p5 <- ggplot(Analyses.Data, aes(x = X, y = Y.loose)) +
                geom_point(colour = "Red") +
                ggtitle("Loose Relationship") +
                theme_bw()
        
        plots$p6 <- ggplot(Analyses.Data, aes(x = X, y = Y.veryloose)) +
                geom_point(colour = "Red") +
                ggtitle("Very Loose Relationship") +
                theme_bw()
        
        plots$p7 <- ggplot(Analyses.Data, aes(x = X, y = Y.random)) +
                geom_point(colour = "Red") +
                ggtitle("No Relationship") +
                theme_bw()

        multiplot(plotlist = plots, cols = 2)
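If multiplot.R is not to hand, a similar grid can be produced with ggplot2 alone by reshaping the data to long format and using `facet_wrap`. A minimal sketch, assuming a data frame shaped like Analyses.Data; a small stand-in (`demo.data`, my own name) is generated here so the snippet runs by itself, and the plot is only drawn if ggplot2 is installed:

```r
# Small stand-in for Analyses.Data (hypothetical; 200 points, three Y series)
set.seed(1)
X <- rnorm(200, 225, 50)
demo.data <- data.frame(
        X,
        Y.perfect = -1.25 * X + 800,
        Y.tight   = -1.25 * X + 800 + rnorm(200, 0, 20),
        Y.random  = rnorm(200, 500, 65)
)

# Stack every Y column into one 'values' column, labelled by 'ind'
long.data <- stack(demo.data[-1])
# Repeat X once per stacked Y column so each row keeps its X value
long.data$X <- rep(demo.data$X, times = ncol(demo.data) - 1)

if (requireNamespace("ggplot2", quietly = TRUE)) {
        library(ggplot2)
        facet.plot <- ggplot(long.data, aes(x = X, y = values)) +
                geom_point(colour = "Red") +
                facet_wrap(~ ind, ncol = 2) +
                theme_bw()
        print(facet.plot)
}
```

By default the panels share axis scales, which makes the difference in spread between cases easier to compare; passing `scales = "free_y"` to `facet_wrap` lets each panel scale independently, closer to what the separate plots above do.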