Visualising Correlations

I find it much easier to understand data visually, so I created this simple article to show what various correlation values look like.

The article content below has been written in R Markdown. It contains my writing, the R code and the output of the R code. If you are not familiar with R then you can ignore the code and just focus on my writing and the output.

Understanding Correlation Values

I believe that I understand a visual representation of data better than statistical numbers about that data, and I find this particularly true for correlation and standard deviation.

I have created this document to visually represent different correlation values (there is also an equivalent for standard deviation) using generated data that is typical of the world I work in (process control). Actually that is not quite true: I have included a perfectly correlated data set which obviously does not exist in the real world!


Set up the environment

This section creates a dataset to use and loads up some extra libraries for plotting.

Create datasets

For this exercise we are going to create a dataset that has known relationships. We can then apply the correlation tests to it.

The dataset will consist of the following fields:

* X - 1,000 normally distributed random values
* Y.perfect - A clean, exact derivation based on the equivalent X values
* Y.tight - As per Y.perfect but with some noise
* Y.loose - As per Y.perfect but with quite a lot of noise
* Y.veryloose - As per Y.perfect but with even more noise
* Y.outliers.some - As per Y.tight but with some outliers
* Y.outliers.lots - As per Y.tight but with lots of outliers
* Y.random - 1,000 normally distributed random values that are not related to X in any way

        set.seed(20160710)
        X <- rnorm(1000, 225, 50)
        
        Y.perfect <- -1.25 * X  + 800

        Y.tight <- Y.perfect + rnorm(1000, 0, 20)

        Y.loose <- Y.perfect + rnorm(1000, 0, 100)
        
        Y.veryloose <- Y.perfect + rnorm(1000, 0, 175)
        
        Y.outliers.some <- Y.tight
        Y.outliers.some[sample(length(Y.outliers.some), 10)] = runif(10, min = 300, max = 700)
        
        Y.outliers.lots <- Y.tight
        Y.outliers.lots[sample(length(Y.outliers.lots), 100)] = runif(100, min = 300, max = 700)
        
        Y.random <- rnorm(1000, 500, 65)
        
        Analyses.Data <- data.frame(X, Y.perfect, Y.tight, Y.outliers.some, Y.outliers.lots, Y.loose, Y.veryloose, Y.random)
        rm(X, Y.perfect, Y.tight, Y.outliers.some, Y.outliers.lots, Y.loose, Y.veryloose, Y.random)
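As a quick check of how the outlier columns are built, the sketch below repeats the same replace-by-index step on a small stand-in series and counts how many values changed. The names `x`, `y.base`, `y.out`, `idx` and `n.changed`, and the seed, are illustrative and not part of the original code.

```r
set.seed(1)  # arbitrary seed, chosen only for this sketch

# Stand-in for Y.tight: 1,000 noisy values around a straight line
x      <- rnorm(1000, 225, 50)
y.base <- -1.25 * x + 800 + rnorm(1000, 0, 20)

# sample() without replacement picks 10 distinct positions,
# and runif() supplies 10 replacement values between 300 and 700
idx        <- sample(length(y.base), 10)
y.out      <- y.base
y.out[idx] <- runif(10, min = 300, max = 700)

# Only the sampled positions differ from the base series
n.changed <- sum(y.out != y.base)
n.changed
```

Because the replacement values are drawn independently of X, each replaced point lands somewhere in the 300 to 700 band regardless of where the underlying line sits at that X value.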

Libraries

Load up the ggplot2 and ggthemes packages. ggplot2 adds “grammar of graphics” functionality to R, which makes for much nicer plotting.

multiplot.R is a freely available function, published at [http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)], that allows you to place several plots in a single plot view.

        require(ggplot2)
## Loading required package: ggplot2
        require(ggthemes)
## Loading required package: ggthemes
        source("multiplot.R")

Statistical number data analysis

Let’s see what the correlation function shows us.

        round(cor(Analyses.Data)[-1,1], 3)
##       Y.perfect         Y.tight Y.outliers.some Y.outliers.lots 
##          -1.000          -0.956          -0.935          -0.786 
##         Y.loose     Y.veryloose        Y.random 
##          -0.551          -0.341          -0.011

As you might expect, the magnitudes reflect how close the data is to the perfect case (the values are negative because of the inverse relationship).
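The values for the noise-only columns can be cross-checked against theory. For Y = a * X + b + noise, with the noise independent of X, the population correlation is a * sd(X) / sqrt(a^2 * sd(X)^2 + sd(noise)^2). The sketch below applies that standard result to the slope and standard deviations used when generating the data; `expected.cor`, `noise.sd` and `theory` are names introduced here for illustration.

```r
# Population correlation between X and Y = a * X + b + noise,
# assuming the noise is independent of X
expected.cor <- function(a, sd.x, sd.noise) {
  a * sd.x / sqrt(a^2 * sd.x^2 + sd.noise^2)
}

# Slope and standard deviations taken from the data-generation code
noise.sd <- c(Y.perfect = 0, Y.tight = 20, Y.loose = 100, Y.veryloose = 175)
theory   <- round(expected.cor(a = -1.25, sd.x = 50, sd.noise = noise.sd), 3)
theory
```

The theoretical values come out at roughly -0.95, -0.53 and -0.34 for the tight, loose and very loose cases. The sample correlations above differ slightly from these because of random variation in a finite sample.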

If you only saw these numbers, would you be happy modelling on, say, the 78% correlation of the “lots of outliers” case?

It is difficult to tell, at least for me, because the correlation function has summarised the data and discarded the information you need to make that judgement.
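One way to see what the summary hides: a uniformly noisy relationship with no outliers at all can produce roughly the same correlation as the “lots of outliers” case. The sketch below is an illustration under stated assumptions, not part of the original analysis. It inverts the standard formula for the correlation of a linear relationship with independent noise to find a noise level that targets -0.786, then simulates it. The names `target`, `noise.sd`, `Y.noisy` and `r.noisy`, and the seed, are introduced here.

```r
set.seed(42)  # arbitrary seed for this sketch

a      <- -1.25    # slope used in the article
sd.x   <- 50       # standard deviation of X used in the article
target <- -0.786   # correlation reported for Y.outliers.lots

# For Y = a * X + b + noise: cor = a * sd.x / sqrt(a^2 * sd.x^2 + noise.sd^2).
# Solving that expression for noise.sd gives:
noise.sd <- abs(a) * sd.x * sqrt(1 / target^2 - 1)

X       <- rnorm(1000, 225, sd.x)
Y.noisy <- a * X + 800 + rnorm(1000, 0, noise.sd)

# Sample correlation of the evenly noisy data, which contains no outliers
r.noisy <- cor(X, Y.noisy)
round(c(noise.sd = noise.sd, r.noisy = r.noisy), 3)
```

The sample correlation should come out close to the target, even though this data is an even cloud rather than a tight line with scattered exceptions. The single number cannot tell the two situations apart.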

Visualise data analysis

Let’s start again and visualise properly with scatter plots.

        plots <- list()
        
        plots$p1 <- ggplot(Analyses.Data, aes(x = X, y = Y.perfect)) +
                geom_point(colour = "Red") +
                ggtitle("Perfect Relationship") +
                theme_bw()

        plots$p2 <- ggplot(Analyses.Data, aes(x = X, y = Y.tight)) +
                geom_point(colour = "Red") +
                ggtitle("Tight Relationship") +
                theme_bw()
        
        plots$p3 <- ggplot(Analyses.Data, aes(x = X, y = Y.outliers.some)) +
                geom_point(colour = "Red") +
                ggtitle("Tight but Some Outliers") +
                theme_bw()
        
        plots$p4 <- ggplot(Analyses.Data, aes(x = X, y = Y.outliers.lots)) +
                geom_point(colour = "Red") +
                ggtitle("Tight but Lots of Outliers") +
                theme_bw()
        
        plots$p5 <- ggplot(Analyses.Data, aes(x = X, y = Y.loose)) +
                geom_point(colour = "Red") +
                ggtitle("Loose Relationship") +
                theme_bw()
        
        plots$p6 <- ggplot(Analyses.Data, aes(x = X, y = Y.veryloose)) +
                geom_point(colour = "Red") +
                ggtitle("Very Loose Relationship") +
                theme_bw()
        
        plots$p7 <- ggplot(Analyses.Data, aes(x = X, y = Y.random)) +
                geom_point(colour = "Red") +
                ggtitle("No Relationship") +
                theme_bw()

        multiplot(plotlist = plots, cols = 2)
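If multiplot.R is not to hand, a similar grid can be drawn with ggplot2 alone by reshaping the data to long form and using `facet_wrap()`. This is an alternative sketch, not the method used above. It rebuilds a reduced version of the dataset so that it runs on its own, and the names `wide`, `long` and `p` are introduced here.

```r
library(ggplot2)

set.seed(20160710)
X <- rnorm(1000, 225, 50)
Y.perfect <- -1.25 * X + 800

# Reduced stand-in for Analyses.Data (three of the Y columns)
wide <- data.frame(
  Y.perfect = Y.perfect,
  Y.tight   = Y.perfect + rnorm(1000, 0, 20),
  Y.loose   = Y.perfect + rnorm(1000, 0, 100)
)

# stack() turns the columns into one 'values' column plus an 'ind'
# factor naming the source column; X is repeated once per column
long   <- stack(wide)
long$X <- rep(X, times = ncol(wide))

# One panel per Y column, each with its own y-axis range
p <- ggplot(long, aes(x = X, y = values)) +
  geom_point(colour = "red") +
  facet_wrap(~ ind, ncol = 2, scales = "free_y") +
  theme_bw()
print(p)
```

The trade-off is that faceting shares one set of titles and styling across panels, whereas building each plot separately, as above, lets every panel have its own title.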

Much better!

You can now see that the relationship is inverse in every case except the “no relationship” case. Even the “very loose” case shows some relationship, although I probably would not use this model to justify anything important!

The “lots of outliers” case had only a 78% correlation, but a visual representation immediately shows there is actually a strong relationship with quite a few exceptions. These exceptions could be for all sorts of reasons in the world of process control, but the scatter plot lets your brain assess the situation much better than a simple “78%” figure.

The loose relationship, which had a 55% correlation, shows that there is some relationship. Even the very loose relationship, with a 34% correlation, shows that there could be a valid relationship, but perhaps with lots of hysteresis.

Conclusion

Always visualise your data, especially when analysing for relationships.


R session information

        sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggthemes_3.2.0 ggplot2_2.1.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5      assertthat_0.1   digest_0.6.9     plyr_1.8.4      
##  [5] gtable_0.2.0     formatR_1.4      magrittr_1.5     evaluate_0.9    
##  [9] scales_0.4.0     stringi_1.1.1    rmarkdown_1.0    labeling_0.3    
## [13] tools_3.3.1      stringr_1.0.0    munsell_0.4.3    yaml_2.1.13     
## [17] colorspace_1.2-6 htmltools_0.3.5  knitr_1.13