---
title: "Understanding Correlation Values"
author: "Mark Evans"
date: "10 July 2016"
output: html_document
---

I find a visual representation of data easier to understand than statistical numbers about that data, and this is particularly true for correlation and standard deviation.

I have created this document to visually represent different correlation values (there is also an equivalent for standard deviation) using generated data that is typical of the world I work in (process control). Actually that is not quite true: I have included a perfectly correlated data set, which obviously does not exist in the real world!

The document has been written in R Markdown. It contains my writing, the R code and the output of the R code. If you are not familiar with R then you can ignore the code and just focus on my writing and the output.

Associated items:

* Download this R Markdown document so you can run it on your own machine
* R - what is it / how can I get it
* R Markdown - what is it / how can I get it

----

# Set up the environment

This section creates a dataset to use and loads some extra libraries for plotting.

## Create datasets

For this exercise we are going to create a dataset that has known relationships. We can then apply the correlation tests to it.

The dataset will consist of the following fields:

* X - 1,000 normally distributed random values
* Y.perfect - A clean, exact linear function of the corresponding X values
* Y.tight - As per Y.perfect but with some noise
* Y.loose - As per Y.perfect but with quite a lot of noise
* Y.veryloose - As per Y.perfect but with even more noise
* Y.outliers.some - As per Y.tight but with some outliers
* Y.outliers.lots - As per Y.tight but with lots of outliers
* Y.random - 1,000 normally distributed random values that are not related to X in any way

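
How much noise counts as "some" or "a lot" can be tied to the correlation to expect. As a rough sketch, assume $Y = aX + b + \varepsilon$ with noise $\varepsilon$ independent of $X$. The symbols $a$, $b$, $\sigma_X$ and $\sigma_\varepsilon$ are introduced here for illustration only and do not appear in the code. Under that assumption the population correlation is

$$
\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{a \, \sigma_X^2}{\sigma_X \sqrt{a^2 \sigma_X^2 + \sigma_\varepsilon^2}} = \frac{a \, \sigma_X}{\sqrt{a^2 \sigma_X^2 + \sigma_\varepsilon^2}}
$$

With the values used in the chunk that follows ($a = -1.25$, $\sigma_X = 50$), noise standard deviations of 20, 100 and 175 give roughly -0.95, -0.53 and -0.34. A sample of 1,000 points should land close to these figures, but not exactly on them.
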
```{r Create dataset}
set.seed(20160710)  # fixed seed so the document is reproducible

# X plus four linear relationships with increasing amounts of noise
X <- rnorm(1000, 225, 50)
Y.perfect <- -1.25 * X + 800
Y.tight <- Y.perfect + rnorm(1000, 0, 20)
Y.loose <- Y.perfect + rnorm(1000, 0, 100)
Y.veryloose <- Y.perfect + rnorm(1000, 0, 175)

# Tight relationship with 10 points replaced by unrelated values
Y.outliers.some <- Y.tight
Y.outliers.some[sample(length(Y.outliers.some), 10)] <- runif(10, min = 300, max = 700)

# Tight relationship with 100 points replaced by unrelated values
Y.outliers.lots <- Y.tight
Y.outliers.lots[sample(length(Y.outliers.lots), 100)] <- runif(100, min = 300, max = 700)

# No relationship to X at all
Y.random <- rnorm(1000, 500, 65)

# Collect everything in one data frame and tidy up the workspace
Analyses.Data <- data.frame(X, Y.perfect, Y.tight, Y.outliers.some, Y.outliers.lots,
                            Y.loose, Y.veryloose, Y.random)
rm(X, Y.perfect, Y.tight, Y.outliers.some, Y.outliers.lots, Y.loose, Y.veryloose, Y.random)
```

## Libraries

Load up the ggplot packages. These add "grammar of graphics" functionality to R, which makes for much nicer plotting. `multiplot.R` contains a freely available function, published at <http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)>, that allows you to place several plots in a single plot view.

```{r collapse = TRUE}
# library() stops with an error if a package is missing, rather than just warning
library(ggplot2)
library(ggthemes)
source("multiplot.R")  # must be saved alongside this document
```

# Statistical number data analysis

Let's see what the correlation function shows us.

```{r Correlation}
# Correlation of every Y field with X (first column, minus X against itself)
round(cor(Analyses.Data)[-1, 1], 3)
```

As you might expect, the magnitudes reflect how close the data is to the perfect case (negative here because of the inverse relationship). If you only saw these numbers, would you be happy modelling on, say, the correlation of about -0.78 in the "lots of outliers" case? It is difficult to tell, because the correlation function has summarised the data and removed the information that allows you to make this judgement, at least for my understanding.

# Visualise data analysis

Let's start again and visualise properly with scatter plots.

```{r fig.height = 12, fig.width = 12}
# One scatter plot per Y field, all against X
plots <- list()
plots$p1 <- ggplot(Analyses.Data, aes(x = X, y = Y.perfect)) +
  geom_point(colour = "Red") +
  geom_smooth() +
  ggtitle("Perfect Relationship") +
  theme_bw()
plots$p2 <- ggplot(Analyses.Data, aes(x = X, y = Y.tight)) +
  geom_point(colour = "Red") +
  ggtitle("Tight Relationship") +
  theme_bw()
plots$p3 <- ggplot(Analyses.Data, aes(x = X, y = Y.outliers.some)) +
  geom_point(colour = "Red") +
  ggtitle("Tight but Some Outliers") +
  theme_bw()
plots$p4 <- ggplot(Analyses.Data, aes(x = X, y = Y.outliers.lots)) +
  geom_point(colour = "Red") +
  ggtitle("Tight but Lots of Outliers") +
  theme_bw()
plots$p5 <- ggplot(Analyses.Data, aes(x = X, y = Y.loose)) +
  geom_point(colour = "Red") +
  ggtitle("Loose Relationship") +
  theme_bw()
plots$p6 <- ggplot(Analyses.Data, aes(x = X, y = Y.veryloose)) +
  geom_point(colour = "Red") +
  ggtitle("Very Loose Relationship") +
  theme_bw()
plots$p7 <- ggplot(Analyses.Data, aes(x = X, y = Y.random)) +
  geom_point(colour = "Red") +
  ggtitle("No Relationship") +
  theme_bw()

# Arrange all seven plots on one page, two per row
multiplot(plotlist = plots, cols = 2)
```

Much better! You can now see that there is an inverse relationship in every case except the "no relationship" case. Even the "very loose" case shows some relationship, although I probably would not use this model to justify anything important!

The "lots of outliers" case had a correlation of only about -0.78, but the visual representation immediately shows that there is actually a strong relationship with quite a few exceptions. These exceptions could occur for all sorts of reasons in the world of process control, but the scatter plot lets your brain assess the situation much better than a simple "78%" figure.

The loose relationship, which had a correlation of about -0.5, shows that there is some relationship. Even the very loose relationship, with a correlation of about -0.34, shows that there could be a valid relationship, but perhaps with lots of hysteresis.

# Conclusion

Always visualise your data, especially when analysing for relationships.

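
The "lots of outliers" observation can also be checked numerically once the plot has told you what to look for. The following is a minimal, self-contained sketch rather than part of the analysis above: it rebuilds a similar dataset with its own seed, and the names `x`, `y_tight`, `y_outliers` and `is_outlier` are illustrative only. It compares the correlation over all points with the correlation over only the points that were left untouched.

```r
# Sketch only: regenerate data shaped like Y.outliers.lots, remembering
# which points were replaced so they can be excluded afterwards.
set.seed(1)  # arbitrary seed, not the one used in the main document
n <- 1000
x <- rnorm(n, 225, 50)
y_tight <- -1.25 * x + 800 + rnorm(n, 0, 20)

# Mark 100 randomly chosen points and overwrite them with unrelated values
is_outlier <- seq_len(n) %in% sample(n, 100)
y_outliers <- y_tight
y_outliers[is_outlier] <- runif(100, min = 300, max = 700)

# Correlation over all 1,000 points versus the 900 untouched points
cor_all <- cor(x, y_outliers)
cor_kept <- cor(x[!is_outlier], y_outliers[!is_outlier])

round(c(all = cor_all, kept = cor_kept), 3)
```

In real process data you would not know in advance which points are the exceptions, which is exactly why the scatter plot has to come first.
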

----

# R session information

```{r session info}
sessionInfo()
```