Identification of Correlations Using R

R can quickly help you identify correlations in datasets. This example walks you through the process using emulated process data.

The article content below has been written in R Markdown. It contains my writing, the R code and the output of the R code. If you are not familiar with R then you can ignore the code and just focus on my writing and the output.

Associated Items:

Identifying Correlations in Process Data using R

Set up the problem

This article demonstrates how multi-variable linear modelling in R can both identify relationships and determine the linear, steady-state model parameters for a dataset of made-up, process-type data.

There is an issue with this approach for real-world situations: it ignores dynamic response. Obviously, you cannot use this approach directly for modelling process control. The purpose of this article is to show that it can be an aid to identifying relationships between sets of variables.

Create a dataset

For this article a fictional dataset will be created. Normally you would not know these relationships in detail!

The dataset is based on the following equations:

\[E = 25A - 8B + noise\]
\[F = 3A + 11B + noise\]
\[G = -4B + 200 + noise\]
\[H = 2C + noise\]
\[I = 88 + noise\]

Where noise is just a random distribution added to give the data a more realistic, process-like feel.

To create this in R we use the following …

        set.seed(160808)
        variables <- list()
        
        variables$VarA <- rnorm(1000, mean = 30, sd = 4)
        variables$VarB <- rnorm(1000, mean = 32, sd = 5)
        variables$VarC <- rnorm(1000, mean = 22, sd = 3)
        variables$VarD <- rnorm(1000, mean = 9, sd = 1)
        variables$VarE <- 25*variables$VarA - 8*variables$VarB + rnorm(1000, mean = 0, sd = 2)
        variables$VarF <- 3*variables$VarA + 11*variables$VarB + rnorm(1000, mean = 0, sd = 5)
        variables$VarG <- -4*variables$VarB + rnorm(1000, mean = 200, sd = 3)
        variables$VarH <- 2*variables$VarC + rnorm(1000, mean = 0, sd = 7)
        variables$VarI <- rnorm(1000, mean = 88, sd = 7)
        
        dataset <- data.frame(variables)

So let's pretend all we know is that we have 9 variables that we think are related in some way. We could probably guess which are independent and which are dependent, and we might think we know which are more significant, but we won't know this for certain. Now to try and determine this using R and its linear modelling capabilities.

Load up the libraries

        library(ggplot2)
        library(ggthemes)

Correlation

First have a look at the correlation values. This is a quick way to identify possible relationships.

        cor(dataset)
##              VarA        VarB        VarC         VarD         VarE
## VarA  1.000000000  0.00812150  0.04615119  0.008085218  0.928931854
## VarB  0.008121500  1.00000000 -0.02302034  0.015121510 -0.362190304
## VarC  0.046151187 -0.02302034  1.00000000  0.047575418  0.050827732
## VarD  0.008085218  0.01512151  0.04757542  1.000000000  0.001794263
## VarE  0.928931854 -0.36219030  0.05082773  0.001794263  1.000000000
## VarF  0.219547332  0.97348045 -0.01571671  0.015409345 -0.155296694
## VarG -0.005791332 -0.98720651  0.02523929 -0.017979861  0.359568941
## VarH  0.019817086  0.03954596  0.65380617  0.018470094  0.003157470
## VarI -0.020941965 -0.01911270 -0.04215538  0.006086108 -0.011815439
##             VarF         VarG        VarH         VarI
## VarA  0.21954733 -0.005791332  0.01981709 -0.020941965
## VarB  0.97348045 -0.987206511  0.03954596 -0.019112700
## VarC -0.01571671  0.025239290  0.65380617 -0.042155376
## VarD  0.01540934 -0.017979861  0.01847009  0.006086108
## VarE -0.15529669  0.359568941  0.00315747 -0.011815439
## VarF  1.00000000 -0.960352451  0.04109240 -0.025793276
## VarG -0.96035245  1.000000000 -0.03935183  0.013732955
## VarH  0.04109240 -0.039351827  1.00000000 -0.028419319
## VarI -0.02579328  0.013732955 -0.02841932  1.000000000

This tells us that:

| Variable | Correlated with… | Dataset definition |
|----------|------------------|--------------------|
| A | E, F | Independent, input for E & F |
| B | E, F, G | Independent, input for E, F & G |
| C | H | Independent, input for H |
| D | no correlations | Independent, no relationship |
| E | A, B, F, G | Dependent on A & B |
| F | B, G (and A slightly) | Dependent on A & B |
| G | B, E, F | Dependent on B |
| H | C | Dependent on C |
| I | no correlations | No relationship |

We can see that the correlation analysis has identified all the true relationships. However, it has additionally identified other relationships, which are a consequence of the way we created the data and the fact that the "cor" function has no concept of independent and dependent variables. This means that although correlation data is useful, it cannot separate cause and effect.

So, for example, it has identified that VarG is correlated with VarB, which occurs because we defined VarG in terms of VarB, but it has also identified a correlation between VarG and the variables VarE and VarF. This is because VarE and VarF are also based on VarB; hence VarB, VarE, VarF and VarG are all correlated.
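
If you just want a quick list of the strong pairs rather than scanning the whole matrix, you can filter it programmatically. The sketch below is not part of the original analysis, the variable names are my own and the 0.5 cut-off is an arbitrary choice:

        # list variable pairs whose absolute correlation exceeds 0.5
        cors <- cor(dataset)
        cors[upper.tri(cors, diag = TRUE)] <- NA  # keep each pair once, drop self-correlations
        strong <- which(abs(cors) > 0.5, arr.ind = TRUE)
        data.frame(var1 = rownames(cors)[strong[, 1]],
                   var2 = colnames(cors)[strong[, 2]],
                   correlation = round(cors[strong], 3))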

Linear Modelling

After the correlation function, another good step is to use linear modelling to estimate parameter coefficients. Note that what we call "variables" here are often referred to as "parameters" in the statistics world.

Let's pick one of the correlations and do a multiple-parameter linear model fit. This is very easy to do in R using the lm function.

        # fit a model between variable VarG and each of the other variables
        lm.fit <- lm(VarG ~ ., data = dataset)
        summary(lm.fit)
## 
## Call:
## lm(formula = VarG ~ ., data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2874 -1.9809  0.0141  1.8831 11.8918 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 201.041201   1.981248 101.472   <2e-16 ***
## VarA          0.738803   1.225268   0.603    0.547    
## VarB         -4.345898   0.461661  -9.414   <2e-16 ***
## VarC          0.031073   0.044651   0.696    0.487    
## VarD         -0.062252   0.098688  -0.631    0.528    
## VarE         -0.030309   0.049034  -0.618    0.537    
## VarF          0.009616   0.020774   0.463    0.644    
## VarH         -0.007765   0.014607  -0.532    0.595    
## VarI         -0.013501   0.014273  -0.946    0.344    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.168 on 991 degrees of freedom
## Multiple R-squared:  0.9746, Adjusted R-squared:  0.9744 
## F-statistic:  4762 on 8 and 991 DF,  p-value: < 2.2e-16

The summary output of an lm-type fit gives:

1. The variable name
2. The estimate of the variable's parameter value, i.e. the coefficient
3. The standard error of the estimated coefficient
4. The "t value", also called the "t-statistic", which is the ratio of the departure of the estimated parameter from its notional value to its standard error. Wikipedia has more.
5. Pr(>|t|), the p-value: the probability of seeing a t-statistic at least this extreme if the null hypothesis (the coefficient is zero) were true. Wikipedia has more on this too.
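
The same values can also be pulled out programmatically rather than read off the printed summary, which is handy if you want to filter on them. A minimal sketch using the coefficient matrix that summary() returns for an lm fit:

        # the coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
        coef_table <- summary(lm.fit)$coefficients
        # just the p-value column, used below to decide which variables to keep
        coef_table[, "Pr(>|t|)"]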

This shows that there is probably a significant relationship between VarG and VarB, since the p-value column shows a very low number for this coefficient. This is interpreted as strong evidence against the null hypothesis, i.e. it is very unlikely that the data would look like this if there were no relationship.

The model from the above indicates that the relationship is: \[G = -4.345898B + 201.041201\]

Which is very close to the definition: \[G = -4B + 200\]

The presence of each variable affects the coefficients estimated for the other variables, so the uncertain relationships should be removed. Using the p-value column (i.e. Pr(>|t|)) as a guide indicates that all variables apart from VarB should be removed.

        # refit the model using only VarB
        lm.fit <- lm(VarG ~ VarB, data = dataset)
        summary(lm.fit)
## 
## Call:
## lm(formula = VarG ~ VarB, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3645 -1.9923 -0.0116  1.8978 11.9280 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 199.92428    0.65666   304.5   <2e-16 ***
## VarB         -3.99751    0.02044  -195.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.161 on 998 degrees of freedom
## Multiple R-squared:  0.9746, Adjusted R-squared:  0.9746 
## F-statistic: 3.826e+04 on 1 and 998 DF,  p-value: < 2.2e-16

This gives a model of: \[G = -3.99751B + 199.92428\]

This is very close to the definition for both the coefficient and the intercept.
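
As a quick sanity check (not part of the original write-up), confint() gives confidence intervals for the fitted coefficients; for this dataset the intervals should comfortably contain values close to the -4 slope and 200 intercept used in the definition:

        # 95% confidence intervals for the intercept and the VarB coefficient
        confint(lm.fit, level = 0.95)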

We can visualise this in a plot…

        ggplot(dataset, aes(x = VarB, y = VarG)) +
                geom_point(colour = "pink") +
                geom_abline(slope = -3.99751, intercept = 199.92428, colour = "darkred") +
                theme_bw()

Which looks like a good fit.

But this is to be expected! R is good at fitting linear models (and others) to complex sets of data. What I find useful, though, is how quickly and easily you can identify that models exist between variables.

Let's now do the same for a more complex variable, VarE.

First, fit a model against all the other variables…

        # fit a model between variable VarE and each of the other variables
        lm.fit <- lm(VarE ~ ., data = dataset)
        summary(lm.fit)
## 
## Call:
## lm(formula = VarE ~ ., data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1994 -1.3842  0.0474  1.3897  7.0553 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.993179   4.329929   0.691    0.490    
## VarA        24.950715   0.043325 575.897   <2e-16 ***
## VarB        -8.211104   0.171396 -47.907   <2e-16 ***
## VarC        -0.012674   0.028925  -0.438    0.661    
## VarD        -0.013465   0.063933  -0.211    0.833    
## VarF         0.013384   0.013450   0.995    0.320    
## VarG        -0.012716   0.020571  -0.618    0.537    
## VarH        -0.004978   0.009461  -0.526    0.599    
## VarI         0.009285   0.009245   1.004    0.315    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.052 on 991 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 3.334e+05 on 8 and 991 DF,  p-value: < 2.2e-16

Now repeat, keeping only the more significant variables (based on the p-values)…

        # refit the model for VarE using only VarA and VarB
        lm.fit <- lm(VarE ~ VarA+VarB, data = dataset)
        summary(lm.fit)
## 
## Call:
## lm(formula = VarE ~ VarA + VarB, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2893 -1.3609  0.0399  1.3823  7.2554 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  0.68647    0.64695    1.061    0.289    
## VarA        24.98941    0.01640 1523.779   <2e-16 ***
## VarB        -8.01282    0.01325 -604.582   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.05 on 997 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 1.336e+06 on 2 and 997 DF,  p-value: < 2.2e-16

This implies:

\[ E = 24.98941A - 8.01282B + 0.68647 \]

Which is very close to the definition: \[ E = 25A - 8B + noise \]
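
Once a model like this has been identified, it can also be applied to new data with predict(). A minimal sketch, using made-up operating points near the means that were used to generate VarA and VarB:

        # predict VarE (with prediction intervals) for two hypothetical operating points
        new_points <- data.frame(VarA = c(28, 32), VarB = c(30, 35))
        predict(lm.fit, newdata = new_points, interval = "prediction")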