---
title: "Identifying Correlations in Process Data using R"
author: "Mark Evans"
date: "8 August 2016"
output: html_document
---

# Set up the problem

This article demonstrates how multi-variable modelling in R can both identify relationships and determine the linear, steady-state model parameters for a data set of made-up, process-type data. There is an issue with this approach for real-world situations: it ignores dynamic response, so you cannot use it directly for modelling process control. The purpose of this article is to show that it can be an aid to identifying relationships between sets of variables.

## Create a dataset

For this article a fictional dataset will be created. Normally you would not know these relationships in detail! The dataset is based on the following equations:

$$E = 25A - 8B + noise$$
$$F = 3A + 11B + noise$$
$$G = -4B + 200 + noise$$
$$H = 2C + noise$$
$$I = 88 + noise$$

where noise is just a random distribution added to give the data a more realistic, process-type feel. To create this in R we use the following...

```{r Create dataset}
set.seed(160808)

variables <- list()
variables$VarA <- rnorm(1000, mean = 30, sd = 4)
variables$VarB <- rnorm(1000, mean = 32, sd = 5)
variables$VarC <- rnorm(1000, mean = 22, sd = 3)
variables$VarD <- rnorm(1000, mean = 9, sd = 1)
variables$VarE <- 25*variables$VarA - 8*variables$VarB + rnorm(1000, mean = 0, sd = 2)
variables$VarF <- 3*variables$VarA + 11*variables$VarB + rnorm(1000, mean = 0, sd = 5)
variables$VarG <- -4*variables$VarB + rnorm(1000, mean = 200, sd = 3)
variables$VarH <- 2*variables$VarC + rnorm(1000, mean = 0, sd = 7)
variables$VarI <- rnorm(1000, mean = 88, sd = 7)

dataset <- data.frame(variables)
```

So let's pretend all we know is that we have 9 variables that we think are related in some way. We could probably guess which are independent and which are dependent, and we might think we know which are more significant, but we won't know this for certain. Now to try and determine this using R and its linear modelling capabilities.

## Load up the libraries

```{r}
library(ggplot2)
library(ggthemes)
```

# Correlation

First have a look at the correlation values. This is a quick way to identify possible relationships.

```{r Correlation values}
cor(dataset)
```

This tells us that:

Variable   | Correlated with...       | Dataset definition
-----------|--------------------------|-------------------------------
A          | E, F                     | Independent, input for E & F
B          | E, F, G                  | Independent, input for E, F & G
C          | H                        | Independent, input for H
D          | no correlations          | Independent - no relationship
E          | A, B, F, G               | Dependent on A & B
F          | B, G (and A slightly)    | Dependent on A & B
G          | B, E, F                  | Dependent on B
H          | C                        | Dependent on C
I          | no correlations          | No relationship

We can see that the correlation analysis has identified all the true relationships. However, it has also identified other relationships which are a consequence of the way we created the data and of the fact that the `cor` function does not understand the concept of independent and dependent variables. This means that although correlation data is useful, it cannot separate cause and effect. So, for example, it has identified that VarG is correlated with VarB, which occurs because we defined VarG in terms of VarB, but it has also identified correlations between VarG and the variables VarE and VarF. This is because VarE and VarF are also based on VarB, hence VarB, VarE, VarF and VarG are all correlated with each other.
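Before moving on to modelling, it can help to look at these relationships rather than just read them off the correlation matrix. As a minimal sketch using only base R's `pairs` function, a scatterplot matrix of the variables that the table groups together (VarB, VarE, VarF and VarG) makes the shared dependence on VarB easy to spot:

```{r}
# Sketch: scatterplot matrix of the variables the correlation matrix groups together.
# The straight-line patterns show why VarE, VarF and VarG all correlate with each other:
# each of them contains VarB.
pairs(dataset[, c("VarB", "VarE", "VarF", "VarG")], pch = 20, col = "grey40")
```

This is only a visual aid; it still cannot tell us which variable is the cause and which is the effect.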
# Linear Modelling

After the correlation function, another good step is to use linear modelling to estimate parameter coefficients. Note that what we call "variables" are often referred to as "parameters" in the statistics world. Let's pick one of the correlations and do a multiple parameter linear model fit. This is very easy to do in R using the lm function.

```{r}
# fit a model between variable VarG and each of the other variables
lm.fit <- lm(VarG ~ ., data = dataset)
summary(lm.fit)
```

The summary output of an lm type fit gives...

1. Variable name
2. Estimate of the variable's parameter value, i.e. the coefficient
3. The standard error of the estimated coefficient when applied to the data
4. The "t value", also called the "t-statistic", which is the ratio of the departure of an estimated parameter from its notional value to its standard error. [Wikipedia](https://en.wikipedia.org/wiki/T-statistic) has more.
5. Pr(>|t|), the p-value: the probability of seeing a t-statistic at least this extreme if the null hypothesis (that the true coefficient is zero) were true. [Wikipedia](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing) has more on this too.

This shows that there is probably a significant model between VarG and VarB, since the p-value column shows a very low number for this coefficient. This is interpreted as "the data would be very unlikely if there were no relationship", i.e. the chances of there being no relationship are very low.

The model from the above indicates that the relationship is:

$$G = -4.345898B + 201.041201$$

which is very close to the definition:

$$G = -4B + 200$$

The presence of a variable affects the coefficients for the other variables, so the uncertain relationships should be removed. Using the p-value column (i.e. Pr(>|t|)) as a guide indicates that all variables apart from B should be removed.

```{r}
# fit a model between variable VarG and VarB only
lm.fit <- lm(VarG ~ VarB, data = dataset)
summary(lm.fit)
```

This gives a model of:

$$G = -3.99751B + 199.92428$$

This is very close to the definition for both coefficient and intercept. We can visualise this in a plot...

```{r}
ggplot(dataset, aes(x = VarB, y = VarG)) +
  geom_point(colour = "pink") +
  geom_abline(slope = -3.99751, intercept = 199.92428, colour = "darkred") +
  theme_bw()
```

which looks like a good fit. But this is to be expected! R is good at fitting linear models (and others) to complex sets of data. What I find useful, though, is that very quickly and easily you can identify that models exist between variables.

Let's now do the same for a more complex variable, VarE. First fit models for all variables...

```{r}
# fit a model between variable VarE and each of the other variables
lm.fit <- lm(VarE ~ ., data = dataset)
summary(lm.fit)
```

Now repeat for the more significant relationships (consider the p-values)...

```{r}
# fit a model between variable VarE and VarA & VarB only
lm.fit <- lm(VarE ~ VarA + VarB, data = dataset)
summary(lm.fit)
```

which implies

$$E = 24.98941A - 8.01282B + 0.68647$$

which is very close to the definition:

$$E = 25A - 8B + noise$$
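As a quick final check, a sketch that assumes `lm.fit` still holds the VarE ~ VarA + VarB fit from the chunk above: `confint` gives interval estimates that should bracket the true coefficients of 25 and -8, and plotting fitted against observed values shows how well the reduced model reproduces the data.

```{r}
# Sketch: assumes lm.fit is the VarE ~ VarA + VarB model fitted above.
# Confidence intervals for the estimated coefficients - they should bracket 25 and -8.
confint(lm.fit)

# Fitted vs observed values; points close to the 1:1 line indicate a good fit.
ggplot(data.frame(fitted = fitted(lm.fit), observed = dataset$VarE),
       aes(x = fitted, y = observed)) +
  geom_point(colour = "pink") +
  geom_abline(slope = 1, intercept = 0, colour = "darkred") +
  theme_bw()
```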