Venkat Hebbar

Correlation: Statistical Analysis

A crucial aspect of computer vision and machine learning is the ability to comprehend and interpret data, which allows for informed decision-making in the design process. This raises an open question:

What is the best way to effectively understand and analyze data?

The answer is to apply statistical techniques. This article deals with one of the most important of them: correlation.

Correlation is a statistical measure that describes the relationship between two or more variables. It is a way to determine how closely two variables are related and the direction of that relationship. In other words, it tells us whether two variables change together in the same direction, in opposite directions, or not at all.

Correlation is a term commonly used in everyday language to indicate a link or connection. In statistics, it is a method that demonstrates the degree to which two variables are related and the strength of that relationship. An example is the relationship between the number of hours a student studies and their test scores. If the data shows a positive correlation, test scores tend to increase as the number of hours studied increases. If the data shows a negative correlation, test scores tend to decrease as the number of hours studied increases. Fig. 1 depicts positive, negative, and no correlation.

Fig. 1 Types of Correlation (Source: Wikipedia)
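The three cases can also be checked numerically. The following is a minimal sketch, assuming small hypothetical data sets invented purely for illustration; the helper name `correlation` is likewise an assumption and not part of any library. It computes the correlation coefficient from deviations around the means and prints its sign for each case.

```python
from math import sqrt

def correlation(x, y):
    """Correlation coefficient computed from deviations around the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    co_variation = sum(a * b for a, b in zip(dx, dy))
    return co_variation / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5, 6]
scores_rising = [52, 58, 61, 70, 74, 83]     # scores rise with study time
scores_falling = [83, 74, 70, 61, 58, 52]    # scores fall with study time
scores_unrelated = [60, 70, 55, 55, 70, 60]  # no linear trend

r_positive = correlation(hours, scores_rising)
r_negative = correlation(hours, scores_falling)
r_none = correlation(hours, scores_unrelated)

print(f"positive case: r = {r_positive:+.3f}")
print(f"negative case: r = {r_negative:+.3f}")
print(f"no correlation: r = {r_none:+.3f}")
```

A coefficient close to +1 or -1 corresponds to the tight upward or downward point clouds in Fig. 1, while a coefficient near 0 corresponds to the scattered case.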

Significance:

Before implementing any classifier in machine learning, it is important to determine the correlation between intra-intent and inter-intent patterns. Intra-intent patterns, or patterns within the same class, typically have a stronger correlation than inter-intent patterns, or patterns between different classes. In order to effectively classify patterns, it is important to consider the correlation in relation to the problem statement and determine whether the problem at hand is truly one of pattern classification.

  • Understanding Relationships: Correlation can be used to identify the relationship between two or more variables. This can suggest possible cause-and-effect relationships and help predict future behavior, although correlation alone does not establish causation.
  • Identifying Patterns: Correlation can be used to identify patterns and trends in data, which can guide further analysis and decision-making.
  • Predicting Outcomes: Correlation can be used to make predictions about future outcomes based on the relationship between variables.
  • Feature Selection: Correlation can be used to identify highly correlated features in a dataset, which can then be removed to reduce dimensionality and improve model performance.
  • Model Building: Correlation can be used to identify relationships between input variables and the target variable, which can inform the selection of relevant features for building a model.
  • Model Validation: Correlation can be used to validate the performance of a model by comparing the predicted values to the actual values.
  • Data Exploration: Correlation can be used to explore and understand the relationships between variables in a dataset, which can help to identify patterns and trends that can be used to guide further analysis and modeling.
  • Identifying Multicollinearity: Correlation is also useful for identifying correlated input variables, which can cause multicollinearity problems in linear regression.
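The feature selection and multicollinearity points above can be sketched in a few lines. This is only an illustration under stated assumptions: the dataset is hypothetical, the 0.9 threshold is an arbitrary choice, and the helper names `pearson` and `drop_correlated` are invented for this example. A feature is kept only if its absolute correlation with every feature already kept stays at or below the threshold.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def drop_correlated(features, threshold=0.9):
    """Return feature names, skipping any feature whose |r| with an
    already-kept feature exceeds the threshold (an assumed cutoff)."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical dataset: the second column is the first one in other units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 38],
}

selected = drop_correlated(features)
print(selected)  # ['height_cm', 'age']
```

Which member of a correlated pair is dropped depends on the order of the features here; in practice that choice is usually guided by domain knowledge or by each feature's correlation with the target variable.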

Measurement of Correlation:

The three different methods of measuring correlation between two variables are:

  1. Scatter Diagram
  2. Karl Pearson’s Coefficient of Correlation
  3. Spearman’s Rank Correlation Coefficient

    1. Scatter Diagram

    Scatter Diagram Method is a straightforward and visually appealing method of determining correlation by diagrammatically displaying the bivariate distribution of two variables. This method provides a clear understanding of the relationship between the variables to the analyst. It is a basic method of evaluating the association between two variables, as no numerical calculations are necessary.

    In a scatter plot of positively correlated X and Y variables, the points tend to move upward and to the right, starting from the lower left corner of the graph. For negatively correlated variables, the points tend to move downward and to the right.
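As a rough illustration of the scatter diagram idea, the sketch below draws points on a character grid so that it runs without any plotting library; a plotting library such as Matplotlib would be the usual choice in practice. The function name `text_scatter`, the grid size, and the data are all assumptions made for this example, and the code assumes that neither variable is constant.

```python
def text_scatter(x, y, width=30, height=10):
    """Render (x, y) points as '*' on a character grid.

    Assumes x and y each contain at least two distinct values.
    """
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((yi - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so larger y is drawn higher
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width

# Hypothetical, positively correlated data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 3, 5, 4, 6, 8, 7, 9]

plot = text_scatter(x, y)
print(plot)
```

The printed points run from the lower left to the upper right, which is the visual signature of a positive correlation.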

    2. Karl Pearson’s Coefficient of Correlation

    Pearson correlation is a parametric measure, whereas Spearman correlation is nonparametric. The differences between the two are outlined below.

    Pearson correlation: The Pearson correlation coefficient (r) is the most commonly used statistic for measuring the strength of the linear relationship between two variables. For instance, in stock market analysis, it can be used to determine the degree of association between two stocks. The formula for calculating the Pearson r correlation is as follows:

    $$r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}}$$

    r = Pearson r correlation coefficient
    N = number of observations
    ∑xy = sum of the products of paired scores
    ∑x = sum of x scores
    ∑y = sum of y scores
    ∑x² = sum of squared x scores
    ∑y² = sum of squared y scores
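The quantities defined above map directly onto code. The sketch below assumes the standard computational form of Pearson's r, in which the numerator is N∑xy − ∑x∑y and the denominator is the square root of [N∑x² − (∑x)²][N∑y² − (∑y)²]. The function name `pearson_r` and the stock prices are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from the raw sums N, sum(xy), sum(x), sum(y), sum(x^2), sum(y^2)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical closing prices of two stocks over five days.
stock_a = [10, 12, 11, 14, 15]
stock_b = [20, 23, 22, 27, 30]

r = pearson_r(stock_a, stock_b)
print(f"r = {r:.4f}")
```

The function raises a division error when either variable is constant, because the coefficient is undefined in that case. For real work, established routines such as `numpy.corrcoef` or `scipy.stats.pearsonr` are generally preferable to a hand-written version.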

    Key Points:

    • Pearson correlation is commonly used when the data is normally distributed, meaning it follows a bell-shaped curve around the mean.
    • Pearson correlation measures only the linear relationship between the variables, so it is most meaningful when that relationship is approximately linear, that is, when the variables change at a consistent rate relative to each other.
    • The strength of the association between variables is quantified using a correlation coefficient. This metric assigns a numerical value to the relationship between variables. Correlation coefficients range from -1 to 1. A value of 0 indicates no linear relationship between the variables, while a value of -1 or 1 represents a perfect negative or positive correlation, respectively.
    Strength of Relationship

    3. Spearman’s Rank Correlation Coefficient

    The Spearman rank correlation is a non-parametric method used to determine the correlation between two variables. Unlike Pearson correlation, it does not make any assumptions about the distribution of the data. It is best suited for use when the variables are measured on at least an ordinal scale.

    The following formula is used to calculate the Spearman rank correlation when there are no tied ranks:

    $$\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

    ρ = Spearman rank correlation
    d_i = the difference between the two ranks of each observation
    n = number of observations

    • The Spearman rank correlation method involves a loss of information as it is based on ranks.
    • It is commonly used to measure the monotonic relationship between variables, which means the variables tend to change together in a consistent direction, but not necessarily at a constant rate.
    • If the data contains outliers, i.e. values that are far from the others, the Spearman rank correlation coefficient is a suitable choice because ranks are less sensitive to extreme values.
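The rank-based calculation can be sketched as follows, assuming the standard no-ties form ρ = 1 − 6∑d_i² / (n(n² − 1)). The helper names `ranks` and `spearman_rho` and all data are hypothetical, and the code deliberately does not handle tied values, which need averaged ranks and a different formula.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
cubic = [1, 8, 27, 64, 125]        # monotonic but not linear
with_outlier = [2, 4, 6, 8, 1000]  # last value is far from the others
shuffled = [2, 1, 4, 3, 5]         # mostly increasing

rho_cubic = spearman_rho(x, cubic)
rho_outlier = spearman_rho(x, with_outlier)
rho_shuffled = spearman_rho(x, shuffled)

print(rho_cubic, rho_outlier, round(rho_shuffled, 3))
```

Both the cubic series and the series with an outlier give ρ = 1, because only the ordering of the values matters once they are converted to ranks. This illustrates the monotonic-relationship and outlier points above, as well as the loss of information that ranking involves.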

    Key Points:

    • When deciding whether to use Pearson or Spearman rank correlation, it is beneficial to examine a scatter plot.
    • For small sample sizes, it is recommended to use the Spearman rank correlation.
    • For large sample sizes, the Pearson correlation is preferred.
    • Pearson correlation has more statistical power when its assumptions are met.
    • Pearson correlation allows for better comparison of findings across studies as it is more commonly used.
    • In many cases, the difference between Pearson and Spearman correlation coefficients is minimal.

    Correlation analysis is a very powerful tool for examining relationships in data, but it should only be used when appropriate. To avoid potential issues, it is important to carefully consider the data, examine raw data by plotting it, look out for nonlinear relationships, outliers, and heteroscedasticity of data, and to consider the coefficient of determination instead of just the correlation coefficient.
