
Common mistakes in regression analysis

Regression analysis is a versatile method because of its ability to analyze a wide variety of problems, even with small datasets. Like everything else in this world, regression also has its weaknesses. The following are factors that hinder analysis using regression.

When regression breaks down

Have you ever had milk coffee and felt unwell afterward? If so, you may be lactose intolerant. People who are lactose intolerant are careful about what they consume and avoid milk and dairy products. The same caution applies to regression: if regression were a packaged food, its label would warn against using it under certain conditions. Here are some conditions where regression does not work well.

Nonlinearity

The first mistake is assuming linearity: regression only works well when the relationship in the data is linear. Regression can analyze the relationship between discounts and sales, or between promotions and the products we sell. But can it predict the resale price of a car we have already purchased? Not well. As we all know, a brand-new car starts losing value the moment we buy it, and it keeps depreciating for decades, until the point where it becomes an antique and selling it to a collector pushes its price back up. That U-shaped curve is not a straight line.

The most important thing here is that if the data is not linear, we should transform it, use a nonlinear model (such as polynomial regression), or divide the dataset into segments that can each be analyzed properly. If we force a linear regression onto nonlinear data, it will produce an equation that is badly inaccurate.
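The car-depreciation idea above can be sketched with synthetic numbers (the ages, prices, and coefficients below are all made up for illustration): a straight line fitted to a U-shaped curve explains almost nothing, while a quadratic fit captures it well.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical car-resale curve (in $1000s): value drops with age,
# then rises again once the car becomes an antique.
age = np.linspace(0, 60, 200)
value = 30 - 1.2 * age + 0.02 * age**2 + rng.normal(0, 1, 200)

# Fit a straight line and a quadratic to the same data.
lin = np.polyfit(age, value, 1)
quad = np.polyfit(age, value, 2)

def r2(y, yhat):
    """Coefficient of determination: share of variance explained."""
    return 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

r2_lin = r2(value, np.polyval(lin, age))
r2_quad = r2(value, np.polyval(quad, age))
print(f"linear R^2: {r2_lin:.2f}, quadratic R^2: {r2_quad:.2f}")
```

The linear fit is nearly flat because the early decline and the late "antique" rise cancel out, so its R² collapses; the quadratic fit matches the true shape.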

Multicollinearity

A regression mistake you might make is adding variables with similar effects. This is called multicollinearity: the variables you add to the analysis are highly correlated with each other, so the information they give the model overlaps.

This error may not hurt the predictive accuracy of your model much, but if your goal is to interpret the situation, it can be very misleading, because the model cannot distinguish the contribution of each variable.

For example, suppose a researcher is studying the effect of education level and income level on health. It turns out that education and income are highly correlated (people with higher education generally have higher incomes). This affects the model, because it cannot distinguish whether health improves due to education or due to income.

Correlation ≠ causation

A more serious regression mistake than the two above is assuming that everything related to an outcome in the analysis is a cause of it. This is not necessarily the case. In data science and statistics, the phrase "correlation ≠ causation" warns against exactly this mistake: not everything related to a phenomenon is its cause.

For example, if we take data on ice cream sales and the death rate of people swimming at a beach, we will find that as ice cream sales increase, the death rate also increases. If we assume that everything related is a cause without conducting further research, we will conclude that eating more ice cream drives up the death rate.

In fact, the main cause of both the increase in ice cream sales and the deaths at the beach is hot weather. When the weather is hot, people consume more ice cream, and beaches are more crowded, increasing the potential for drownings.

This false cause-and-effect relationship is just one example among many. A relationship like this, created by a hidden third variable, is called a spurious correlation.
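The confounding story above can be simulated (all numbers below are invented for illustration): let temperature drive both ice cream sales and drownings, and then check the correlation before and after controlling for temperature.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Hypothetical data: hot weather drives both variables.
temperature = rng.normal(25, 5, n)
ice_cream = 3.0 * temperature + rng.normal(0, 5, n)
drownings = 0.5 * temperature + rng.normal(0, 2, n)

# Raw correlation looks alarming...
raw = np.corrcoef(ice_cream, drownings)[0, 1]

# ...but largely vanishes once we control for temperature
# (partial correlation via residuals of the confounder regression).
def residualize(y, x):
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

partial = np.corrcoef(residualize(ice_cream, temperature),
                      residualize(drownings, temperature))[0, 1]
print(f"raw correlation: {raw:.2f}, after controlling for temperature: {partial:.2f}")
```

Once temperature is held fixed, ice cream and drownings have nothing left in common, which is the "further research" the text calls for.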

Reverse causality

The next regression mistake is reverse causality, which occurs when two variables really are related, but the direction of causation is the reverse of what we assume: we think A causes B, when in reality B causes A.

An example of reverse causality is when an Instagram celebrity with many followers tends to upload content more frequently. We quickly jump to the conclusion that by uploading content frequently, we can gain more followers. However, the opposite could also be true, as having many followers can motivate someone to upload content more frequently.

According to Katz (2006), understanding reverse causality sometimes only requires “common sense.” Because regression does not explain the direction of causality between variables, it must come from logic and experimentation.

Omitted variable bias

Omitted variable bias occurs when important variables are left out of the analysis (accidentally or deliberately). As a result, the research results become biased and misleading. This happens all around us: many clickbait news articles intentionally use controversial headlines to attract readers, such as "Breakfast in the Morning Guarantees Your Child Will Be the Class Champion!!" This ignores other variables such as teaching quality, learning motivation, parental support, and home environment. Breakfast matters, but it is no guarantee that a child will excel.

Why is omitted variable bias so disruptive? It is a source of endogeneity. When an important variable is left out of the model, its effect does not disappear; it gets absorbed into the error term. If the omitted variable is correlated with a variable we did include, the error term becomes correlated with that regressor, and the estimated coefficients no longer measure what we think they measure. Ultimately, this disrupts our ability to analyze and interpret the results of the research.
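The breakfast example can be made concrete with a synthetic dataset (the variable names and coefficients below are hypothetical): parental support drives both breakfast habits and grades, and dropping it from the regression inflates the apparent effect of breakfast.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical data: parental support drives both breakfast and grades.
support = rng.normal(0, 1, n)
breakfast = 0.8 * support + rng.normal(0, 1, n)
grades = 0.2 * breakfast + 1.0 * support + rng.normal(0, 1, n)  # true effect: 0.2

def fit(predictors, y):
    """OLS with intercept; returns [intercept, coef_1, coef_2, ...]."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

full = fit([breakfast, support], grades)   # correct model
biased = fit([breakfast], grades)          # support omitted

print(f"breakfast effect, full model:     {full[1]:.2f}")
print(f"breakfast effect, support omitted: {biased[1]:.2f}")
```

With support omitted, breakfast gets credit for the support it proxies, so its coefficient is biased well above the true 0.2.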

Extrapolation beyond the data

Extrapolation is a term in statistics for predicting values outside the range of the observed data. Regression works best when it is used within the range of the data it was fitted on, which is known as interpolation.

Suppose a regression model is fitted on a dataset containing the average income of working-age people from 22 to 40 years old, and in that range income rises steadily with age. If this model is forced to predict the income of a 100-year-old person, it will return a very large number. In reality, income tends to decrease after retirement age. A regression will therefore produce inaccurate results when used to predict outcomes far outside its dataset.
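A minimal sketch of this failure, with invented income figures: fit the 22–40 range, then ask the line about age 100. Inside the training range the prediction is sensible; far outside it, the model cheerfully extrapolates a figure that ignores retirement entirely.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical incomes for ages 22-40: roughly linear growth in this range.
age = np.arange(22, 41).astype(float)
income = 30_000 + 1_500 * (age - 22) + rng.normal(0, 1_000, len(age))

slope, intercept = np.polyfit(age, income, 1)
predict = lambda a: slope * a + intercept

# Interpolation (age 30) is reasonable; extrapolation (age 100) is not.
print(f"predicted income at age 30:  {predict(30):,.0f}")
print(f"predicted income at age 100: {predict(100):,.0f}")  # ignores retirement
```

The model has no way to know that the trend bends after retirement; it can only continue the line it was shown.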

Conclusion

Regression is widely used because it is easy to apply and remarkably versatile. Do you want to predict sales? Do you only have a small dataset to work with? Many data science and statistics problems can be attacked with regression.

Keep the limitations outlined above in mind, and I hope they will improve the results of your regression analysis.

Read also: T-distribution