
Statistics in data science

Data science and statistics are closely related: statistics is the foundation of data science. If you read the blog posts on my website, you will find that I have written a lot about the basics of statistics, not because I am particularly fond of the subject, but because it truly underpins data science. The two are like two sides of a coin, essentially inseparable. In this post, I will cover the fundamentals of statistics that you can use as a foundation for learning more about data science.

“Facts are stubborn things, but statistics are pliable.”
— Mark Twain

Describing the dataset

To describe a dataset properly to our clients, we must first look at how much data we have. If the dataset is small, it is very easy to describe. For example, say we have sales data for one week; as a dataset, it looks like this, and it is easy to answer which sales value is the highest. However, if the data we have is very large, we need other methods to present it.

sales = [100, 53, 78, 90, 21, 77, 89]

When conducting a data science project, the first thing you need to do is understand the characteristics of your data. For example: does your data contain missing values? What is the highest sales value this year? What is the lowest? Note that if your dataset contains missing values, you must perform data cleansing first.
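As a quick sketch of that first check, pandas can report missing values and extremes in a few lines; here I use a hypothetical weekly sales series with one value deliberately left missing:

```python
import pandas as pd

# hypothetical weekly sales, with one missing entry
sales = pd.Series([100, 53, 78, None, 21, 77, 89])

print("Missing values:", sales.isna().sum())  # 1
print("Highest sales:", sales.max())          # 100.0
print("Lowest sales:", sales.min())           # 21.0

# one simple cleansing option: drop the missing row
clean = sales.dropna()
print("Rows after cleansing:", len(clean))    # 6
```

Dropping rows is only one cleansing strategy; filling missing values with the mean or median is another common choice.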

If you already know the characteristics of your data, then it is time to visualize it. Data visualization is important because not all datasets are small; many have hundreds or even hundreds of thousands of rows, and it would be very difficult to understand that much data without visualizing it.

Now imagine that you have sales data alongside the advertising costs you have incurred. It would be very difficult to read such a large amount of data in raw form, before it has been visualized.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)

# generate six months of daily dates
dates = pd.date_range(start="2025-01-01", periods=180, freq="D")

# advertising cost (millions of rupiah), normally distributed, clipped to 5-30
iklan = np.random.normal(loc=15, scale=5, size=180)
iklan = np.clip(iklan, 5, 30)

# sales (millions of rupiah), driven by advertising plus noise
penjualan = (iklan * 4) + np.random.normal(0, 10, 180)

df = pd.DataFrame({
    "tanggal": dates,
    "iklan": iklan.round(2),
    "penjualan": penjualan.round(2)
})

print(df.head())

Once the data is organized in a DataFrame like this, we can analyze it much more easily. For example, if we run an advertisement with a budget of ten million rupiah, what is the estimated sales we can expect?
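One simple way to sketch such an estimate is to fit a straight line to the synthetic data generated above (reproduced here with the same seed so the block runs on its own) and evaluate it at a budget of 10 million; `np.polyfit` is one option among several:

```python
import numpy as np

# regenerate the synthetic data from above with the same seed
np.random.seed(42)
iklan = np.clip(np.random.normal(loc=15, scale=5, size=180), 5, 30)
penjualan = (iklan * 4) + np.random.normal(0, 10, 180)

# fit a straight line: penjualan ~= slope * iklan + intercept
slope, intercept = np.polyfit(iklan, penjualan, 1)

# estimated sales for a 10-million-rupiah advertising budget
estimate = slope * 10 + intercept
print(f"Estimated sales: {estimate:.1f} million")
```

Because the data was generated with a true slope of 4, the fitted slope should land close to that value, and the estimate near 40 million.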

Central tendencies

Measures of central tendency are values that indicate the center of a dataset. The mean, median, and mode are the most commonly used; all three describe where the data is concentrated, but each uses a different method.

The main purpose of knowing the central tendency is to understand the distribution of the data, which helps us determine how to process it further. Based on how these measures relate to each other, a distribution is commonly described as one of three types: symmetric (mean, median, and mode roughly coincide), right-skewed (mean greater than median), or left-skewed (mean less than median).

Note: To compute the mean, median, and mode yourself, run the code below. The code blocks in this article build on one another, so run them all in order to reproduce my results.

print("Central Tendency\n")

print("Mean:")
print(df[["iklan", "penjualan"]].mean())

print("\nMedian:")
print(df[["iklan", "penjualan"]].median())

print("\nMode:")
print(df[["iklan", "penjualan"]].mode())

Dispersion

Dispersion, also known as spread, is a measure of how far a dataset is spread out, or varies, from its central value, such as the mean. Dispersion helps us understand how data is distributed, which allows us to compare one dataset with another and to detect outliers, for example by computing the quartiles and flagging values that fall far outside them.

Dispersion helps determine the consistency of two or more datasets and tells you how much to trust the average. It is widely used in economics, business, and scientific research to analyze the variation in data.
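As a minimal sketch, the common dispersion measures (range, variance, standard deviation, and interquartile range) can be computed for the one-week sales figures from the beginning of the article:

```python
import numpy as np

sales = np.array([100, 53, 78, 90, 21, 77, 89])  # the one-week sales figures

print("Range:", sales.max() - sales.min())        # 79
print("Variance:", round(sales.var(ddof=1), 2))
print("Std deviation:", round(sales.std(ddof=1), 2))

# interquartile range: the spread of the middle 50% of the data
q1, q3 = np.percentile(sales, [25, 75])
print("IQR:", q3 - q1)                            # 24.5
```

Note the `ddof=1` argument, which gives the sample (rather than population) variance and standard deviation.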

Central tendency provides a single value (mean, median, mode). Dispersion provides an illustration of how the data is spread out from its central value. To understand how your data is spread out, you can use the code below.

plt.figure()
sns.boxplot(data=df[["iklan", "penjualan"]])
plt.title("Distribution of Advertising and Sales")
plt.show()

A commonly used visualization to illustrate the distribution of data is the box plot. In a box plot, the data is summarized by three quartiles: Q1, Q2 (the median), and Q3. In addition, you may see circles beyond the whiskers; these are outliers, values that fall far below Q1 or far above Q3 (conventionally, more than 1.5 times the interquartile range away).
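The 1.5-times-IQR rule behind those circles can be sketched directly; here a value of 250 is added to the weekly sales purely as an artificial outlier:

```python
import numpy as np

# one-week sales with 250 added purely as an artificial outlier
data = np.array([100, 53, 78, 90, 21, 77, 89, 250])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# anything outside the fences counts as an outlier
outliers = data[(data < lower) | (data > upper)]
print("Outliers:", outliers)  # 21 and 250 fall outside the fences
```

This is the same rule seaborn uses when deciding which points to draw individually in the box plot above.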

Correlation

The next statistical concept in data science is correlation. Correlation is a statistical measure that describes how closely two variables are related. Correlation is a commonly used method to show the relationship between variables without explaining the cause and effect.

In general, correlation is measured with the correlation coefficient, symbolized by r, which ranges from -1 to +1. A value close to 0 means there is little or no linear relationship; a positive value means the variables tend to move in the same direction, and a negative value means they move in opposite directions.
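The coefficient itself can be computed with pandas; the synthetic dataset from earlier is regenerated with the same seed here so the block runs on its own:

```python
import numpy as np
import pandas as pd

# regenerate the synthetic advertising/sales data with the same seed
np.random.seed(42)
iklan = np.clip(np.random.normal(loc=15, scale=5, size=180), 5, 30)
penjualan = (iklan * 4) + np.random.normal(0, 10, 180)

df = pd.DataFrame({"iklan": iklan, "penjualan": penjualan})

# Pearson's r between advertising cost and sales
r = df["iklan"].corr(df["penjualan"])
print(f"r = {r:.2f}")
```

`Series.corr` computes Pearson's r by default; pass `method="spearman"` for a rank-based alternative that is more robust to outliers.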

plt.figure()
sns.regplot(x="iklan", y="penjualan", data=df)
plt.title("Advertising Cost vs Sales (6 Months)")
plt.show()

The relationship between advertising costs and sales shows a positive correlation, as can be seen in the upward trend of the scatter plot above. If we calculate the correlation coefficient, we get a value of about 0.89. This suggests that higher advertising spend is associated with higher sales, although correlation alone does not prove that one causes the other.

Simpson's paradox

The next statistical concept in data science is Simpson's paradox. Cases of Simpson's paradox are actually quite rare, but I have included it here so it can serve as learning material for fellow practitioners. Simpson's paradox is a statistical phenomenon where a trend visible in the combined data disappears, or even reverses, when the data is broken down into groups. If not analyzed properly, it can lead us to the wrong decisions.

For example, there is a company that conducts two types of campaigns for several of its products. The company conducts campaign A and campaign B for several months, and the campaign results are as follows.

Campaign | Total Purchases | Total Visitors | Conversion Rate
A        | 320             | 1,200          | 26.7%
B        | 270             | 1,900          | 14.2%

If we only look at this, we will conclude that campaign A is much more effective than campaign B. However, if we analyze it further, we will get the following results.

Campaign | Purchases | Visitors | Conversion Rate
A        | 300       | 1,000    | 30%
B        | 90        | 400      | 22.5%
Table of cheap product sales

Campaign | Purchases | Visitors | Conversion Rate
A        | 20        | 200      | 10%
B        | 180       | 1,500    | 12%
Table of expensive product sales

Looking at the breakdown, campaign B is actually more effective than campaign A for expensive products (12% versus 10%), even though campaign A looks far better in the combined data. From this, we can conclude that campaign A is effective for cheap products, while campaign B is more effective for expensive products. If we are not careful when analyzing data, we can make mistakes in decision making.
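We can verify the conversion rates in code by recomputing them from the tables above:

```python
# (purchases, visitors) from the tables above
cheap = {"A": (300, 1000), "B": (90, 400)}
expensive = {"A": (20, 200), "B": (180, 1500)}

for camp in ("A", "B"):
    p_cheap, v_cheap = cheap[camp]
    p_exp, v_exp = expensive[camp]
    combined = (p_cheap + p_exp) / (v_cheap + v_exp)
    print(f"Campaign {camp}: cheap {p_cheap / v_cheap:.1%}, "
          f"expensive {p_exp / v_exp:.1%}, combined {combined:.1%}")
```

Running this confirms the paradox: campaign A wins in the combined data (26.7% vs 14.2%) and on cheap products (30% vs 22.5%), yet campaign B wins on expensive products (12% vs 10%).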

Correlation and causation

You may have heard the phrase "correlation is not causation"; it comes up constantly in data analysis. When you analyze data for a particular case, you will surely find patterns, for example an increase in sales that coincides with an increase in advertising costs. However, in data science there is one rule to keep in mind: variables that move together are not necessarily causing one another. A third factor could be driving both.

For example, if I assume that the increase in visitors to my website is a result of me writing more frequently on my blog, then I will increase the number of posts on my blog rather than working on projects and updating them here. However, there is another possibility, which is that data science is currently trending and there are still few people discussing it, so my website has become a reference for learning data science.

One way to verify whether a relationship is causal is to conduct a randomized trial. This method works by randomly dividing users into two groups and giving each group a different treatment; comparing the outcomes tells you whether the treatment itself makes a difference.
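A minimal sketch of such a random split, assuming 1,000 hypothetical users assigned to the two groups by a coin flip:

```python
import numpy as np

np.random.seed(0)
n_users = 1000  # hypothetical number of users

# True -> treatment group, False -> control group
assignment = np.random.rand(n_users) < 0.5

print("Treatment group:", assignment.sum())
print("Control group:", (~assignment).sum())
```

Random assignment is what makes the comparison fair: any third factor is, on average, spread evenly across both groups, so a difference in outcomes can be attributed to the treatment.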

That concludes our discussion on statistics in data science. We hope this reading material can serve as a reference for you to learn more about data science in the future.

Also read: Inferential Statistics